Good machine learning research starts with an exceptional dataset. There is no need to spend your evening crafting your own set of data in MySQL or, god forbid, Excel. Almost anything, from COVID-19 stats to Harry Potter spells (made it myself!), already exists in the form of a dataset. You just need to find it.
Let me help you – in this post, you will learn where to find datasets for machine learning research.
Top general ML dataset aggregators
Dataset aggregators collect thousands of databases for various purposes.
1. Kaggle
Kaggle, updated by enthusiasts every day, has one of the largest dataset libraries online.
Kaggle is a community-driven machine learning platform. It contains plenty of tutorials that cover hundreds of different real-life ML problems. The quality of the datasets varies, but all the data is completely free. You can also upload your own dataset there.
2. Google Dataset Search
Dataset Search is a reliable source of information for your research. It is convenient to sort datasets by:
- file format,
- license type,
- time of last update.
The datasets here come from organizations such as the World Health Organization, Statista, and Harvard.
3. Registry of Open Data on AWS
In the Registry of Open Data on AWS, anyone can share a dataset or find the one they need. You can analyze the data you find with the help of Amazon data analytics tools. Among the dataset providers, you will find Facebook Data for Good, NASA Space Act Agreement, and Space Telescope Science Institute.
4. Microsoft Azure Public Datasets
Azure Public Datasets have regularly updated databases for app developers and researchers. They contain U.S. Government data, other statistical and scientific data, and online service information that Microsoft collects about its users.
Moreover, Azure offers a collection of tools that help you create cloud databases of your own, migrate your SQL workloads to Azure while maintaining complete SQL Server compatibility, and build data-driven mobile and web applications.
5. Datasets subreddit
In the datasets subreddit, anyone can publish their open-source datasets. You can go there, find a cool dataset, and try to do something nice with it.
6. UCI Machine Learning Repository
UCI offers 507 datasets that cover bank marketing, car evaluation, lung cancer diagnosis, and many other different subjects. You can sort the databases by:
- default task,
- data type,
- area of application.
7. CMU Libraries
Carnegie Mellon University has its own collection of public datasets that you can use for your own research. There you will find insightful databases about American culture, music, and history that other aggregators don’t provide.
8. Awesome Public Datasets on GitHub
This is a great open-source collection of the best datasets available online, divided by industry. I will mention some of the datasets you can find there later in this post.
Best public datasets for machine learning and data science
Domain-specific databases for real machine learning enthusiasts.
Before you change the world with your ML research, it can be fun just to practice. Here are some datasets that you can use for exploratory analysis. Exploratory analysis is the practice of studying data to find patterns and anomalies and then using this information to build ML models.
- Million Song Dataset can be used for exploratory analysis and building recommender systems. The full dataset is 280 GB, but for test research, you can also download a smaller version of just 10,000 songs, which is around 2 GB.
- Game of Thrones dataset by Myles O’Neill on Kaggle will interest you if you’re a fan of George R.R. Martin’s A Song of Ice and Fire book series. It explores the deaths and battles of this fantasy world.
- LEGO Database by Rachael Tatman describes all the official LEGO parts/sets, their colors, and inventories.
- UFO Sightings by the National UFO Reporting Center contains reports of unidentified flying object sightings over the last century.
- World University Rankings by Myles O’Neill covers the world’s top universities and provides information about their rank for quality of education, alumni employment, influence, and other factors.
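To make the idea of exploratory analysis more concrete, here is a minimal sketch in Python that summarizes a small series of numbers and flags values that look anomalous. The yearly counts, the helper names `summarize` and `find_anomalies`, and the threshold of two standard deviations are all assumptions made up for illustration; none of them come from the datasets listed above.

```python
from statistics import mean, median, stdev

# Hypothetical toy data: yearly sighting counts (not taken from any real dataset).
sightings_per_year = {
    2010: 41, 2011: 38, 2012: 45, 2013: 40,
    2014: 43, 2015: 39, 2016: 120, 2017: 42,
}

def summarize(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": mean(values),
        "median": median(values),
        "stdev": stdev(values),
    }

def find_anomalies(series, threshold=2.0):
    """Return keys whose value is more than `threshold` sample standard
    deviations away from the mean of the series."""
    values = list(series.values())
    mu, sigma = mean(values), stdev(values)
    return [key for key, value in series.items()
            if sigma > 0 and abs(value - mu) / sigma > threshold]

summary = summarize(list(sightings_per_year.values()))
anomalies = find_anomalies(sightings_per_year)
print(summary)
print("Years that look anomalous:", anomalies)
```

On a real dataset, you would probably load the values from a CSV file first and look at more than one column, but the steps stay the same: summarize, look for outliers, then decide what is worth modeling.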
Deep learning is based on using artificial neural networks to solve tasks. Rather than writing an algorithm for the task, the programmer uses representation learning and allows the machine to make predictions by itself.
Image processing and object recognition for computer vision
- Google’s Open Images Dataset is very diverse and contains complex samples with several objects per image. It contains object bounding boxes, object segmentation, and labels to help you find your way around more than 9 million images.
- VisualData is an aggregator of computer vision datasets where you can find medical datasets for machine learning, image datasets, and other cool machine learning data samples for business, educational, and other types of ML research.
- xView is one of the largest publicly available storages of overhead imagery. It contains images from complex scenes around the world, annotated using bounding boxes.
- If you are looking for a quality large-scale deep learning dataset, pay attention to Kinetics-700. It has video clips of different human-object and human-human interactions divided into classes.
- ImageNet is a set of images for deep learning in computer vision, with more than 1,000 different classes organized according to the WordNet hierarchy.
- Visual QA contains open-ended questions about 265,016 images. It can be used for a better understanding of computer vision modeling and language processing.
- The MNIST database is a collection of samples for handwritten digit recognition. It contains a training set of 60,000 examples and a test set of 10,000. On the website, you will also find a table that compares the effectiveness of different types of classifiers applied to this dataset. Even a beginner can use MNIST to train their deep learning model.
- CIFAR-10 is a collection of images for training deep learning computer vision algorithms. The dataset consists of 60,000 32x32 color images in 10 classes, with 6,000 images in each class. If this is not enough, try the CIFAR-100 dataset.
- COCO is a regularly updated dataset for object segmentation and recognition in context, sponsored by Microsoft, Facebook, and Mighty AI.
- Labeled Faces in the Wild is a dataset for training and testing face recognition models.
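Before downloading an image dataset, it can help to estimate how much memory the raw pixels will take. The sketch below does this for the CIFAR-10 figures quoted above (60,000 images of 32x32 pixels in 10 classes). It assumes 3 color channels at 1 byte each, which the text does not state, and the function name `raw_size_bytes` is made up for this example.

```python
# Figures quoted in the text for CIFAR-10.
NUM_IMAGES = 60_000
NUM_CLASSES = 10
IMAGES_PER_CLASS = 6_000
HEIGHT, WIDTH = 32, 32

# Assumption (not stated in the text): 3 color channels, 1 byte per channel.
CHANNELS = 3
BYTES_PER_VALUE = 1

def raw_size_bytes(num_images, height, width, channels, bytes_per_value):
    """Size of the uncompressed pixel data, ignoring labels and file headers."""
    return num_images * height * width * channels * bytes_per_value

# Sanity check: the per-class counts should add up to the total.
assert NUM_CLASSES * IMAGES_PER_CLASS == NUM_IMAGES

total_bytes = raw_size_bytes(NUM_IMAGES, HEIGHT, WIDTH, CHANNELS, BYTES_PER_VALUE)
print(f"Raw pixel data: {total_bytes / 2**20:.1f} MiB")
```

The same function works for any fixed-size image dataset. Actual downloads are usually a different size because of compression, labels, and metadata.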
Natural language processing, text-to-speech, and speech generation
Making robots and voice interfaces is impossible without speech corpora. Use these datasets to build your solutions.
- VoxCeleb is an audio collection that you can use for deep learning tasks such as real-time natural language processing, voice recognition, and speech generation.
- LibriSpeech contains about 1,000 hours of 16 kHz read English speech derived from audiobooks.
- Free Spoken Digit Dataset consists of spoken digit recordings at 8 kHz, trimmed so that they have near-minimal silence at the beginnings and ends. The dataset is open-source.
- Common Voice is an initiative by Mozilla that contains hundreds of thousands of recordings of human voices. Every visitor to the Common Voice website can contribute to this open speech database by recording their own voice.
Check out the post by Christopher Dossman on Medium for more audio datasets of different kinds (it even has an Arabic corpus!).
- WordNet is a lexical database that contains all parts of speech grouped into sets of synonyms. Such a structure makes it a fantastic tool for natural language processing and linguistic research.
- 20 Newsgroups is a dataset that consists of 18,000+ text documents from 20 different newsgroups including sports, technology, art, entertainment, etc.
- Sentiment140 is a dataset of tweets that can be used for sentiment analysis or TTS.
- On IMDB Reviews, you will find 50,000+ raw and preprocessed movie reviews for sentiment analysis with deep learning.
- Yelp Reviews contains user reviews, business information, and images that you can use for personal and academic purposes.
- The Wikipedia Corpus is a huge set of written English texts: more than 4.5 million articles.
- If you are looking for a segmented text corpus where samples are grouped by the age of the writers, use The Blog Authorship Corpus. It contains posts by around 20,000 bloggers collected from blogger.com in 2004.
Other video and audio databases for deep learning
- YouTube 8M has more than 6 million videos, human-verified labels, and about 2.6 billion audio and visual features.
- AudioSet by Google offers millions of labeled 10-second sound clips selected from YouTube videos.
- On FSD (Freesound Dataset), you will find a multitude of sound samples ranging from human and animal sounds to music and mechanical noise.
- Free Music Archive is a dataset for music analysis.
Recommendation systems are vital for e-commerce businesses since they help to provide personalized experiences to customers.
- Amazon Product Data contains metadata and reviews on millions of items sold on Amazon. This is an incredible resource for anyone interested in recommender systems.
- MovieLens is a website that provides personalized movie recommendations to its users. They also have an open-source dataset you can use to train your model.
- Jester Collaborative Filtering Dataset has more than 4 million ratings of 100 jokes from 73,421 users. Laugh your socks off while doing your ML research.
For more niche recommender systems datasets, visit Shuai Zhang’s blog.
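To show how a ratings dataset like the ones above might feed a recommender, here is a minimal sketch of user-based collaborative filtering in Python. It is only one of many possible approaches. The toy ratings, the user and item names, and the helpers `cosine_similarity` and `predict` are all hypothetical and are not taken from any of the listed datasets.

```python
from math import sqrt

# Hypothetical toy ratings: user -> {item: rating}. Not taken from any real dataset.
ratings = {
    "ann":  {"joke_1": 5.0, "joke_2": 3.0, "joke_3": 4.0},
    "bob":  {"joke_1": 4.0, "joke_2": 2.0, "joke_3": 5.0, "joke_4": 4.0},
    "cara": {"joke_1": 1.0, "joke_2": 5.0, "joke_4": 2.0},
}

def cosine_similarity(a, b):
    """Cosine similarity computed over the items both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)

def predict(user, item, ratings):
    """Similarity-weighted average of other users' ratings for `item`.

    Returns None when no other user has rated the item.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        weight = cosine_similarity(ratings[user], other_ratings)
        weighted_sum += weight * other_ratings[item]
        total_weight += weight
    return weighted_sum / total_weight if total_weight else None

prediction = predict("ann", "joke_4", ratings)
print(f"Predicted rating of joke_4 for ann: {prediction:.2f}")
```

This sketch assumes strictly positive ratings. With a scale that includes zero or negative values, the weighting would need to change, for example by centering each user's ratings and dividing by the sum of absolute weights.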
It’s impossible to cover every area where ML can be successfully applied. But I’ve collected some examples below to give you some ideas.
- MIMIC-III is an open, anonymized dataset of health data from more than 40,000 critical care patients. The covered parameters include demographics, vital signs, laboratory tests, and medication intake.
- Google-Landmarks can be applied to landmark recognition and retrieval.
- Building AI software can be a useful way to understand the stock market. EOD Stock Prices stores historical data about end-of-day stock prices, dividends, and splits for US stocks.
- Boston Housing Dataset contains data about housing in the area of Boston, Massachusetts.
- Restaurants Health Score in San Francisco, developed by the local Health Department, provides interesting material for researchers interested in public health and the restaurant business.
- For information about home prices and rents by size, type, and tier in the USA, visit the Zillow Real Estate Research website.
- The World Bank Global Education Statistics Dataset contains data about 4,000+ internationally comparable indicators for education access and progress.
- Quandl is the resource to go to if you are looking for financial and economic datasets for investment professionals.
There are so many datasets that the opportunities for ML research are truly endless. Explore Kaggle, Google Dataset Search, and other resources from the list to find what intrigues you. And check out the artificial intelligence section of our blog for more awesome materials.