Freely available Large Datasets to try out Hadoop

Tired of searching for large datasets to practice Hadoop programming?

Though there are plenty of datasets available online, most of them require one or more of the following: signing up for an account, accessing the data programmatically (making API calls), writing scripts to combine smaller datasets, or even paying money.

Instead of spending our valuable time looking for datasets, that time would be better spent getting hands-on with Hadoop programming. So, here is a list of ready-to-use, publicly available large datasets in different file formats for us to get hands-on with Hadoop:

Dataset: NYC Taxi Trips
Description: This dataset consists of 1,048,576 NYC taxi trip records of yellow taxis for the month of January 2016 collected by NYC’s Taxi and Limousine Commission. Trip records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts. Detailed information about this dataset can be accessed at Trip Record Data.
Download URL:
File Type: CSV
File Size: 1.6 GB
Sample Hadoop Use Cases:
1) Which location had the most pickups by yellow taxis in January 2016?
2) Which day of the week had the most trips by yellow taxis in January 2016?
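The second use case boils down to a classic map/reduce: map each trip record to its pickup day of week, then count per key. Here is a minimal local sketch of that logic in Python (the same mapper could run under Hadoop Streaming), assuming the pickup timestamp is the second CSV column as in the 2016 TLC yellow-taxi schema; the sample records below are made up for illustration:

```python
from collections import Counter
from datetime import datetime

def map_trip(line):
    """Map one CSV trip record to its pickup day of week.
    Assumes the pickup datetime is the 2nd column (2016 TLC schema)."""
    fields = line.split(",")
    dt = datetime.strptime(fields[1], "%Y-%m-%d %H:%M:%S")
    return dt.strftime("%A")  # e.g. "Friday"

def reduce_counts(days):
    """Reduce step: count trips per day of week."""
    return Counter(days)

# Local simulation of the map and reduce phases on fabricated sample rows
sample = [
    "2,2016-01-01 00:00:00,2016-01-01 00:07:55,1",
    "2,2016-01-02 10:15:00,2016-01-02 10:30:12,1",
    "1,2016-01-08 09:00:00,2016-01-08 09:20:45,2",
]
counts = reduce_counts(map_trip(line) for line in sample)
most_common_day, trips = counts.most_common(1)[0]
print(most_common_day, trips)
```

On the full dataset, Hadoop would shuffle the mapper's day-of-week keys to reducers, which sum the counts exactly as `Counter` does here.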


Dataset: Gutenberg Dataset
Description: The Gutenberg dataset is a small subset of the Project Gutenberg corpus, with a collection of 3,036 English books written by 142 authors. Detailed information about this dataset can be accessed at Gutenberg Dataset.
Download URL:
File Type: TXT
File Size: Compressed Zip: 440 MB
Uncompressed TXT: 1.12 GB
Sample Hadoop Use Cases:
1) What are the top 10 most frequently used words in the Gutenberg corpus?
2) What is the total word count of a book in the Gutenberg corpus?
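The first use case is the canonical Hadoop WordCount. A minimal Python sketch of the mapper and reducer logic, runnable locally on a couple of sample lines (the tokenization rule below is one simple choice, not the only one):

```python
import re
from collections import Counter

def map_words(line):
    """Mapper: emit lowercase word tokens from one line of text."""
    return re.findall(r"[a-z']+", line.lower())

def top_words(lines, n=10):
    """Reducer side: aggregate counts and return the n most frequent words."""
    counts = Counter()
    for line in lines:
        counts.update(map_words(line))
    return counts.most_common(n)

# Sample input standing in for lines of a Gutenberg book
sample = [
    "It was the best of times, it was the worst of times,",
    "it was the age of wisdom, it was the age of foolishness,",
]
print(top_words(sample, 3))
```

In a real job, each mapper would emit `(word, 1)` pairs, Hadoop would shuffle them by word, and reducers would sum the counts; `Counter` collapses those two steps for the local sketch.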


Dataset: Stack Overflow Posts
Description: Stack Overflow is an online question-and-answer forum for computer programming with a user base of 6.7 million programmers. The Stack Overflow posts dataset contains 32,209,816 posts (around 32 million) made from 2008 through 2016. Records consist of fields like Post Type, Accepted Answer, Creation Date, Score, View Count, Post Contents (Body), Owner, Answer Count, Comment Count, Last Edited Date, Last Activity Date, Favorite Count, etc.
Download URL:
File Type: XML
File Size: Compressed 7z archive: 9.6 GB
Uncompressed Posts.xml: 45.6 GB
Sample Hadoop Use Cases:
1) What are the top 10 most answered questions in Stack Overflow posts?
2) What’s the percentage of Stack Overflow questions that went unanswered in 2015?
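For the second use case, each mapper can parse one `<row .../>` element from Posts.xml and flag whether a 2015 question was answered; the reducer then computes the percentage. A local sketch in Python, assuming the attribute names from the public Stack Exchange data-dump schema (`PostTypeId="1"` marks a question) and fabricated sample rows:

```python
import xml.etree.ElementTree as ET

def map_post(line):
    """Mapper: parse one <row .../> line from Posts.xml and return True/False
    (answered or not) for questions created in 2015, else None.
    Attribute names follow the Stack Exchange data-dump schema."""
    line = line.strip()
    if not line.startswith("<row"):
        return None
    attrs = ET.fromstring(line).attrib
    if attrs.get("PostTypeId") != "1":               # 1 = question
        return None
    if not attrs.get("CreationDate", "").startswith("2015"):
        return None
    return int(attrs.get("AnswerCount", "0")) > 0

def pct_unanswered(lines):
    """Reducer side: percentage of 2015 questions with zero answers."""
    flags = [f for f in (map_post(l) for l in lines) if f is not None]
    unanswered = sum(1 for f in flags if not f)
    return 100.0 * unanswered / len(flags)

# Fabricated sample rows in the Posts.xml format
sample = [
    '<row Id="1" PostTypeId="1" CreationDate="2015-03-01T10:00:00" AnswerCount="2" />',
    '<row Id="2" PostTypeId="1" CreationDate="2015-06-15T08:30:00" AnswerCount="0" />',
    '<row Id="3" PostTypeId="2" CreationDate="2015-06-15T09:00:00" />',
    '<row Id="4" PostTypeId="1" CreationDate="2014-01-01T00:00:00" AnswerCount="1" />',
]
print(pct_unanswered(sample))
```

Filtering by `PostTypeId` and year in the mapper keeps the shuffle small, which matters at the 45.6 GB scale of the uncompressed file.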


Dataset: Chemicals in Cosmetics
Description: The Chemicals in Cosmetics dataset is from the California Safe Cosmetics Program (CSCP) in the California Department of Public Health. It contains information on hazardous and potentially hazardous ingredients in cosmetic products sold in California. Records consist of the following fields: label names of cosmetic/personal care products, company/manufacturer names, product brand names, product categories, Chemical Abstracts Service registry numbers (CAS#) of the reported chemical ingredients, names of reported chemical ingredients, the number of reported chemicals for each product, and dates of reporting, product discontinuation, or reformulation, if applicable. Detailed information about this dataset can be accessed at
Download URL:
File Type: JSON
File Size: 34.3 MB
Sample Hadoop Use Cases:
1) Which cosmetic product has the most reported chemicals?
2) What are the top 5 chemical ingredients most frequently reported in cosmetic products?
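Since this file is small (34.3 MB), the first use case is a simple max-by-key aggregation over JSON records. A minimal sketch in Python, where the field names `ProductName` and `ChemicalCount` are assumptions based on the dataset description (check the actual JSON schema), and the sample products are made up:

```python
import json

def max_chemicals(records):
    """Return (product name, chemical count) for the product with the most
    reported chemicals. Field names ProductName / ChemicalCount are
    assumptions inferred from the dataset description."""
    best = max(records, key=lambda r: int(r["ChemicalCount"]))
    return best["ProductName"], int(best["ChemicalCount"])

# Fabricated sample records for illustration
sample = json.loads("""[
  {"ProductName": "Hypothetical Shampoo",     "ChemicalCount": "1"},
  {"ProductName": "Hypothetical Nail Polish", "ChemicalCount": "3"},
  {"ProductName": "Hypothetical Lipstick",    "ChemicalCount": "2"}
]""")
print(max_chemicals(sample))
```

In a MapReduce job, mappers would emit `(product, count)` pairs and a single reducer (or a final pass over per-reducer maxima) would keep the largest.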


Dataset: NASA-HTTP Web Server Log
Description: The NASA-HTTP Web Server Log dataset contains all HTTP requests to the NASA Kennedy Space Center WWW server in Florida for the month of July 1995. Records consist of the following fields: the host making the request, a timestamp in the format “DAY MON DD HH:MM:SS YYYY”, the request, the HTTP reply code, and the number of bytes in the reply. Detailed information about this dataset can be accessed at
Download URL:
File Type: ASCII
File Size: Compressed GZ archive: 19.7 MB
Uncompressed ASCII: 205.2 MB
Sample Hadoop Use Cases:
1) What is the most frequently accessed page of the NASA-HTTP web server for the month of July 1995?
2) At what hour of the day did the NASA-HTTP web server receive the most hits in July 1995?


We hope the datasets listed in this post are helpful for practicing Hadoop programming. If you have any questions or comments regarding this blog post, or would like to share a publicly available, ready-to-use dataset for Hadoop, please feel free to post it in the comments section below.

At ByteQuest, we are planning to offer face-to-face (in-person) Big Data training courses in the Bay Area, CA. If you are interested in enrolling, please click here to learn more. If you would like to receive our latest posts and updates on big data training directly in your email inbox, please subscribe. If you have any questions or suggestions for us, please feel free to contact us.

About Sivagami Ramiah

Sivagami Ramiah is the founder and primary instructor at ByteQuest, the Big Data Training Institution, which stemmed from her passion for teaching Big Data and Machine Learning. She has 20 years of experience in software application development, the majority of which was spent leading an enterprise application development team. As part of the Mining Massive Data Sets Graduate Certificate Program at Stanford University, she had the opportunity to work on projects in Machine Learning and Social Network Analysis. In addition to being Chief Instructor at ByteQuest, she is currently consulting for corporate clients on building end-to-end Industrial Internet of Things (IIoT) solutions. She enjoys speaking at tech meetups. In her spare time, she loves applying Machine Learning algorithms to Kaggle open datasets.
