Freely available Large Datasets to try out Hadoop
Tired of searching for large datasets to practice Hadoop programming?
Though there are plenty of datasets available online, most of them require one or more of the following: signing up for an account, accessing the data programmatically (making API calls), writing scripts to combine smaller datasets, or even paying money.
Time spent hunting for datasets is better spent getting hands-on with Hadoop programming. So here is a list of ready-to-use, publicly available large datasets in different file formats for us to practice with:
Dataset: NYC Taxi Trips |
---|
Description: This dataset consists of 1,048,576 NYC taxi trip records of yellow taxis for the month of January 2016 collected by NYC’s Taxi and Limousine Commission. Trip records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts. Detailed information about this dataset can be accessed at Trip Record Data. |
Download URL: https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2016-01.csv |
File Type: CSV |
File Size: 1.6 GB |
Sample Hadoop Use Cases: 1) Which location had the most yellow taxi pickups in January 2016? 2) Which day of the week had the most yellow taxi trips in January 2016? |
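
The day-of-week question above is a classic map-and-count job. Here is a minimal sketch in Python, written in the style of a Hadoop Streaming mapper and reducer but run on in-memory lines rather than on a cluster. It assumes the pickup timestamp is the second CSV column and looks like `2016-01-01 00:00:00`; check this against the file's header row before relying on it. The function names are illustrative and not part of any Hadoop API.

```python
import csv
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Assumption: the pickup timestamp is the second column (index 1).
PICKUP_COLUMN = 1


def map_pickup_weekday(lines):
    """Mapper: emit (weekday name, 1) for every parseable trip record."""
    for row in csv.reader(lines):
        try:
            pickup = datetime.strptime(row[PICKUP_COLUMN], "%Y-%m-%d %H:%M:%S")
        except (IndexError, ValueError):
            continue  # header row, blank line, or malformed record
        yield WEEKDAYS[pickup.weekday()], 1


def reduce_counts(pairs):
    """Reducer: sum the emitted counts per key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals


# Made-up sample rows, only to show the call pattern.
sample_rows = [
    "VendorID,tpep_pickup_datetime,tpep_dropoff_datetime",
    "2,2016-01-01 00:00:00,2016-01-01 00:10:00",
    "1,2016-01-02 12:00:00,2016-01-02 12:20:00",
]
print(reduce_counts(map_pickup_weekday(sample_rows)).most_common(1))
```

Skipping rows that fail to parse also takes care of the header line, which matters on a cluster because only one input split contains it.
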
Dataset: Gutenberg Dataset |
---|
Description: The Gutenberg dataset is a small subset of the Project Gutenberg corpus, a collection of 3,036 English books written by 142 authors. Detailed information about this dataset can be accessed at Gutenberg Dataset. |
Download URL: https://drive.google.com/file/d/0B2Mzhc7popBga2RkcWZNcjlRTGM/edit |
File Type: TXT |
File Size: Compressed ZIP: 440 MB; uncompressed TXT: 1.12 GB |
Sample Hadoop Use Cases: 1) What are the 10 most frequently used words in the Gutenberg corpus? 2) What is the total word count of a given book in the Gutenberg corpus? |
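
The top-10 words question is the familiar word count example with a final ranking step. The following is a rough sketch in Python of that mapper and reducer logic, working on lines of text held in memory. The tokenization rule (lowercase letters with an optional internal apostrophe) is our own simplifying assumption, and the function names are made up for illustration.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of letters, optionally with one
# internal apostrophe (e.g. "don't"), compared case-insensitively.
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def map_words(lines):
    """Mapper: emit (word, 1) for every word on every line."""
    for line in lines:
        for word in WORD_RE.findall(line.lower()):
            yield word, 1


def top_words(pairs, n=10):
    """Reducer: sum counts per word and return the n most frequent."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return totals.most_common(n)


# Made-up sample text, only to show the call pattern.
sample_text = ["The cat and the hat.", "THE END"]
print(top_words(map_words(sample_text), n=3))
```

On a real cluster the ranking would usually run as a second, small job over the per-word totals, since no single reducer sees every word.
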
Dataset: Stack Overflow Posts |
---|
Description: Stack Overflow is an online question-and-answer forum for computer programming with a user base of 6.7 million programmers. The Stack Overflow posts dataset contains 32,209,816 posts (around 32 million) made from 2008 to 2016. Records consist of fields such as Post Type, Accepted Answer, Creation Date, Score, View Count, Post Contents (Body), Owner, Answer Count, Comment Count, Last Edited Date, Last Activity Date, and Favorite Count. |
Download URL: https://archive.org/download/stackexchange/stackoverflow.com-Posts.7z |
File Type: XML |
File Size: Compressed 7z archive: 9.6 GB; uncompressed Posts.xml: 45.6 GB |
Sample Hadoop Use Cases: 1) What are the top 10 most answered questions in Stack Overflow posts? 2) What’s the percentage of Stack Overflow questions that went unanswered in 2015? |
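
For the unanswered-questions use case, one possible approach is to treat every line of Posts.xml as a record. The sketch below, in Python, assumes each post sits on its own line as a self-closing `<row ... />` element with attributes named `PostTypeId` (where `1` marks a question), `CreationDate`, and `AnswerCount`; confirm these against the dump before using it. It also takes "unanswered" to mean an answer count of zero, which is our assumption. The function names are illustrative.

```python
import xml.etree.ElementTree as ET


def parse_row(line):
    """Return the attributes of one <row .../> line, or None if not a row."""
    line = line.strip()
    if not line.startswith("<row"):
        return None  # XML declaration, <posts> wrapper, or blank line
    try:
        return ET.fromstring(line).attrib
    except ET.ParseError:
        return None


def unanswered_share(lines, year="2015"):
    """Percentage of questions created in `year` that have zero answers."""
    questions = 0
    unanswered = 0
    for line in lines:
        attrs = parse_row(line)
        if attrs is None or attrs.get("PostTypeId") != "1":
            continue  # keep questions only
        if not attrs.get("CreationDate", "").startswith(year):
            continue
        questions += 1
        if int(attrs.get("AnswerCount", "0")) == 0:
            unanswered += 1
    return 100.0 * unanswered / questions if questions else 0.0
```

In a MapReduce job the loop body would live in the mapper, emitting one counter for questions and one for unanswered questions, with the division done after the reduce step.
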
Dataset: Chemicals in Cosmetics |
---|
Description: The Chemicals in Cosmetics dataset is from the California Safe Cosmetics Program (CSCP) in the California Department of Public Health. It contains information on hazardous and potentially hazardous ingredients in cosmetic products sold in California. Records consist of the following fields: label names of cosmetic/personal care products, company/manufacturer names, product brand names, product categories, Chemical Abstracts Service registry numbers (CAS#) of the reported chemical ingredients, names of reported chemical ingredients, the number of reported chemicals for each product, and dates of reporting, product discontinuation or reformulation if applicable. Detailed information about this dataset can be accessed at https://www.healthdata.gov/dataset/chemicals-cosmetics |
Download URL: https://chhs.data.ca.gov/api/views/7kri-yb7t/rows.json?accessType=DOWNLOAD |
File Type: JSON |
File Size: 34.3 MB |
Sample Hadoop Use Cases: 1) Which cosmetic product has the most reported chemicals? 2) Which 5 chemical ingredients are reported most often in cosmetic products? |
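
For the top-5 ingredients question, here is a hedged sketch in Python of counting the values of one column. It assumes the download is a single JSON document with column definitions under `meta.view.columns` (each with a `name`) and the records under `data` as a list of lists in the same column order. That layout, and the column name `ChemicalName` used in the sample, are assumptions to verify against the actual file.

```python
import json
from collections import Counter


def top_values(document, column_name, n=5):
    """Count the values of one column and return the n most frequent."""
    # Assumed layout: meta.view.columns lists the columns, data holds rows.
    columns = [col["name"] for col in document["meta"]["view"]["columns"]]
    index = columns.index(column_name)  # raises ValueError if absent
    counts = Counter(
        row[index] for row in document["data"] if row[index] is not None
    )
    return counts.most_common(n)


# Made-up miniature document, only to show the expected shape.
sample_document = json.loads("""
{
  "meta": {"view": {"columns": [{"name": "ProductName"},
                                 {"name": "ChemicalName"}]}},
  "data": [["Product A", "Chemical X"],
           ["Product B", "Chemical X"],
           ["Product C", "Chemical Y"]]
}
""")
print(top_values(sample_document, "ChemicalName"))
```

Because the whole file is one JSON document and not one record per line, Hadoop cannot split it naively; a common workaround is to first rewrite the `data` rows as one JSON record per line.
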
Dataset: NASA-HTTP Web Server Log |
---|
Description: The NASA-HTTP Web Server Log dataset contains all HTTP requests to the NASA Kennedy Space Center WWW server in Florida for the month of July 1995. Records consist of the following fields: host making the request, timestamp in the format “DAY MON DD HH:MM:SS YYYY”, request, HTTP reply code, bytes in the reply. Detailed information about this dataset can be accessed at http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html |
Download URL: ftp://ita.ee.lbl.gov/traces/NASA_access_log_Jul95.gz |
File Type: ASCII |
File Size: Compressed GZ archive: 19.7 MB; uncompressed ASCII: 205.2 MB |
Sample Hadoop Use Cases: 1) What was the most frequently accessed page on the NASA-HTTP web server in July 1995? 2) At what time of day did the NASA-HTTP web server receive the most hits in July 1995? |
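
Both log questions come down to parsing each line and counting by one field. The Python sketch below assumes Common Log Format lines with a bracketed timestamp, as in the sample in the code comment. If the file we download uses the timestamp layout given in the description above, the pattern needs adjusting. The host names in the sample and the function names are made up for illustration.

```python
import re
from collections import Counter

# Assumed line shape (Common Log Format), for example:
# host.example.com - - [01/Jul/1995:00:00:01 -0400] "GET /index.html HTTP/1.0" 200 6245
LOG_RE = re.compile(
    r'^(\S+) \S+ \S+ \[[^:\]]+:(\d{2}):[^\]]+\] "(.*)" (\d{3}) (\S+)$'
)


def parse_log_line(line):
    """Split one log line into fields, or return None if it does not match."""
    match = LOG_RE.match(line.strip())
    if match is None:
        return None
    host, hour, request, status, size = match.groups()
    parts = request.split()
    # A request normally reads "METHOD /path PROTOCOL"; else keep the raw text.
    path = parts[1] if len(parts) >= 2 else request
    return {"host": host, "hour": hour, "path": path, "status": status}


def count_by(lines, field):
    """Count parsed records by one field, e.g. 'path' or 'hour'."""
    counts = Counter()
    for line in lines:
        record = parse_log_line(line)
        if record is not None:
            counts[record[field]] += 1
    return counts
```

Calling `count_by(lines, "path").most_common(1)` addresses the first question and `count_by(lines, "hour").most_common(1)` the second; on Hadoop, the parsing would sit in the mapper and the counting in the reducer.
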
We hope the datasets listed in this post are helpful for practicing Hadoop programming. If you have any questions or comments about this blog post, or would like to share a publicly available, ready-to-use dataset for Hadoop, please feel free to post in the comment section below.