Data! I need more data!
Big data! I don’t know how many times lately I have read or heard that computer science students need to work with big data. But what is big data, and where do you get it? If you have ever tried to build fake data, you know it can be hard. This is especially true if you want the data to be “real” by some definition of real. Fortunately, there is a huge amount of data on the Internet. The US Government has some great collections of data, often available in several formats, including Excel, comma-delimited text files, HTML, and others. Below are a few of my favorite data sources.
The US Census Bureau has several data sets, including one on popular surnames from the 2000 Census, that you can download and use.
- File A: Top 1000 Names [XLS – 132k]
- File B: Surnames Occurring 100 or more times [ZIP – 357k] (151,671 records)
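If you grab one of the comma-delimited files, a few lines of Python are enough to get students started. This is only a minimal sketch: the column names (`name`, `rank`, `count`) and the sample rows are assumptions made up for illustration, so check the header of the file you actually download. The helper names `load_surnames` and `top_n` are hypothetical too.

```python
import csv
import io

# Made-up sample standing in for a downloaded file. The column names and
# the numbers are assumptions for illustration, not real Census values.
SAMPLE = """name,rank,count
SMITH,1,300
JOHNSON,2,200
WILLIAMS,3,100
"""

def load_surnames(text_stream):
    """Read comma-delimited rows and return a list of (name, count) tuples."""
    reader = csv.DictReader(text_stream)
    return [(row["name"], int(row["count"])) for row in reader]

def top_n(records, n):
    """Return the n records with the largest counts, biggest first."""
    return sorted(records, key=lambda record: record[1], reverse=True)[:n]

records = load_surnames(io.StringIO(SAMPLE))
print(top_n(records, 2))
```

To run it against a real file, pass an open file object instead of the `io.StringIO` sample, for example `open("surnames.csv", newline="")`, where `surnames.csv` is a placeholder for whatever you name the download.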
For a list of popular baby names that goes beyond the top 1000, you can visit the Social Security website. There is other data there as well.
The Bureau of Labor Statistics has a lot of data, including this helpful page of Databases, Tables & Calculators by Subject.
The National Center for Education Statistics (NCES) has data related to education. They even have some tools for building your own custom data sets, which can be downloaded in several formats.
Want some large text files for analysis and projects? Take a look at the large collection of free books at Project Gutenberg. There are books there in many languages, by the way!
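A classic first project with a large text file is counting word frequencies. Here is a rough sketch using Python's `collections.Counter`. The function name `word_counts` and the sample sentence are my own placeholders, and the simple pattern only matches unaccented a-z letters and apostrophes, so it would need adjusting for books in other languages.

```python
import re
from collections import Counter

def word_counts(text):
    """Count how often each word appears, ignoring case and punctuation."""
    # Assumption: a "word" is a run of a-z letters or apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

# Short stand-in for the contents of a downloaded book.
sample = "The cat and the hat. The end."
counts = word_counts(sample)
print(counts.most_common(1))
```

For a real book, read the file first, for example `open("book.txt", encoding="utf-8").read()`, where `book.txt` is a placeholder name for a plain-text download, and pass the result to `word_counts`.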
There are a couple of other links in the comments as I update this over the weekend. I really hope more of you will add your favorite online data sets. Thanks for the comments!
Comments
Anonymous
May 20, 2011
And tons here: http://www.data.gov/
Anonymous
May 20, 2011
If you need lots, like 100+ GB, of more unstructured data (e.g. emails) that's legal/free/etc. to use, there's the Enron email data set. Available in PST, XML or text. coolthingoftheday.blogspot.com/.../and-even-more-enron-psts-that-is-were.html