views: 317

answers: 7
+7  Q: 

Large Data Sets

I'm always looking for large data sets to test various types of programs on. Does anyone have any suggestions?

+7  A: 

Check out the Netflix contest. I believe they released their database, or a large subset of it, to facilitate the contest.

UPDATE: Their FAQ says the subset you can download has 100 million entries.

Mike Stone
+1  A: 

You might want to look at generating random data for fuzz testing. That would give you a practically unlimited amount of test data, and you're more likely to hit edge cases.
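As a rough illustration of the idea, here is a minimal Python sketch of a random record generator. The record layout (name, age, payload) and the function names `random_record` and `generate_records` are made up for this example and would need to match whatever your program actually consumes. It deliberately mixes in boundary values and uses a fixed seed so that a failing run can be reproduced.

```python
import random
import string


def random_record(rng):
    """Return one random (name, age, payload) tuple, biased toward edge cases."""
    # Hypothetical record layout: adjust the fields to your own input format.
    # Boundary lengths and values are mixed in on purpose.
    name_len = rng.choice([0, 1, 255, rng.randint(2, 40)])
    name = "".join(rng.choice(string.printable) for _ in range(name_len))
    age = rng.choice([-1, 0, 2**31 - 1, rng.randint(1, 120)])
    payload = bytes(rng.getrandbits(8) for _ in range(rng.randint(0, 64)))
    return name, age, payload


def generate_records(count, seed=0):
    """Yield `count` records lazily, so memory use does not grow with count."""
    rng = random.Random(seed)  # fixed seed: the same seed gives the same data
    for _ in range(count):
        yield random_record(rng)


if __name__ == "__main__":
    for record in generate_records(3, seed=1):
        print(record)
```

Because the generator is lazy, you can ask for millions of records and stream them straight into the program under test without holding them all in memory.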

Could you give some more information on what kind of test data you want, in what format, and for what types of applications?

Jon Galloway
+1  A: 

I don't know what your target platform is, but if you're developing against an MSSQL database, check out Visual Studio for Database Professionals. It has a very cool feature that generates data for your schema using a data plan that you define.

Redgate also has a data generation tool, but I haven't used it.

The advantage is that you can create a data generation plan and use it to populate your database with large amounts of consistent data, which can be tuned to test specific areas of your schema.
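If you don't have those tools, the same idea can be approximated by hand. The sketch below is not how Visual Studio or the Redgate tool work internally; it is just a hypothetical "plan" expressed as a Python dict, run against an in-memory SQLite database instead of MSSQL so it is self-contained. The `orders` table, its columns, and the `PLAN` and `populate` names are all invented for illustration.

```python
import random
import sqlite3

# Hypothetical data plan: column name -> function drawing one value from rng.
# Tune these generators to stress the parts of the schema you care about.
PLAN = {
    "customer_id": lambda rng: rng.randint(1, 1000),
    "amount_cents": lambda rng: rng.choice([0, 1, 999_999, rng.randint(100, 50_000)]),
    "status": lambda rng: rng.choice(["new", "paid", "refunded"]),
}


def populate(conn, rows, seed=42):
    """Fill the (assumed) orders table with `rows` rows drawn from PLAN."""
    rng = random.Random(seed)  # same seed and plan give the same rows
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(customer_id INTEGER, amount_cents INTEGER, status TEXT)"
    )
    columns = list(PLAN)
    sql = "INSERT INTO orders ({}) VALUES ({})".format(
        ", ".join(columns), ", ".join("?" for _ in columns)
    )
    # The generator expression feeds rows one at a time to executemany.
    conn.executemany(
        sql, (tuple(PLAN[c](rng) for c in columns) for _ in range(rows))
    )
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    populate(conn, 10_000)
    print(conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0])
```

Keeping the plan separate from the loading code means you can rerun it with the same seed to get an identical data set, or change one generator to target a specific column.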

lomaxx
+1  A: 

You might also want to check out theinfo by Aaron Swartz.

From the site:

This is a site for large data sets and the people who love them: the scrapers and crawlers who collect them, the academics and geeks who process them, the designers and artists who visualize them. It's a place where they can exchange tips and tricks, develop and share tools together, and begin to integrate their particular projects.

cnu
+1  A: 

I've done some work with the Wikimedia download sets, which are huge XML files. Unfortunately, their download server currently appears to be having disk space issues, so many of the data sets aren't available. When it is available, the entire English Wikipedia data set with full history is 2.8 TB (18 GB compressed).
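Files that size can't be loaded into memory in one go, so you need a streaming parser. Here is a minimal Python sketch using the standard library's `xml.etree.ElementTree.iterparse`. It assumes the dump wraps each article in a `<page>` element with a `<title>` child, so check that against the actual file you download; the `iter_page_titles` name and the tiny in-memory sample are made up for the example.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def iter_page_titles(source):
    """Yield the title of each <page> element as the parser reaches it."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        # Assumed layout: <page><title>...</title>...</page>
        if local_name(elem.tag) == "page":
            for child in elem:
                if local_name(child.tag) == "title":
                    yield child.text
                    break
            # Drop the finished page's children so its text can be freed.
            elem.clear()


if __name__ == "__main__":
    # Tiny stand-in for a real dump file; pass a file path or file object instead.
    sample = (
        b'<mediawiki xmlns="http://example.org/ns">'
        b"<page><title>A</title><revision><text>x</text></revision></page>"
        b"<page><title>B</title></page>"
        b"</mediawiki>"
    )
    for title in iter_page_titles(io.BytesIO(sample)):
        print(title)
```

Note that `elem.clear()` only empties each page; the emptied elements stay attached to the root, so for a multi-terabyte file you may also want to detach them from the root as you go.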

Greg Hewgill
+4  A: 

You might want to have a look at the data for the American Statistical Association Data Expo. It's flight details for all commercial flights in the US for the last 20 years: 120 million records, 11 GB of data.

hadley
+2  A: 

A number of del.icio.us users (including myself) tag pages that contain public data using the "publicdata" tag. You can find that archive here and subscribe to an RSS feed for that tag here. Subscribe to the feed and you'll see a steady stream of interesting datasets that pop up on the web.

Not all of those datasets are large, but they're often interesting.

Jeff Donnici