I'm about to write some example applications and accompanying documents comparing ways of accessing information stored in relational databases. To demonstrate real-life requirements, I need to include a realistic dataset of hundreds of thousands of facts.

Is anyone aware of publicly available, free datasets of that magnitude? I'm looking for datasets of human names with realistic, human-level variance, or hierarchical datasets such as large organizational hierarchies or large categorized product catalogues.

Please point me in the right direction, if you are.


Part 1, human names: http://timecenter.cs.aau.dk/software.htm

Part 2, hierarchical data: no answer yet

+1  A: 

The Wikipedia dump is pretty massive: obligatory Wikipedia link.

ChristopheD
+2  A: 

http://dev.mysql.com/doc/sakila/en/sakila.html

David Stratton
This led me to http://dev.mysql.com/doc/#sampledb which has several promising leads. Thank you.
mikaelhg
Further examination led to http://timecenter.cs.aau.dk/software.htm, which has a pretty nice simulated employee database that MySQL uses in its own sample databases.
mikaelhg
+1  A: 

Your own PC's directory tree is a large hierarchical structure with lots of facts. You probably have a few thousand "facts": file names, modification dates, sizes, extra OS info, etc.

If that's not large enough, find a server that you can log in to. That will be larger.

Not large enough? Get a web crawler and start crawling a big web site. That can be as large as you have the patience to crawl.
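
As a rough sketch of the directory-tree idea, the following loads a tree into SQLite as an adjacency list (each row points at its parent) and then queries a subtree with a recursive CTE. The table name `node` and the helpers `load_tree` and `subtree_size` are made up for illustration, and SQLite is only an assumption here; any relational database with recursive queries should work along the same lines.

```python
import os
import sqlite3


def load_tree(conn, root):
    """Store every directory and file under `root` as one row in `node`.

    Returns the id of the row created for `root` itself.
    """
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS node (
            id        INTEGER PRIMARY KEY,
            parent_id INTEGER REFERENCES node(id),  -- NULL for the root
            name      TEXT NOT NULL,
            is_dir    INTEGER NOT NULL,
            size      INTEGER,
            mtime     REAL
        )
        """
    )

    def insert(parent_id, path, name, is_dir):
        # lstat does not follow symlinks, so a link is recorded as itself.
        st = os.lstat(path)
        cur = conn.execute(
            "INSERT INTO node (parent_id, name, is_dir, size, mtime)"
            " VALUES (?, ?, ?, ?, ?)",
            (parent_id, name, int(is_dir), st.st_size, st.st_mtime),
        )
        return cur.lastrowid

    root = os.path.abspath(root)
    ids = {root: insert(None, root, root, True)}  # path -> row id
    # os.walk visits parents before children, so the parent id is always known.
    for dirpath, dirnames, filenames in os.walk(root):
        parent_id = ids[dirpath]
        for d in dirnames:
            full = os.path.join(dirpath, d)
            ids[full] = insert(parent_id, full, d, True)
        for f in filenames:
            insert(parent_id, os.path.join(dirpath, f), f, False)
    conn.commit()
    return ids[root]


def subtree_size(conn, node_id):
    """Return (file_count, total_bytes) for all files below `node_id`."""
    return conn.execute(
        """
        WITH RECURSIVE sub(id) AS (
            SELECT ?
            UNION ALL
            SELECT n.id FROM node n JOIN sub s ON n.parent_id = s.id
        )
        SELECT COUNT(*), COALESCE(SUM(n.size), 0)
        FROM node n JOIN sub USING (id)
        WHERE n.is_dir = 0
        """,
        (node_id,),
    ).fetchone()
```

To try it, open a connection with `sqlite3.connect(":memory:")`, call `load_tree(conn, some_directory)`, and pass the returned id to `subtree_size`. The sketch does not handle files that vanish or become unreadable during the walk.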

S.Lott
Merely having a bunch of hierarchical node-link data will not help the reader understand how a specific demonstrated solution produces the desired results. For that, the data must give the reader an understandable hierarchical context, such as an organizational hierarchy or the categorical hierarchy of a tools catalog.
mikaelhg
A filesystem is a standard, widely understood, almost universal "hierarchical context". It seems far more universal than organizations or a tools catalog.
S.Lott