views: 210

answers: 5

How would I get a subset (say 100MB) of Wikipedia's pages? I've found you can get the whole dataset as XML, but it's more like 1 or 2 gigs; I don't need that much.

I want to experiment with implementing a map-reduce algorithm.

Having said that, if I could just find 100 megs worth of textual sample data from anywhere, that would also be good. E.g. the Stack Overflow database, if it's available, would possibly be a good size. I'm open to suggestions.

Edit: Any that aren't torrents? I can't get those at work.

+3  A: 

The Stack Overflow database is available for download.

Alex
Pity it's a torrent, I can't get those at work.
Chris
Here's a link to the latest download: http://blog.stackoverflow.com/category/cc-wiki-dump/
Chris
+1  A: 

If you wanted to get a copy of the stackoverflow database, you could do that from the creative commons data dump.

Out of curiosity, what are you using all this data for?

Mike Cooper
I want to experiment with implementing a mapreduce algorithm
Chris
A: 

One option is to download the entire Wikipedia dump and then use only part of it. You can either decompress the whole thing and then use a simple script to split the file into smaller files (e.g. here), or, if you are worried about disk space, you can write a script that decompresses and splits on the fly, stopping the decompression at whatever point you want. Wikipedia Dump Reader can be your inspiration for decompressing and processing on the fly, if you're comfortable with Python (look at mparser.py).
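
A minimal sketch of the decompress-and-stop-early idea, in Python (untested; the dump filename is just an example, and the truncated output won't be well-formed XML, though for raw sample text that may not matter):

    import bz2

    LIMIT = 100 * 1024 * 1024  # stop after roughly 100 MB of decompressed text

    written = 0
    with bz2.BZ2File('enwiki-latest-pages-articles.xml.bz2') as dump, \
         open('wiki-sample.xml', 'wb') as out:
        for chunk in iter(lambda: dump.read(1024 * 1024), b''):
            out.write(chunk)
            written += len(chunk)
            if written >= LIMIT:
                break  # the rest of the dump is never decompressed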

If you don't want to download the entire thing, you're left with the option of scraping. The Export feature might be helpful for this, and the wikipediabot was also suggested in this context.
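
If the Export route sounds right, something along these lines might work (untested sketch; the article titles and User-Agent string are placeholders to replace with your own):

    import urllib.parse
    import urllib.request

    titles = ['MapReduce', 'Apache Hadoop', 'Distributed computing']  # placeholder titles

    for title in titles:
        # Special:Export returns the page's wikitext wrapped in XML
        url = 'https://en.wikipedia.org/wiki/Special:Export/' + urllib.parse.quote(title)
        req = urllib.request.Request(url, headers={'User-Agent': 'wiki-sample-fetcher/0.1'})
        with urllib.request.urlopen(req) as resp:
            data = resp.read()
        with open(title.replace(' ', '_').replace('/', '_') + '.xml', 'wb') as out:
            out.write(data)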

Daphna Shezaf
Yeah, I'm in Australia, and our internet download limits kinda preclude downloading the whole lot. Having said that, we're all getting fibre-to-the-home broadband (in a million years), and it'll send our country broke, so I could always wait for that? /rant
Chris
Right. Then look at the Export feature. If I understand it correctly, it's less heavy on the servers and on bandwidth than crawling.
Daphna Shezaf
A: 

You could use a web crawler and scrape 100MB of data?

ben
Not too keen on punishing their servers that much!
Chris
+1  A: 

Chris, you could just write a small program to hit the Wikipedia "Random Page" link until you get 100MB of web pages: http://en.wikipedia.org/wiki/Special:Random. You'll want to discard any duplicates you might get, and you might also want to limit the number of requests you make per minute (though some fraction of the articles will be served up by intermediate web caches, not Wikipedia servers). But it should be pretty easy.
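
A rough, untested sketch of that loop (the output filename and User-Agent string are placeholders; note this collects rendered HTML, so the 100MB includes markup):

    import time
    import urllib.request

    LIMIT = 100 * 1024 * 1024  # roughly 100 MB of fetched page text
    seen = set()
    collected = 0

    with open('random-pages.html', 'wb') as out:
        while collected < LIMIT:
            req = urllib.request.Request(
                'https://en.wikipedia.org/wiki/Special:Random',
                headers={'User-Agent': 'wiki-sample-fetcher/0.1'})
            with urllib.request.urlopen(req) as resp:
                final_url = resp.geturl()  # Special:Random redirects to a real article
                page = resp.read()
            if final_url not in seen:      # discard duplicate articles
                seen.add(final_url)
                out.write(page)
                collected += len(page)
            time.sleep(1)                  # crude politeness limit, ~60 requests/minute

At a guess of around 50KB per rendered page, 100MB is on the order of 2,000 requests, so at one request per second the whole run should finish in well under an hour.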

Jim Ferrans
You know, that's not a bad idea. It would give a nice subset. I'm worried that it'll simply take forever; that's my only issue.
Chris