views: 210

answers: 5

How would I get a subset (say 100MB) of Wikipedia's pages? I've found you can get the whole dataset as XML, but it's more like 1 or 2 gigs; I don't need that much.

I want to experiment with implementing a map-reduce algorithm.

Having said that, if I could just find 100 megs worth of textual sample data from anywhere, that would also be good. E.g. the Stack Overflow database, if it's available, would possibly be a good size. I'm open to suggestions.

Edit: Any that aren't torrents? I can't get those at work.

+3  A: 

The Stack Overflow database is available for download.

Alex
Pity it's a torrent, I can't get those at work.
Chris
Here's a link to the latest download: http://blog.stackoverflow.com/category/cc-wiki-dump/
Chris
+1  A: 

If you wanted to get a copy of the stackoverflow database, you could do that from the creative commons data dump.

Out of curiosity, what are you using all this data for?

Mike Cooper
I want to experiment with implementing a mapreduce algorithm
Chris
A: 

One option is to download the entire Wikipedia dump and then use only part of it. You can either decompress the whole thing and then use a simple script to split the file into smaller files (e.g. here), or, if you are worried about disk space, you can write a script that decompresses and splits on the fly, stopping the decompression at whatever point you want. Wikipedia Dump Reader can be your inspiration for decompressing and processing on the fly, if you're comfortable with Python (look at mparser.py).
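
A minimal sketch of the decompress-and-stop-early idea, in Python (untested; the dump filename is just an example, and the truncated output won't be well-formed XML, though for raw sample text that may not matter):

    import bz2

    LIMIT = 100 * 1024 * 1024  # stop after roughly 100 MB of decompressed text

    written = 0
    with bz2.BZ2File('enwiki-latest-pages-articles.xml.bz2') as dump, \
         open('wiki-sample.xml', 'wb') as out:
        for chunk in iter(lambda: dump.read(1024 * 1024), b''):
            out.write(chunk)
            written += len(chunk)
            if written >= LIMIT:
                break  # the rest of the dump is never decompressed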

If you don't want to download the entire thing, you're left with the option of scraping. The Export feature might be helpful for this, and the wikipediabot was also suggested in this context.
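
If the Export route sounds right, something along these lines might work (untested sketch; the article titles and User-Agent string are placeholders to replace with your own):

    import urllib.parse
    import urllib.request

    titles = ['MapReduce', 'Apache Hadoop', 'Distributed computing']  # placeholder titles

    for title in titles:
        # Special:Export returns the page's wikitext wrapped in XML
        url = 'https://en.wikipedia.org/wiki/Special:Export/' + urllib.parse.quote(title)
        req = urllib.request.Request(url, headers={'User-Agent': 'wiki-sample-fetcher/0.1'})
        with urllib.request.urlopen(req) as resp:
            data = resp.read()
        with open(title.replace(' ', '_').replace('/', '_') + '.xml', 'wb') as out:
            out.write(data)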

Daphna Shezaf
Yeah, I'm in Australia, and our internet download limits kinda preclude downloading the whole lot. Having said that, we're all getting fibre-to-the-home broadband (in a million years), and it'll send our country broke, so I could always wait for that? /rant
Chris
Right. Then look at the Export feature. If I understand it correctly, it's less heavy on the servers and on bandwidth than crawling.
Daphna Shezaf
A: 

You could use a web crawler and scrape 100MB of data?

ben
Not too keen on punishing their servers that much!
Chris
+1  A: 

Chris, you could just write a small program to hit the Wikipedia "Random Page" link until you get 100MB of web pages: http://en.wikipedia.org/wiki/Special:Random. You'll want to discard any duplicates you might get, and you might also want to limit the number of requests you make per minute (though some fraction of the articles will be served up by intermediate web caches, not Wikipedia servers). But it should be pretty easy.
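
A rough, untested sketch of that loop (the output filename and User-Agent string are placeholders; note this collects rendered HTML, so the 100MB includes markup):

    import time
    import urllib.request

    LIMIT = 100 * 1024 * 1024  # roughly 100 MB of fetched page text
    seen = set()
    collected = 0

    with open('random-pages.html', 'wb') as out:
        while collected < LIMIT:
            req = urllib.request.Request(
                'https://en.wikipedia.org/wiki/Special:Random',
                headers={'User-Agent': 'wiki-sample-fetcher/0.1'})
            with urllib.request.urlopen(req) as resp:
                final_url = resp.geturl()  # Special:Random redirects to a real article
                page = resp.read()
            if final_url not in seen:      # discard duplicate articles
                seen.add(final_url)
                out.write(page)
                collected += len(page)
            time.sleep(1)                  # crude politeness limit, ~60 requests/minute

At a guess of around 50KB per rendered page, 100MB is on the order of 2,000 requests, so at one request per second the whole run should finish in well under an hour.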

Jim Ferrans
You know, that's not a bad idea. It would give a nice subset. I'm worried that it'll simply take forever; that's my only issue.
Chris