For an NLP project of mine, I want to download a large number of pages (say, 10000) at random from Wikipedia. Without downloading the entire XML dump, this is what I can think of:
- Open a Wikipedia page
- Parse the HTML for links in a breadth-first fashion and open each linked page
- Recursively open links on the pages obtained in step 2
In steps 2 and 3, I will quit once I have reached the number of pages I want (a rough sketch of this idea is below).
How would you do it? Please suggest better ideas you can think of.
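
A rough sketch of the breadth-first idea, for reference (this is only my own sketch: the regex link filter, the stopping rule, and the lack of rate limiting are simplifications you would want to tighten up):

# Breadth-first crawl of Wikipedia article links (Python 2).
# Assumptions: links are pulled out with a crude regex, only /wiki/ links
# without a namespace colon are followed, and we stop after max_pages pages.
import re
import urllib2
from collections import deque

def bfs_crawl(seed_url, max_pages):
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    seen = set([seed_url])
    queue = deque([seed_url])
    pages = []
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = opener.open(url).read()
        except urllib2.URLError:
            continue  # skip pages that fail to load
        pages.append((url, html))
        # Very rough link extraction; excluding ':' drops Talk:, File:, etc.
        for path in re.findall(r'href="(/wiki/[^"#:]+)"', html):
            link = 'http://en.wikipedia.org' + path
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages

# Example: pages = bfs_crawl('http://en.wikipedia.org/wiki/Natural_language_processing', 100)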
ANSWER: This is my Python code:
# Get 10000 random pages from Wikipedia.
import urllib2
import os
import shutil
# Recreate the directory that stores the downloaded HTML pages.
if os.path.exists('randompages'):
    print "Deleting the old randompages directory"
    shutil.rmtree('randompages')
os.mkdir('randompages')
print "Created the directory for storing the pages"
num_page = raw_input('Number of pages to retrieve: ')
# Build one opener and reuse it; Special:Random redirects to a random article.
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
for i in range(int(num_page)):
    infile = opener.open('http://en.wikipedia.org/wiki/Special:Random')
    page = infile.read()
    # Write the raw HTML to a file.
    # TODO: Strip HTML markup from the page before feeding it to the NLP pipeline.
    with open('randompages/file' + str(i) + '.html', 'w') as f:
        f.write(page)
    print "Retrieved and saved page", i + 1