views: 253

answers: 5

For an NLP project of mine, I want to download a large number of pages (say, 10000) at random from Wikipedia. Without downloading the entire XML dump, this is what I can think of:

  1. Open a Wikipedia page
  2. Parse the HTML for links in a Breadth First Search fashion and open each page
  3. Recursively open links on the pages obtained in 2

In steps 2 and 3, I will quit once I have reached the number of pages I want; a rough sketch of what I have in mind is below.
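
Roughly, this is the kind of thing I am imagining (untested sketch, Python 2 standard library only; the /wiki/ link filter and the hard-coded en.wikipedia.org prefix are just assumptions on my part):

# Untested sketch of the breadth-first crawl idea (Python 2, standard library only).
# No error handling; Wikipedia may still throttle or block rapid requests.
import urllib2
import HTMLParser
from collections import deque

class LinkCollector(HTMLParser.HTMLParser):
    # Collect href values that look like article links (/wiki/..., no namespace colon).
    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href') or ''
            if href.startswith('/wiki/') and ':' not in href:
                self.links.append('http://en.wikipedia.org' + href)

def fetch(url):
    req = urllib2.Request(url, headers={'User-agent': 'Mozilla/5.0'})
    return urllib2.urlopen(req).read()

def bfs_crawl(seed, limit):
    # Breadth-first search over article links, stopping at `limit` pages.
    seen, queue, pages = set([seed]), deque([seed]), []
    while queue and len(pages) < limit:
        url = queue.popleft()
        html = fetch(url)
        pages.append((url, html))
        parser = LinkCollector()
        parser.feed(html.decode('utf-8', 'ignore'))
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages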

How would you do it? Please suggest any better ideas you can think of.

ANSWER: This is my Python code:

# Get N random pages from Wikipedia via Special:Random.
import os
import shutil
import urllib2

# Recreate the directory that stores the HTML pages.
if os.path.exists('randompages'):
    print "Deleting the old randompages directory"
    shutil.rmtree('randompages')

print "Creating the directory for storing the pages"
os.mkdir('randompages')

num_pages = int(raw_input('Number of pages to retrieve: '))

# Build the opener once and set a browser-like user agent
# (Wikipedia may reject the default urllib2 one).
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]

for i in range(num_pages):
    infile = opener.open('http://en.wikipedia.org/wiki/Special:Random')
    page = infile.read()

    # Write it to a file.
    # TODO: Strip HTML from page
    f = open('randompages/file' + str(i) + '.html', 'w')
    f.write(page)
    f.close()

    print "Retrieved and saved page", i + 1
+18  A: 
for i = 1 to 10000
    get "http://en.wikipedia.org/wiki/Special:Random"
Tommy Carlier
This can give duplicates.
SLaks
You can ignore the pages that you already downloaded.
Tommy Carlier
Though this can give duplicates, it probably won't matter much for me. +1 for the quick, simple thought.
Amit
My comment above applies ;-) Wikipedia won't allow you; you'll end up with 10000 error pages.
Khelben
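
One untested way to skip the duplicates mentioned above: Special:Random answers with a redirect to the chosen article, and urllib2's response object reports the final URL via geturl(), so repeats can be detected and retried. The directory and file naming below just mirror the question's code and are otherwise arbitrary:

# Untested sketch: skip duplicate articles by remembering the redirected URL.
import urllib2

seen = set()
saved = 0
while saved < 10000:
    req = urllib2.Request('http://en.wikipedia.org/wiki/Special:Random',
                          headers={'User-agent': 'Mozilla/5.0'})
    resp = urllib2.urlopen(req)
    url = resp.geturl()   # the article that Special:Random redirected to
    if url in seen:
        continue          # already downloaded; draw again
    seen.add(url)
    with open('randompages/file%d.html' % saved, 'w') as f:
        f.write(resp.read())
    saved += 1
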
A: 

You may be able to do an end run around most of the requirement:

http://cs.fit.edu/~mmahoney/compression/enwik8.zip

is a ZIP file containing 100 MB of Wikipedia, already pulled out for you. The linked file is ~ 16 MB in size.

Carl Smotricz
A: 

Look at the dbpedia project, e.g. here: http://wiki.dbpedia.org/Downloads34

There are small downloadable chunks with at least some article URLs. Once you have parsed 10000 of them, you can batch-download the pages carefully ...
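
An untested sketch of the batch-download step, assuming the URLs have already been extracted from a DBpedia chunk into a plain text file (one URL per line; the file name, output directory, and one-second delay are arbitrary choices):

# Untested sketch: fetch the first 10000 article URLs listed in a text file.
import time
import urllib2

def fetch(url):
    req = urllib2.Request(url, headers={'User-agent': 'Mozilla/5.0'})
    return urllib2.urlopen(req).read()

with open('article_urls.txt') as f:
    urls = [line.strip() for line in f if line.strip()][:10000]

for i, url in enumerate(urls):
    with open('randompages/file%d.html' % i, 'w') as out:
        out.write(fetch(url))
    time.sleep(1)   # be gentle with the servers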

The MYYN
+15  A: 

Wikipedia has an API. With this API you can request random articles in a given namespace:

http://en.wikipedia.org/w/api.php?action=query&list=random&rnnamespace=0&rnlimit=5

and for each article you can also get the wiki text:

http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Main%20Page&rvprop=content
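
For example, a rough, untested Python 2 sketch that drives these two queries with just the standard library (format=json is appended to get machine-readable output; the helper name api_get is arbitrary):

# Untested sketch: fetch random article titles, then the wiki text of each.
import json
import urllib
import urllib2

API = 'http://en.wikipedia.org/w/api.php'

def api_get(params):
    params['format'] = 'json'
    req = urllib2.Request(API + '?' + urllib.urlencode(params),
                          headers={'User-agent': 'Mozilla/5.0'})
    return json.load(urllib2.urlopen(req))

# 1. Random article titles (namespace 0 = articles).
data = api_get({'action': 'query', 'list': 'random',
                'rnnamespace': 0, 'rnlimit': 5})
titles = [entry['title'] for entry in data['query']['random']]

# 2. Wiki text of each title (the markup sits under the '*' key).
for title in titles:
    data = api_get({'action': 'query', 'prop': 'revisions',
                    'titles': title.encode('utf-8'), 'rvprop': 'content'})
    page = data['query']['pages'].values()[0]
    print title, len(page['revisions'][0]['*']), 'characters of wiki text'
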
Pierre
+1 for getting the wiki text instead of HTML
Iamamac
Do you have any experience with using the API in Python? Any Python libraries?
Amit
The API returns data as JSON or XML, so I guess any language is able to parse this kind of structured data. You will also find many libraries here: http://www.mediawiki.org/wiki/API:Client_Code#Python
Pierre
Thanks Pierre. Your answer is correct too, but I accepted Tommy's answer since I did not need any modifications to my existing code to do what I wanted. Sorry.
Amit
fair enough :-)
Pierre
+1  A: 

I'd go the opposite way: start with the XML dump, and then throw away what you don't want.

In your case, if you are looking to do natural language processing, I would assume that you are interested in pages that have complete sentences, and not lists of links. If you spider the links in the manner you describe, you'll be hitting a lot of link pages.

And why avoid the XML, when you get the benefit of using XML parsing tools that will make your selection process easier?
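
An untested sketch of that selection step, assuming a decompressed pages-articles dump and Python 2's cElementTree; the namespace stripping and the function names are just illustrative:

# Untested sketch: stream a decompressed pages-articles XML dump and keep
# the first N (title, wikitext) pairs.  Dump tags are namespace-qualified,
# so the namespace prefix is stripped before matching.
import xml.etree.cElementTree as etree

def strip_ns(tag):
    return tag.split('}')[-1]

def first_n_articles(dump_path, n):
    count = 0
    for event, elem in etree.iterparse(dump_path):
        if strip_ns(elem.tag) != 'page':
            continue
        title, text = None, None
        for child in elem.iter():
            name = strip_ns(child.tag)
            if name == 'title':
                title = child.text
            elif name == 'text':
                text = child.text
        yield title, text
        elem.clear()   # keep memory bounded while streaming
        count += 1
        if count >= n:
            break

# e.g.: for title, text in first_n_articles('enwiki-pages-articles.xml', 10000): ...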

Michael Dorfman
Because it's multiple terabytes, uncompressed.
Jason Orendorff