For an NLP project of mine, I want to download a large number of pages (say, 10000) at random from Wikipedia. Without downloading the entire XML dump, this is what I can think of:
- Open a Wikipedia page
- Parse the HTML for links in a breadth-first fashion and open each linked page
- Recursively open links on the pages obtained in step 2
In steps 2 and 3, I will quit once I have reached the number of pages I want (a rough sketch of this idea is below).
How would you do it? Please suggest better ideas you can think of.
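
A rough sketch of the breadth-first idea, for reference (this is only my own sketch: the regex link filter, the stopping rule, and the lack of rate limiting are simplifications you would want to tighten up):

# Breadth-first crawl of Wikipedia article links (Python 2).
# Assumptions: links are pulled out with a crude regex, only /wiki/ links
# without a namespace colon are followed, and we stop after max_pages pages.
import re
import urllib2
from collections import deque

def bfs_crawl(seed_url, max_pages):
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    seen = set([seed_url])
    queue = deque([seed_url])
    pages = []
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = opener.open(url).read()
        except urllib2.URLError:
            continue  # skip pages that fail to load
        pages.append((url, html))
        # Very rough link extraction; excluding ':' drops Talk:, File:, etc.
        for path in re.findall(r'href="(/wiki/[^"#:]+)"', html):
            link = 'http://en.wikipedia.org' + path
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages

# Example: pages = bfs_crawl('http://en.wikipedia.org/wiki/Natural_language_processing', 100)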
ANSWER: This is my Python code:
# Get 10000 random pages from Wikipedia.
import urllib2
import os
import shutil
# Recreate the directory that stores the downloaded HTML pages.
if os.path.exists('randompages'):
    print "Deleting the old randompages directory"
    shutil.rmtree('randompages')
os.mkdir('randompages')
print "Created the directory for storing the pages"
num_page = raw_input('Number of pages to retrieve: ')
# Build one opener and reuse it; Special:Random redirects to a random article.
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
for i in range(int(num_page)):
    infile = opener.open('http://en.wikipedia.org/wiki/Special:Random')
    page = infile.read()
    # Write the raw HTML to a file.
    # TODO: Strip HTML markup from the page before feeding it to the NLP pipeline.
    with open('randompages/file' + str(i) + '.html', 'w') as f:
        f.write(page)
    print "Retrieved and saved page", i + 1