For an NLP project of mine, I want to download a large number of pages (say, 10000) at random from Wikipedia. Without downloading the entire XML dump, this is what I can think of:
1. Open a Wikipedia page
2. Parse the HTML for links in a breadth-first-search fashion and open each linked page
3. Recursively open the links on the pages obtained in step 2
In steps 2 and 3, I will stop as soon as I have reached the number of pages I want.
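A rough sketch of that breadth-first crawl, assuming links are pulled out with a simple regular expression and starting from a hypothetical seed article, might look like this:

# Rough BFS crawl: follow /wiki/ article links until `target` pages are seen.
import re
import urllib2
from collections import deque

target = 10000
seen = set()
queue = deque(['/wiki/Natural_language_processing'])  # hypothetical seed article

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]

while queue and len(seen) < target:
    path = queue.popleft()
    if path in seen:
        continue
    seen.add(path)
    html = opener.open('http://en.wikipedia.org' + path).read()
    # Queue plain article links; skip anchors, query strings and namespaced pages.
    for link in re.findall(r'href="(/wiki/[^":#?]+)"', html):
        if link not in seen:
            queue.append(link)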
How would you do it? Please suggest any better ideas you can think of.
ANSWER: This is my Python code:
# Get random pages from Wikipedia and save them as local HTML files.
import urllib2
import os
import shutil

# Recreate the directory that stores the downloaded pages.
if os.path.exists('randompages'):
    print "Deleting the old randompages directory"
    shutil.rmtree('randompages')
print "Creating the directory for storing the pages"
os.mkdir('randompages')

num_page = raw_input('Number of pages to retrieve: ')

# Build the opener once, with a browser-like User-Agent so the requests are not rejected.
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]

for i in range(int(num_page)):
    # Special:Random redirects to a random article on every request.
    infile = opener.open('http://en.wikipedia.org/wiki/Special:Random')
    page = infile.read()
    # Write the raw HTML to a file.
    # TODO: Strip HTML from page
    f = open('randompages/file' + str(i) + '.html', 'w')
    f.write(page)
    f.close()
    print "Retrieved and saved page", i + 1