Dear Pythonistas,

I have the following link:

http://www.europarl.europa.eu/sides/getDoc.do?type=REPORT&mode=XML&reference=A7-2010-0001&language=EN

The reference part of the URL contains the following information:

A7 == The parliament (current is the seventh parliament, the former is A6 and so forth)

2010 == year

0001 == document number
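
Put together, those three parts form the reference in the URL, e.g. (a quick sketch; `make_reference` is just an illustrative name):

```python
def make_reference(parliament, year, number):
    """Compose the 'reference' URL parameter, e.g. A7-2010-0001."""
    return "A%d-%d-%04d" % (parliament, year, number)

URL_TEMPLATE = ("http://www.europarl.europa.eu/sides/getDoc.do"
                "?type=REPORT&mode=XML&reference=%s&language=EN")

print(URL_TEMPLATE % make_reference(7, 2010, 1))
```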

For every year and parliament I would like to identify the number of documents on the website. The task is complicated by the fact that for 2010, for instance, numbers 186, 195 and 196 have empty pages, while the max number is 214. Ideally the output should be a vector with all the document numbers, excluding the missing ones.

Can anyone tell me if this is possible in Python?

Best, Thomas

+3  A: 

First, make sure that scraping their site is legal.

Second, notice that when a document is not present, the HTML file contains:

<title>Application Error</title>

Third, use urllib to iterate over all the things you want to:

import urllib

root = "http://www.europarl.europa.eu/sides/getDoc.do?type=REPORT&mode=XML&reference="
for p in range(1, 8):        # parliaments A1 .. A7
    for y in range(2000, 2011):
        doc = 1
        while True:
            # use urllib to open the url: (root)+p+y+doc
            url = "%sA%d-%d-%04d&language=EN" % (root, p, y, doc)
            # if the HTML has the string "Application Error", break from the while
            if "Application Error" in urllib.urlopen(url).read():
                break
            doc += 1
Arrieta
Thanks, very helpful! The site is public (these are our elected parliamentarians after all :)), so I guess the legal aspect should not be an issue.
Thomas Jensen
+1  A: 

Here's a slightly more complete (but hacky) example which seems to work (using httplib2) - I'm sure you can customise it for your specific needs.

I'd also repeat Arrieta's warning about making sure the site's owner doesn't mind you scraping its content.

#!/usr/bin/env python
import httplib2
h = httplib2.Http(".cache")

parliament = "A7"
year = 2010

#Create two lists, one list of URLs and one list of document numbers.
urllist = []
doclist = []

urltemplate = "http://www.europarl.europa.eu/sides/getDoc.do?type=REPORT&mode=XML&reference=%s-%d-%04u&language=EN"

for document in range(1, 10000):
    url = urltemplate % (parliament,year,document)
    resp, content = h.request(url, "GET")
    if content.find("Application Error") == -1:
        print "Document %04u exists" % (document)
        urllist.append(url)
        doclist.append(document)
    else:
        print "Document %04u doesn't exist" % (document)
print "Parliament %s, year %u has %u documents" % (parliament,year,len(doclist))
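
With the question's 2010 figures (documents run up to 214, with 186, 195 and 196 empty), the desired output vector is just every number with the gaps dropped, which is what doclist collects; as a standalone illustration:

```python
# 2010 figures from the question: max number 214; 186, 195, 196 are empty.
missing = {186, 195, 196}
doclist = [n for n in range(1, 215) if n not in missing]

print(len(doclist))  # 211 valid document numbers
```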
Jon Mills
Thanks Jon for the very elaborate answer, this is great stuff for a guy learning python!
Thomas Jensen
+1  A: 

Here is a solution, though adding a short delay between requests is a good idea:

import urllib
URL_TEMPLATE = "http://www.europarl.europa.eu/sides/getDoc.do?type=REPORT&mode=XML&reference=A7-%d-%.4d&language=EN"
maxRange = 300

for year in [2010, 2011]:
    for page in range(1, maxRange):
        f = urllib.urlopen(URL_TEMPLATE % (year, page))
        text = f.read()
        if "<title>Application Error</title>" in text:
            print "year %d and page %.4d NOT found" % (year, page)
        else:
            print "year %d and page %.4d FOUND" % (year, page)
        f.close()
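
The delay can be factored into a small helper (a sketch; `polite_fetch` and `fetch` are illustrative names, not part of any library):

```python
import time

def polite_fetch(urls, fetch, delay=1.0):
    """Call fetch(url) for each URL, sleeping `delay` seconds between requests."""
    results = []
    for i, url in enumerate(urls):
        if i:  # no pause before the first request
            time.sleep(delay)
        results.append(fetch(url))
    return results
```

Any function that takes a URL and returns its content can be passed as `fetch`, e.g. `urllib.urlopen` (Python 2) or `urllib.request.urlopen` (Python 3).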
zoli2k
Thanks a lot, all the answers here are really nice examples!
Thomas Jensen