I am trying to screen-scrape multiple pages of a website that return an 'HTTP Error 500: Internal Server Error' response but still contain important data in the error HTML.

Normally, I would fetch a page using this (Python 2.6.4):

import urllib2
url = "http://google.com"
data = urllib2.urlopen(url)
data = data.read()

But when attempting to use this on my current URL, I get:

urllib2.HTTPError: HTTP Error 500: Internal Server Error

How can I fetch these error pages (with or without urllib2) even though they return Internal Server Errors?

A: 

If you mean you want to read the body of the 500:

request = urllib2.Request(url, data, headers)
try:
    resp = urllib2.urlopen(request)
    print resp.read()
except urllib2.HTTPError, error:
    print "ERROR: ", error

In your case, you don't need to build up the request. Just do

try:
    resp = urllib2.urlopen(url)
    print resp.read()
except urllib2.HTTPError, error:
    print "ERROR: ", error

So you don't override urllib2.HTTPError; you just handle the exception.

sberry2A
No, I want to read the HTML the server would send to the user's browser if they accidentally went to one of the 500 internal error pages. Similarly, if urllib broke on a 404 page (I'm not sure if it does; I haven't tried), I would want to read the HTML that the 404 page provides (e.g. if the site serves a custom 404 page).
bball
+4  A: 

The HTTPError is a file-like object. You can catch it and then read its contents.

try:
    resp = urllib2.urlopen(url)
    contents = resp.read()
except urllib2.HTTPError, error:
    contents = error.read()
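
For reference, here is the same idea as a self-contained sketch (Python 2.6; the URL below is only a placeholder), which also shows that the status code is available on the exception:

import urllib2

url = "http://example.com/broken-page"  # placeholder URL for illustration

try:
    resp = urllib2.urlopen(url)
    contents = resp.read()              # normal response body
except urllib2.HTTPError, error:
    # HTTPError is itself a file-like response, so the error page's
    # HTML (e.g. the server's 500 page) can still be read
    print "Got HTTP error", error.code
    contents = error.read()

print contents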
Joe Holloway
It works! Thank you!
bball