import urllib

print urllib.urlopen('http://www.reefgeek.com/equipment/Controllers_&_Monitors/Neptune_Systems_AquaController/Apex_Controller_&_Accessories/').read()

The above script works and returns the expected results while:

import urllib2

print urllib2.urlopen('http://www.reefgeek.com/equipment/Controllers_&_Monitors/Neptune_Systems_AquaController/Apex_Controller_&_Accessories/').read()

throws the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.5/urllib2.py", line 124, in urlopen
    return _opener.open(url, data)
  File "/usr/lib/python2.5/urllib2.py", line 387, in open
    response = meth(req, response)
  File "/usr/lib/python2.5/urllib2.py", line 498, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.5/urllib2.py", line 425, in error
    return self._call_chain(*args)
  File "/usr/lib/python2.5/urllib2.py", line 360, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.5/urllib2.py", line 506, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 404: Not Found

Does anyone know why this is? I'm running this from a laptop on my home network with no proxy settings, just straight from the laptop to the router and then out to the web.

+10  A: 

That URL does indeed result in a 404, but with lots of HTML content. urllib2 is handling it (correctly) as an error condition. You can recover the content of that site's 404 page like so:

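import urllib2
try:
    print urllib2.urlopen('http://www.reefgeek.com/equipment/Controllers_&_Monitors/Neptune_Systems_AquaController/Apex_Controller_&_Accessories/').read()
except urllib2.HTTPError, e:
    # HTTPError carries the status line, the headers and the response body
    print e.code
    print e.msg
    print e.headers
    # the body of the 404 page is still readable from the underlying file object
    print e.fp.read()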
Jonathan Feinberg
That's good to know. Out of curiosity, when I type this URL into my browser, it also works. Does this mean that the browser is also receiving a 404 but just displaying the content, like urllib does?
Jerry
@Jerry Yes, that's what this means. You can verify this with Firebug or Safari/Chrome's Web Inspector.
Will McCutchen
I have Firebug and I had checked it, but I didn't see anything that indicated a 404. Is there something special you have to do? Out of morbid curiosity, why do the browsers tolerate such poor standards? Why not just indicate that it couldn't find the file? Is this some type of trick the site is using to thwart bots: return a 404 with content, knowing that browsers will display the content and most bots will move on?
Jerry
It's returning 404 because they have a bug in their web site, I think. A 404 can have whatever content you wish. A legitimate 404, for example, might return a site directory or the results of a text search related to the URL you typed. The browsers are doing what they're supposed to do.
Jonathan Feinberg
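For anyone who wants to reproduce the behaviour without depending on that site, here is a minimal sketch. It is not from the thread: it assumes Python 3, where urllib2 became urllib.request and urllib.error, and it uses a throwaway local server. The names NotFoundWithBody, fetch and demo are made up for illustration. The server answers every GET with a 404 that still carries an HTML body, and fetch returns the status and body either way.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class NotFoundWithBody(BaseHTTPRequestHandler):
    """Hypothetical handler: answers every GET with a 404 plus an HTML body."""

    BODY = b"<html><body>Custom 404 page with real content</body></html>"

    def do_GET(self):
        self.send_response(404)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(self.BODY)))
        self.end_headers()
        self.wfile.write(self.BODY)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def fetch(url):
    """Return (status, body) whether or not the server reports an error."""
    # An empty ProxyHandler bypasses any proxy configured in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as e:
        # HTTPError doubles as a response object, so the body is still readable.
        return e.code, e.read()


def demo():
    """Start the local server, fetch one page from it, then shut it down."""
    server = HTTPServer(("127.0.0.1", 0), NotFoundWithBody)  # port 0: any free port
    thread = threading.Thread(target=server.serve_forever)
    thread.daemon = True
    thread.start()
    try:
        return fetch("http://127.0.0.1:%d/anything" % server.server_port)
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    status, body = demo()
    print(status)
    print(body.decode("utf-8"))
```

Catching HTTPError and reading from it is the same idea as the answer's e.fp.read(), just wrapped so the caller gets the status code and the content together and can decide what to do with a 404 that has a useful page attached.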