How can I determine whether anything exists at a given URL using Python? It can be an HTML page or a PDF file; it shouldn't matter. I've tried the solution on this page, http://code.activestate.com/recipes/101276/, but it just returns 1 when the target is a PDF file or anything else.
Send a HEAD request
import httplib

# HTTPConnection expects a host name, not a full URL.
host = "www.example.com"
connection = httplib.HTTPConnection(host)
connection.request('HEAD', '/')
response = connection.getresponse()
if response.status == 200:
    print "Resource exists"
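The snippet above always asks for '/', so it cannot tell whether a specific page or PDF exists. Below is a rough sketch of the same HEAD idea for a full URL, written for Python 3, where httplib is named http.client. The helper names split_for_request and head_status are made up for illustration, and the 10-second timeout is an arbitrary assumption.

```python
import http.client
from urllib.parse import urlsplit


def split_for_request(url):
    """Break a URL into (scheme, host, port, request target)."""
    parts = urlsplit(url)
    # The request target is the path plus the query string, or "/" if empty.
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return parts.scheme, parts.hostname, parts.port, target


def head_status(url, timeout=10):
    """Send a HEAD request for `url` and return the numeric status code."""
    scheme, host, port, target = split_for_request(url)
    # Pick the connection class that matches the URL scheme.
    if scheme == "https":
        connection = http.client.HTTPSConnection(host, port, timeout=timeout)
    else:
        connection = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        connection.request("HEAD", target)
        return connection.getresponse().status
    finally:
        connection.close()
```

Calling head_status("http://example.com/") would return the status code of that page. Unlike urllib, http.client does not follow redirects, so a 301 or 302 comes back as is.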
The httplib in that example is using HTTP/1.0 instead of 1.1, and as such Slashdot is returning a status code 301 instead of 200. I would recommend using urllib2, and also probably checking for codes 20* and 30*.
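One way to read the "20* and 30*" suggestion is to accept any status from 200 up to, but not including, 400. A minimal sketch of that check follows; the function name status_means_exists is hypothetical, and treating every 3xx as "exists" is an assumption that may not suit every case.

```python
def status_means_exists(status):
    """Treat any 2xx (success) or 3xx (redirect) status as "something is there"."""
    return 200 <= status < 400


# Try the check on a few common status codes.
for status in (200, 301, 404, 500):
    print(status, status_means_exists(status))
```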
The documentation for httplib states:
It is normally not used directly — the module urllib uses it to handle URLs that use HTTP and HTTPS. [...]
The HTTP class is retained only for backward compatibility with 1.5.2. It should not be used in new code. Refer to the online docstrings for usage.
So yes, urllib is the way to open URLs in Python; an HTTP/1.0 client won't get very far on modern web servers.
(Also, a PDF link works for me.)
This solution returns 1 because the server is sending a 200 OK response.
There's something wrong with your server. It should return 404 if the file doesn't exist.
You need to check the HTTP response code. Python example:
from urllib2 import urlopen
code = urlopen("http://example.com/").code
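4xx and 5xx codes probably mean that you cannot get anything from this URL. 4xx status codes describe client errors (like "404 Not Found") and 5xx status codes describe server errors (like "500 Internal Server Error"). Be aware that urlopen raises urllib2.HTTPError for these responses, in which case the code is available as the exception's code attribute. The check itself is: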
if code >= 400:
    print "Nothing there."
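Putting the pieces together, here is a hedged sketch of the status check for Python 3, where urllib2 has been split into urllib.request and urllib.error. The names fetch_status and url_exists, the injectable fetch parameter, and the 10-second timeout are all assumptions made for illustration, not part of any library.

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError


def fetch_status(url, timeout=10):
    """Return the HTTP status code for `url`, or None if no response arrived."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status
    except HTTPError as error:
        # urlopen raises HTTPError for 4xx/5xx; the code is on the exception.
        return error.code
    except URLError:
        # DNS failure, refused connection, and similar network problems.
        return None


def url_exists(url, fetch=fetch_status):
    """True if `fetch` reports a status below 400 for `url`."""
    code = fetch(url)
    return code is not None and code < 400
```

For example, url_exists("http://example.com/") performs a real request, while passing a stand-in such as fetch=lambda u: 404 lets the logic be exercised without network access. Because urlopen follows redirects on its own, the status seen here is normally that of the final target.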