>>> import httplib
>>> conn = httplib.HTTPConnection("www.google.com")
>>> conn.request("HEAD", "/index.html")
>>> res = conn.getresponse()
>>> print res.status, res.reason
200 OK

This code will get the HTTP status code. However, notice that I have to split "www.google.com" and "/index.html" across two separate lines.

And it's confusing.

What if I want to find the status code of an arbitrary URL?

http://mydomain.com/sunny/boo.avi
http://anotherdomain.com/podcast.mp3
http://anotherdomain.com/rss/fee.xml

Can't I just stick the URL into it, and make it work?

Edit: I cannot use urllib, because I don't want to download the file.

A: 

According to the spec, you're supposed to split it up like that. Maybe Python could abstract that away for you a bit, but they're probably just giving you straight access to the request line and headers so you know exactly how everything is being formatted, which for a low-level library is really the preference.

Myles
+5  A: 

Maybe you are better off using the URL library instead?

In Python 2, use urllib2:

>>> import urllib2
>>> url = urllib2.urlopen("http://www.google.com/index.html")
>>> url.getcode()
200

In Python 3, use urllib.request:

>>> import urllib.request
>>> url = urllib.request.urlopen("http://www.google.com/index.html")
>>> url.getcode()
200
Thomas
+1, but as Yann notes below, this will download the whole page, not just the HEAD.
Stephan202
Quite right; see also my other answer: http://stackoverflow.com/questions/1731657/httplib-in-python-to-get-the-status-code-but-it-is-too-tricky/1731800#1731800
Thomas
@Stephan202, I tried Thomas's code with url `http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.31.6.tar.bz2`, and watched my network traffic with `vnstat -l`. I couldn't find any sign that the file was being downloaded. Can you explain your assertion? Thanks!
unutbu
As soon as you call, e.g., `readlines` on the returned object, you'll see your download meter starting to tick. Maybe some OS buffer fills up quickly, and since you don't empty it, the kernel stops sending `ACK`s and the server stops sending data?
Thomas
But Thomas, the code you posted in your answer doesn't use readlines(). Do you believe the code above downloads the whole page? I can't find any evidence of that.
unutbu
No, as I wrote it, it doesn't download (much). But clearly, it doesn't close the connection either. If you do this a few thousand times, the kernel might get real unhappy.
Thomas
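
A minimal sketch (not from the thread) of how you could read only the status code and close the response explicitly, so repeated checks don't leave connections open; status_of is just an illustrative name:

import contextlib
import urllib2

def status_of(url):
    # Open the URL, read only the status code, and close the
    # underlying connection right away.
    with contextlib.closing(urllib2.urlopen(url)) as res:
        return res.getcode()

print status_of("http://www.google.com/index.html")  # e.g. 200
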
+2  A: 

The HTTPConnection constructor takes a host argument (with an optional port). You have to keep the connection (the host) separate from the resource (the path) you actually want.

For a simpler way to fetch web resources directly, you could go with urllib2, but its urlopen only issues GET (or POST when you pass data), not HEAD, so you end up downloading the whole resource.
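
A common workaround (a sketch, not part of this answer) is to subclass urllib2.Request and override get_method so that urlopen sends HEAD; HeadRequest here is just an illustrative name:

import urllib2

class HeadRequest(urllib2.Request):
    # Make urlopen issue HEAD instead of the default GET.
    def get_method(self):
        return "HEAD"

res = urllib2.urlopen(HeadRequest("http://www.google.com/index.html"))
print res.getcode()  # status code, without fetching the body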

Yann Schwartz
A: 

I like urllib2, sample code:

import urllib2
res = urllib2.urlopen('http://google.com/index.html')
res.getcode()  # contains the HTTP status code

If something went wrong, you'll get an exception you can catch.

EDIT: Thanks, changed res.code to res.getcode() since the latter is documented.

Johannes Weiß
The `code` field is undocumented. You should probably use `getcode()` instead.
Thomas
changed, thanks!
Johannes Weiß
+3  A: 

Alternatively, if you expect that actually downloading the data is problematic and you really need the HEAD method, you could parse the URL using urlparse:

>>> import httplib
>>> import urlparse
>>> url = "http://www.google.com/index.html"
>>> (scheme, netloc, path, params, query, fragment) = urlparse.urlparse(url)
>>> conn = httplib.HTTPConnection(netloc)
>>> conn.request("HEAD", urlparse.urlunparse(('', '', path, params, query, fragment)))
>>> res = conn.getresponse()
>>> print res.status, res.reason
302 Found

You can then wrap this in a function that takes the URL as an argument.
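
For instance, a sketch of such a wrapper (get_status is just an illustrative name):

import httplib
import urlparse

def get_status(url):
    # Split the URL into host and resource, issue a HEAD request,
    # and return the status code without downloading the body.
    scheme, netloc, path, params, query, fragment = urlparse.urlparse(url)
    conn = httplib.HTTPConnection(netloc)
    conn.request("HEAD", urlparse.urlunparse(('', '', path, params, query, fragment)))
    res = conn.getresponse()
    conn.close()
    return res.status

print get_status("http://mydomain.com/sunny/boo.avi")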

Thomas
A: 

Keep in mind that not all web servers support HEAD for every resource, so you may end up retrieving the resource anyway. You should write your code accordingly.
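
For example, one way (a hedged sketch, not the answer's own code) to fall back to GET when HEAD is rejected with 405 Method Not Allowed, still without reading the body:

import httplib
import urlparse

def get_status_with_fallback(url):
    # Try HEAD first; if the server rejects the method, retry with
    # GET but read only the status line, never the body.
    scheme, netloc, path, params, query, fragment = urlparse.urlparse(url)
    resource = urlparse.urlunparse(('', '', path, params, query, fragment))
    conn = httplib.HTTPConnection(netloc)
    conn.request("HEAD", resource)
    res = conn.getresponse()
    if res.status == 405:  # Method Not Allowed
        conn.close()
        conn = httplib.HTTPConnection(netloc)
        conn.request("GET", resource)
        res = conn.getresponse()
    status = res.status
    conn.close()
    return status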

Lawrence Oluyede