>>> import httplib
>>> conn = httplib.HTTPConnection("www.google.com")
>>> conn.request("HEAD", "/index.html")
>>> res = conn.getresponse()
>>> print res.status, res.reason
200 OK

This code will get the HTTP status code. However, notice that I have to split "www.google.com" and "/index.html" across two separate lines.

And it's confusing.

What if I want to find the status code of an arbitrary URL?

http://mydomain.com/sunny/boo.avi
http://anotherdomain.com/podcast.mp3
http://anotherdomain.com/rss/fee.xml

Can't I just stick the URL into it, and make it work?

Edit: I cannot use urllib, because I don't want to download the file.

A: 

According to the spec, you're supposed to split it up like that. Maybe Python could abstract that away for you a bit, but they're probably just giving you straight access to the request line and headers so you know exactly how everything is being formatted, which for a low-level library is really the preference.

Myles
+5  A: 

Maybe you are better off using the URL library instead?

In Python 2, use urllib2:

>>> import urllib2
>>> url = urllib2.urlopen("http://www.google.com/index.html")
>>> url.getcode()
200

In Python 3, use urllib.request:

>>> import urllib.request
>>> url = urllib.request.urlopen("http://www.google.com/index.html")
>>> url.getcode()
200
Thomas
+1, but as Yann notes below, this will download the whole page, not just the HEAD.
Stephan202
Quite right; see also my other answer: http://stackoverflow.com/questions/1731657/httplib-in-python-to-get-the-status-code-but-it-is-too-tricky/1731800#1731800
Thomas
@Stephan202, I tried Thomas's code with url `http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.31.6.tar.bz2`, and watched my network traffic with `vnstat -l`. I couldn't find any sign that the file was being downloaded. Can you explain your assertion? Thanks!
unutbu
As soon as you call, e.g., `readlines` on the returned object, you'll see your download meter starting to tick. Maybe some OS buffer fills up quickly, and since you don't empty it, the kernel stops sending `ACK`s and the server stops sending data?
Thomas
But Thomas, the code you posted in your answer doesn't use readlines(). Do you believe the code above downloads the whole page? I can't find any evidence of that.
unutbu
No, as I wrote it, it doesn't download (much). But clearly, it doesn't close the connection either. If you do this a few thousand times, the kernel might get real unhappy.
Thomas
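
A minimal sketch (not from the thread) of how you could read only the status code and close the response explicitly, so repeated checks don't leave connections open; status_of is just an illustrative name:

import contextlib
import urllib2

def status_of(url):
    # Open the URL, read only the status code, and close the
    # underlying connection right away.
    with contextlib.closing(urllib2.urlopen(url)) as res:
        return res.getcode()

print status_of("http://www.google.com/index.html")  # e.g. 200
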
+2  A: 

The HTTPConnection constructor takes a host argument (with an optional port). You have to keep the connection (the host) separate from the resource (the path) you actually want.

For a simpler way to fetch web resources directly, you could go with urllib2, but its urlopen only issues GET (or POST when you pass data), not HEAD, so you end up downloading the whole resource.
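
A common workaround (a sketch, not part of this answer) is to subclass urllib2.Request and override get_method so that urlopen sends HEAD; HeadRequest here is just an illustrative name:

import urllib2

class HeadRequest(urllib2.Request):
    # Make urlopen issue HEAD instead of the default GET.
    def get_method(self):
        return "HEAD"

res = urllib2.urlopen(HeadRequest("http://www.google.com/index.html"))
print res.getcode()  # status code, without fetching the body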

Yann Schwartz
A: 

I like urllib2, sample code:

import urllib2
res = urllib2.urlopen('http://google.com/index.html')
res.getcode()  # contains the HTTP status code

If something went wrong, you'll get an exception you can catch.

EDIT: Thanks, changed res.code to res.getcode() since the latter is documented.

Johannes Weiß
The `code` field is undocumented. You should probably use `getcode()` instead.
Thomas
changed, thanks!
Johannes Weiß
+3  A: 

Alternatively, if you expect that actually downloading the data is problematic and you really need the HEAD method, you could parse the URL using urlparse:

>>> import httplib
>>> import urlparse
>>> url = "http://www.google.com/index.html"
>>> (scheme, netloc, path, params, query, fragment) = urlparse.urlparse(url)
>>> conn = httplib.HTTPConnection(netloc)
>>> conn.request("HEAD", urlparse.urlunparse(('', '', path, params, query, fragment)))
>>> res = conn.getresponse()
>>> print res.status, res.reason
302 Found

You can then wrap this in a function that takes the URL as an argument.
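
For instance, a sketch of such a wrapper (get_status is just an illustrative name):

import httplib
import urlparse

def get_status(url):
    # Split the URL into host and resource, issue a HEAD request,
    # and return the status code without downloading the body.
    scheme, netloc, path, params, query, fragment = urlparse.urlparse(url)
    conn = httplib.HTTPConnection(netloc)
    conn.request("HEAD", urlparse.urlunparse(('', '', path, params, query, fragment)))
    res = conn.getresponse()
    conn.close()
    return res.status

print get_status("http://mydomain.com/sunny/boo.avi")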

Thomas
A: 

Keep in mind that not all web servers support HEAD for every resource, so you may end up retrieving the resource anyway. You should write your code accordingly.
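
For example, one way (a hedged sketch, not the answer's own code) to fall back to GET when HEAD is rejected with 405 Method Not Allowed, still without reading the body:

import httplib
import urlparse

def get_status_with_fallback(url):
    # Try HEAD first; if the server rejects the method, retry with
    # GET but read only the status line, never the body.
    scheme, netloc, path, params, query, fragment = urlparse.urlparse(url)
    resource = urlparse.urlunparse(('', '', path, params, query, fragment))
    conn = httplib.HTTPConnection(netloc)
    conn.request("HEAD", resource)
    res = conn.getresponse()
    if res.status == 405:  # Method Not Allowed
        conn.close()
        conn = httplib.HTTPConnection(netloc)
        conn.request("GET", resource)
        res = conn.getresponse()
    status = res.status
    conn.close()
    return status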

Lawrence Oluyede