I have a Python web client that uses urllib2. It is easy enough to add HTTP headers to my outgoing requests. I just create a dictionary of the headers I want to add, and pass it to the Request initializer.

However, other "standard" HTTP headers get added to the request along with the custom ones I explicitly add. When I sniff the request using Wireshark, I see headers besides the ones I add myself. My question is: how do I get access to these headers? I want to log every request (including the full set of HTTP headers), and I can't figure out how.

Any pointers?

In a nutshell: how do I get all the outgoing headers from an HTTP request created by urllib2?
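
To illustrate, here is roughly what my requests look like (the URL and header values are just placeholders):

import urllib2

# Explicitly-added headers are passed to the Request initializer;
# urllib2 adds more (Host, User-agent, ...) when the request is sent.
headers = {'X-My-Header': 'some-value'}
req = urllib2.Request('http://example.com/', None, headers)
response = urllib2.urlopen(req)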

A: 

See urllib2.py: do_request_ (line 1044 (1067)) and do_open (line 1073). Also see line 293, self.addheaders = [('User-agent', client_version)] (only 'User-agent' is added there).
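
In other words, a stock opener starts out with only a User-agent header; a quick sketch to see this (the output shown is just an example for Python 2.5):

import urllib2

# A freshly-built opener carries only 'User-agent' in its addheaders;
# Host, Connection, and the rest are added when the request is sent.
opener = urllib2.build_opener()
print opener.addheaders    # e.g. [('User-agent', 'Python-urllib/2.5')]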

Mykola Kharechko
A: 

It should send the default HTTP headers (as specified by w3.org) alongside the ones you specify. You can use a tool like Wireshark if you would like to see them in their entirety.

Edit:

If you would like to log them, you can use WinPcap to capture packets sent by specific applications (in your case, Python). You can also specify the type of packets and many other details.

-John

John T
I need to log them from inside my Python program so WinPcap won't help me. thanks though.
Corey Goldberg
Yes it will, have you even read what it is or how to use it? It's used with the Wireshark program itself, which shows you analyzed output of packets and has the ability to log them.
John T
The packets contain the headers, I thought that was obvious. You could invoke/incorporate WinPcap in your application.
John T
WinPcap is Windows-only; my application runs on all platforms. It is also too much overhead. Thanks for the suggestion though.
Corey Goldberg
+4  A: 

The urllib2 library uses OpenerDirector objects to handle the actual opening of URLs. Fortunately, the Python library provides defaults so you don't have to build one yourself. It is, however, these OpenerDirector objects that add the extra headers.

To see what they are after the request has been sent (so that you can log it, for example):

req = urllib2.Request(url='http://google.com')
response = urllib2.urlopen(req)
print req.unredirected_hdrs

(produces {'Host': 'google.com', 'User-agent': 'Python-urllib/2.5'} etc)

unredirected_hdrs is where the OpenerDirector dumps its extra headers. Simply looking at req.headers will show only your own headers; the library leaves those unmolested for you.

If you need to see the headers before you send the request, you'll need to subclass the OpenerDirector in order to intercept the transmission.

Hope that helps.

EDIT: I forgot to mention that, once the request has been sent, req.header_items() will give you a list of tuples of ALL the headers, with both your own and the ones added by the OpenerDirector. I should have mentioned this first since it's the most straightforward :-) Sorry.
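
For instance, a quick sketch (again hitting google.com purely as an example):

import urllib2

req = urllib2.Request(url='http://google.com')
urllib2.urlopen(req)
# After sending, header_items() combines your own headers with the
# ones the OpenerDirector added on your behalf.
for name, value in req.header_items():
    print name, value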

EDIT 2: After your question about an example for defining your own handler, here's the sample I came up with. The concern in any monkeying with the request chain is that we need to be sure that the handler is safe for multiple requests, which is why I'm uncomfortable just replacing the definition of putheader on the HTTPConnection class directly.

Sadly, because the internals of HTTPConnection and the AbstractHTTPHandler are very much internal, we have to reproduce much of the code from the Python library to inject our custom behaviour. Assuming I've not goofed below and this works as well as it did in my 5 minutes of testing, please be careful to revisit this override whenever you update your Python version (i.e. 2.5.x to 2.5.y, or 2.5 to 2.6, etc.).

I should therefore mention that I am on Python 2.5.1. If you have 2.6 or, particularly, 3.0, you may need to adjust this accordingly.

Please let me know if this doesn't work. I'm having waaaayyyy too much fun with this question:

import urllib2
import httplib
import socket


class CustomHTTPConnection(httplib.HTTPConnection):

    def __init__(self, *args, **kwargs):
        httplib.HTTPConnection.__init__(self, *args, **kwargs)
        self.stored_headers = []

    def putheader(self, header, value):
        # Record each header as it is written, then defer to the original.
        self.stored_headers.append((header, value))
        httplib.HTTPConnection.putheader(self, header, value)


class HTTPCaptureHeaderHandler(urllib2.AbstractHTTPHandler):

    def http_open(self, req):
        return self.do_open(CustomHTTPConnection, req)

    http_request = urllib2.AbstractHTTPHandler.do_request_

    def do_open(self, http_class, req):
        # All code here lifted directly from the python library
        host = req.get_host()
        if not host:
            raise urllib2.URLError('no host given')

        h = http_class(host) # will parse host:port
        h.set_debuglevel(self._debuglevel)

        headers = dict(req.headers)
        headers.update(req.unredirected_hdrs)
        headers["Connection"] = "close"
        headers = dict(
            (name.title(), val) for name, val in headers.items())
        try:
            h.request(req.get_method(), req.get_selector(), req.data, headers)
            r = h.getresponse()
        except socket.error, err: # XXX what error?
            raise urllib2.URLError(err)
        r.recv = r.read
        fp = socket._fileobject(r, close=True)

        resp = urllib2.addinfourl(fp, r.msg, req.get_full_url())
        resp.code = r.status
        resp.msg = r.reason

        # This is the line we're adding
        req.all_sent_headers = h.stored_headers
        return resp

my_handler = HTTPCaptureHeaderHandler()
opener = urllib2.OpenerDirector()
opener.add_handler(my_handler)
req = urllib2.Request(url='http://www.google.com')

resp = opener.open(req)

print req.all_sent_headers

shows: [('Accept-Encoding', 'identity'), ('Host', 'www.google.com'), ('Connection', 'close'), ('User-Agent', 'Python-urllib/2.5')]
Jarret Hardie
This is very helpful. However, I am still not seeing *all* of the headers (like Connection: close).
Corey Goldberg
Hmmm.... would you mind posting how you're constructing the Request and how you're opening the connection please?
Jarret Hardie
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(self.cookie_jar))
request = urllib2.Request(url, None, headers)
Corey Goldberg
oops code doesn't look good in comments
Corey Goldberg
I don't think req.header_items() will include headers sent by the underlying HTTPConnection.
Justus
Justus is right. Specifically with 'Connection: Close'... the Opener has a method called 'do_open' where that gets added. It's added by a local variable in that function, which constructs a totally separate request object; that request object is thrown away at the end of the function scope
Jarret Hardie
In this case, I'm afraid you will have to write your own opener, since that default do_open function does its own socket initialization, so there's very little opportunity for you to inject anything, or observe anything via subclassing
Jarret Hardie
For inspiration, have a look at urllib2.py. You're looking for the AbstractHTTPHandler.do_open() function. Sorry I can't think of a more magic way to get that level of info
Jarret Hardie
OK... I should think before I comment :-) It's the handler, not the opener. You can use the build_opener() function you are already using, but provide your own Handler, rather than having to write an opener... this is not something I've done in some time.
Jarret Hardie
Jarret, see the JUSTUS answer, he almost nailed it, but it keeps appending new headers if I call it over and over. Any ideas to add to his solution?
Corey Goldberg
Jarret, any tip on providing my own handler that will do this? Getting over my head :)
Corey Goldberg
Happy to help... this is a good question. I've updated my answer above.
Jarret Hardie
wow.. I was hoping it would be something easy. I'll give your approach a try (holy cow!). I think your code is too much to add to my codebase though :) This feature would be a great addition to urllib2 or httplib.
Corey Goldberg
Heh heh... yeah. I suppose you could look into using 'curl' from the shell if that's appropriate for your deployment env, or doing something much lower level with sockets... you'd lose a lot of sanity checking, but it would be shorter.
Jarret Hardie
+2  A: 

How about something like this:

import urllib2
import httplib

# Monkey-patch HTTPConnection.putheader so every outgoing header is
# printed as it is written to the socket.
old_putheader = httplib.HTTPConnection.putheader
def putheader(self, header, value):
    print header, value
    old_putheader(self, header, value)
httplib.HTTPConnection.putheader = putheader

urllib2.urlopen('http://www.google.com')
Justus
This is VERY close to what I need. The only problem is when I call it in a loop, it keeps appending repeat headers.
Corey Goldberg
JUSTUS, this is so close.. can you update your answer if you have any other thoughts?
Corey Goldberg
I don't understand what you mean by "in a loop". But, given that this requires so much hackery I wonder why you need so much logging. You might be better off using an http proxy, have that do all the logging, and use urllib to talk to it.
Justus
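If the proxy route appeals, a minimal sketch of pointing urllib2 at a logging proxy (the localhost:8080 address is purely illustrative; any logging proxy would do):

import urllib2

# Route HTTP traffic through a local proxy that does the logging.
# The proxy address here is a placeholder.
proxy_handler = urllib2.ProxyHandler({'http': 'http://localhost:8080'})
opener = urllib2.build_opener(proxy_handler)
response = opener.open('http://www.google.com/')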
Well... I have a load testing tool that sends HTTP requests repeatedly. It has a logging/debug mode where I would like to log the full HTTP requests and responses, including headers.
Corey Goldberg
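One likely reason the headers repeat when the snippet above is run in a loop is that the monkey-patch gets applied again on every iteration, wrapping putheader in itself. A small sketch of a guard against that (the helper name and flag are purely illustrative):

import httplib

def install_header_logger():
    # Wrap HTTPConnection.putheader only once, even if this helper is
    # called on every iteration of a loop.
    if getattr(httplib.HTTPConnection.putheader, '_logs_headers', False):
        return
    old_putheader = httplib.HTTPConnection.putheader
    def putheader(self, header, value):
        print header, value
        old_putheader(self, header, value)
    putheader._logs_headers = True
    httplib.HTTPConnection.putheader = putheader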
+1  A: 

If you want to see the literal request text that is sent out, and therefore see every last header exactly as it is represented on the wire, then you can tell urllib to use your own version of an HTTPHandler that prints out (or saves, or whatever) the outgoing HTTP request.

import httplib, urllib2

class MyHTTPConnection(httplib.HTTPConnection):
    def send(self, s):
        print s  # or save them, or whatever!
        httplib.HTTPConnection.send(self, s)

class MyHTTPHandler(urllib2.HTTPHandler):
    def http_open(self, req):
        return self.do_open(MyHTTPConnection, req)

opener = urllib2.build_opener(MyHTTPHandler)
response = opener.open('http://www.google.com/')

The result of running this code is:

GET / HTTP/1.1
Accept-Encoding: identity
Host: www.google.com
Connection: close
User-Agent: Python-urllib/2.6
Brandon Craig Rhodes
A: 

It sounds to me like you're looking for the headers of the response object, which include Connection: close, etc. These headers live in the object returned by urlopen. Getting at them is easy enough:

from urllib2 import urlopen
req = urlopen("http://www.google.com")
print req.headers.headers

req.headers is an instance of httplib.HTTPMessage

dcolish
Nope... I was looking for request headers, not response headers.
Corey Goldberg
Ah, well then you'll either need to create your own handler for HTTP requests that dumps them, as the above examples do, or, if you are open to tweaking the stdlib, just drop a log line into AbstractHTTPHandler.do_open that dumps the headers.
dcolish