I'm writing a basic HTTP proxy in Python 3, and so far I'm not using prebuilt classes like those in http.server.

I'm just opening a socket that accepts a connection:

self.listen_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
self.listen_socket.bind((socket.gethostname(), 4321))
self.listen_socket.listen(5)
(a, b) = self.listen_socket.accept()
content = a.recv(100000)

Now content stores data like:

b'GET http://www.google.com/firefox HTTP/1.1\r\nHost: www.google.com\r\nUser-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2) Gecko/20100207 Namoroka/3.6\r\nAccept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\nAccept-Language: en-us,en;q=0.5\r\nAccept-Encoding: gzip,deflate\r\nAccept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7\r\nKeep-Alive: 115\r\nProxy-Connection: keep-alive\r\nCookie: PREF=ID=1ac935f4d893f655:U=73a4849dc5fc23a4:TM=1266851688:LM=1267023171:S=Log1PmXRMlNjX3Of; NID=32=EnrZjTqILuW2_aMLtgsJ96FdEMF3s5FoMJSVq9GMr9dhLhTAd3F5RcQ3ImyVBiO2eYNKKMhzlGg7r8zXmeSq50EigS5sdKtCL9BMHpgCxZazA2NiyB0bTRWhp8-0BObn\r\n\r\n'

How can I run a regular expression against it? Converting it to a string does not work for me.

Ultimately, I need to find the address that is being requested, like http://www.google.com/firefox in this case. Is there a parser I don't know about? How can I achieve this?

Thanks in advance.

+3  A: 

You need to include an encoding when converting to a string, for example use:

>>> str(b'GET http://...', 'UTF-8')
'GET http://...'

If you don't use an encoding then, as you've discovered, you get something a little less helpful:

>>> str(b'GET http://...')
"b'GET http://...'"
Scott Griffiths
That seems to work. Can I assume 'UTF-8' as the default encoding for HTTP requests?
Enrico Carlesso
I don't think you can assume UTF-8; I think a request can indicate other charsets (I'm no HTTP expert, though).
Scott Griffiths
According to the standard, any non-ASCII characters in an HTTP header are ISO-8859-1. In practice, browsers differ. Firefox uses the low byte of the UTF-16 code unit, Opera and Chrome use UTF-8, Safari generally breaks, and IE will use the system default code page of the machine it's installed on (which will never be UTF-8). In summary, unencoded non-ASCII characters in headers are totally unreliable. You probably don't care, though, in which case you can just plump for ISO-8859-1.
bobince
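Putting the answer and comments above together, here is a minimal sketch of pulling the requested address out of the raw bytes. It assumes the first line has the usual "METHOD target HTTP/x.y" shape and decodes as ISO-8859-1, which maps every byte to a character and so never raises. The function name request_target and the shortened sample request are made up for illustration.

```python
import re

def request_target(raw: bytes) -> str:
    """Return the target (URL or path) from the first line of a raw HTTP request."""
    # ISO-8859-1 maps each byte 0-255 to one character, so decoding cannot fail.
    text = raw.decode('iso-8859-1')
    # The request line is everything before the first CRLF.
    request_line = text.split('\r\n', 1)[0]
    match = re.match(r'(\S+) (\S+) (HTTP/\d\.\d)$', request_line)
    if match is None:
        raise ValueError('not an HTTP request line: %r' % request_line)
    return match.group(2)

# Shortened version of the request from the question (hypothetical sample).
sample = b'GET http://www.google.com/firefox HTTP/1.1\r\nHost: www.google.com\r\n\r\n'
print(request_target(sample))

# Alternatively, skip decoding and match on the bytes with a bytes pattern.
bytes_match = re.match(rb'(\S+) (\S+) HTTP/\d\.\d\r\n', sample)
print(bytes_match.group(2))
```

The second variant shows that the re module accepts bytes directly as long as the pattern is also bytes (note the rb prefix); the groups then come back as bytes rather than str.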
+1  A: 

Also, you might want to check the *HTTPServer classes. They wrap the work of being an HTTP server and will also parse headers for you.

If you can't, well, at the very least they will provide source code examples on how to do it!
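As a rough sketch of what that could look like without running a full server, the request parser in http.server.BaseHTTPRequestHandler can be pointed at bytes already read from a socket. The class name ParsedRequest and the shortened sample request are hypothetical; parse_request(), rfile, raw_requestline, and send_error() are the handler's own attributes and methods, reused here outside their normal server context.

```python
from http.server import BaseHTTPRequestHandler
from io import BytesIO

class ParsedRequest(BaseHTTPRequestHandler):
    """Parse a raw HTTP request held in memory (no socket or server involved)."""

    def __init__(self, raw: bytes):
        # Deliberately skip the base __init__, which expects a live connection.
        self.rfile = BytesIO(raw)
        self.raw_requestline = self.rfile.readline()
        self.error_code = None
        self.error_message = None
        # Fills in self.command, self.path, self.request_version and self.headers.
        self.parse_request()

    def send_error(self, code, message=None, explain=None):
        # Record parse errors instead of writing a response to a client.
        self.error_code = code
        self.error_message = message

# Shortened version of the request from the question (hypothetical sample).
sample = b'GET http://www.google.com/firefox HTTP/1.1\r\nHost: www.google.com\r\n\r\n'
request = ParsedRequest(sample)
print(request.command, request.path, request.headers['Host'])
```

If the request line is malformed, error_code is set instead of raising, so a caller would want to check it before trusting path or headers.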

Daren Thomas
Yes, I've noticed it, and I have some plans to use it in the future, but for now I do not need it.
Enrico Carlesso