I'm playing around with writing a client for a site which provides data as an HTTP stream (a.k.a. HTTP server push). However, urllib2.urlopen() grabs the stream in its current state and then closes the connection. I tried skipping urllib2 and using httplib directly, but that seems to have the same behaviour.

The request is a POST with a set of five parameters; no cookies or authentication are required, however.

Is there a way to keep the stream open, so it can be checked on each pass through the program's main loop for new content, rather than re-downloading the whole thing every few seconds and introducing lag?
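
For reference, a stripped-down version of the call I'm making (the URL and field names here are placeholders for the real ones):

import urllib
import urllib2

# placeholder endpoint and field names -- the real request has five parameters
params = urllib.urlencode({"channel": "live", "format": "raw"})
response = urllib2.urlopen("http://example.com/stream", params)  # POST, since data is given

# this returns whatever the server has sent so far, then the connection closes
print response.read()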

A: 

Do you need to actually parse the response headers, or are you mainly interested in the content? And is your HTTP request complex, requiring you to set cookies and other headers, or will a very simple request suffice?

If you only care about the body of the HTTP response and don't have a very fancy request, you should consider simply using a socket connection:

import socket

SERVER_ADDR = ("example.com", 80)

sock = socket.create_connection(SERVER_ADDR)
f = sock.makefile("r+", bufsize=0)

f.write("GET / HTTP/1.0\r\n"
      + "Host: example.com\r\n"    # you can put other headers here too
      + "\r\n")

# skip headers
while f.readline() != "\r\n":
    pass

# keep reading forever
while True:
    line = f.readline()     # blocks until more data is available
    if not line:
        break               # we ran out of data!

    print line

sock.close()
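
Since your request is actually a POST, here's a rough sketch of the same idea with a request body (the path and field names are placeholders -- substitute your five real parameters):

import socket
import urllib

SERVER_ADDR = ("example.com", 80)

# placeholder field names -- use your real five parameters here
body = urllib.urlencode({"channel": "live", "rate": "fast"})

sock = socket.create_connection(SERVER_ADDR)
f = sock.makefile("r+", bufsize=0)

f.write("POST /stream HTTP/1.0\r\n"
      + "Host: example.com\r\n"
      + "Content-Type: application/x-www-form-urlencoded\r\n"
      + "Content-Length: %d\r\n" % len(body)
      + "\r\n"
      + body)

The header-skipping loop and the read loop are then the same as above.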
Eli Courtwright
This works for a bit (once I got the headers for a POST request right, anyway). However, after a few seconds the connection seems to terminate and I get a "</div></body></html>" from the server and no further data. Is the keep-alive connection timing out or something along those lines, and if so, how do I stop it?
Sam
@Sam: The fact that you're reading `</div></body></html>` implies to me that you're actually reaching the end of your output. Are you sure there's more? If so, then consider setting the `Connection: Keep-Alive` HTTP header: http://www.io.com/~maus/HttpKeepAlive.html
Eli Courtwright
There's definitely more, because it comes up in my web browser reading the same stream. However, looking at the page source, a piece of JavaScript runs every six seconds, changing window.location to a POST request with different parameters; specifically, it changes "rnd=0.749976718186" to a different number. I have no idea what this does, but I suspect it's related to the stream terminating early. I'll have to speak to the owner of the stream and get back to you.
Sam
Problem solved! The page I'm interfacing with requires a second connection to be refreshed every 20 seconds or so, or it kills the stream because it thinks you've disconnected. I added code to grab that URL every few seconds and bingo, everything works. Thanks!
Sam
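
For anyone hitting the same issue, that fix boils down to something like this sketch, assuming the refresh is a simple GET to a separate URL (the endpoint below is a placeholder for the site's real refresh URL):

import threading
import time
import urllib2

REFRESH_URL = "http://example.com/refresh"   # placeholder refresh endpoint

def keep_alive(interval=15):
    # hit the refresh URL a little more often than the ~20 second timeout
    while True:
        try:
            urllib2.urlopen(REFRESH_URL).read()
        except urllib2.URLError:
            pass    # one missed refresh just risks a dropped connection
        time.sleep(interval)

refresher = threading.Thread(target=keep_alive)
refresher.setDaemon(True)   # don't let this thread keep the process alive
refresher.start()

This runs alongside the blocking read loop, so the stream connection never looks idle to the server.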
A: 

One way to do it using urllib2 is (assuming this site also requires Basic Auth):

import urllib2

p_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
url = 'http://streamingsite.com'
p_mgr.add_password(None, url, 'login', 'password')

auth = urllib2.HTTPBasicAuthHandler(p_mgr)
opener = urllib2.build_opener(auth)

f = opener.open(url)

# read the stream line by line; readline() blocks until data arrives
while True:
    data = f.readline()
    if not data:
        break           # server closed the connection
    print data
rlotun
This doesn't appear to work. I dropped the auth stuff because I don't need it and just used an HTTPHandler. I also added a sleep() to the loop to stop it eating too much CPU, and print to the screen whenever data comes in. It runs through the contents of the stream as they existed when the script started, and then never gets any further data.
Sam
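
For reference, a rough sketch of the variant Sam describes (placeholder URL; as he notes, it still stalls once the initial contents are exhausted):

import time
import urllib2

opener = urllib2.build_opener(urllib2.HTTPHandler())
f = opener.open('http://streamingsite.com')   # placeholder URL

while True:
    data = f.readline()
    if data:
        print data
    time.sleep(0.5)     # avoid spinning at 100% CPU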