I'm playing around with writing a client for a site which provides data as an HTTP stream (a.k.a. HTTP server push). However, urllib2.urlopen() grabs the stream in its current state and then closes the connection. I tried skipping urllib2 and using httplib directly, but that seems to have the same behaviour.

The request is a POST with a set of five parameters; no cookies or authentication are required, however.

Is there a way to keep the stream open, so it can be checked on each pass through the program's main loop for new content, rather than re-downloading the whole thing every few seconds and introducing lag?
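
For reference, a stripped-down version of the call I'm making (the URL and field names here are placeholders for the real ones):

import urllib
import urllib2

# placeholder endpoint and field names -- the real request has five parameters
params = urllib.urlencode({"channel": "live", "format": "raw"})
response = urllib2.urlopen("http://example.com/stream", params)  # POST, since data is given

# this returns whatever the server has sent so far, then the connection closes
print response.read()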

A: 

Do you need to actually parse the response headers, or are you mainly interested in the content? And is your HTTP request complex, requiring you to set cookies and other headers, or will a very simple request suffice?

If you only care about the body of the HTTP response and don't have a very fancy request, you should consider simply using a socket connection:

import socket

SERVER_ADDR = ("example.com", 80)

sock = socket.create_connection(SERVER_ADDR)
f = sock.makefile("r+", bufsize=0)

f.write("GET / HTTP/1.0\r\n"
      + "Host: example.com\r\n"    # you can put other headers here too
      + "\r\n")

# skip headers
while f.readline() != "\r\n":
    pass

# keep reading forever
while True:
    line = f.readline()     # blocks until more data is available
    if not line:
        break               # we ran out of data!

    print line

sock.close()
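
Since your request is actually a POST, here's a rough sketch of the same idea with a request body (the path and field names are placeholders -- substitute your five real parameters):

import socket
import urllib

SERVER_ADDR = ("example.com", 80)

# placeholder field names -- use your real five parameters here
body = urllib.urlencode({"channel": "live", "rate": "fast"})

sock = socket.create_connection(SERVER_ADDR)
f = sock.makefile("r+", bufsize=0)

f.write("POST /stream HTTP/1.0\r\n"
      + "Host: example.com\r\n"
      + "Content-Type: application/x-www-form-urlencoded\r\n"
      + "Content-Length: %d\r\n" % len(body)
      + "\r\n"
      + body)

The header-skipping loop and the read loop are then the same as above.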
Eli Courtwright
This works for a bit (once I got the headers for a POST request right, anyway). However, after a few seconds the connection seems to terminate and I get a "</div></body></html>" from the server and no further data. Is the keep-alive connection timing out or something along those lines, and if so, how do I stop it?
Sam
@Sam: The fact that you're reading `</div></body></html>` implies to me that you're actually reaching the end of your output. Are you sure there's more? If so, then consider setting the `Connection: Keep-Alive` HTTP header: http://www.io.com/~maus/HttpKeepAlive.html
Eli Courtwright
There's definitely more, because it comes up in my web browser reading the same stream. However, looking at the page source, a piece of JavaScript runs every six seconds, changing window.location to a POST request with different parameters; specifically, it changes "rnd=0.749976718186" to a different number. I have no idea what this does, but I suspect it's related to the stream terminating early. I'll have to speak to the owner of the stream and get back to you.
Sam
Problem solved! The page I'm interfacing with requires a second connection to be refreshed every 20 seconds or so, or it kills the stream because it thinks you've disconnected. I added code to grab that URL every few seconds and bingo, everything works. Thanks!
Sam
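
For anyone hitting the same issue, that fix boils down to something like this sketch, assuming the refresh is a simple GET to a separate URL (the endpoint below is a placeholder for the site's real refresh URL):

import threading
import time
import urllib2

REFRESH_URL = "http://example.com/refresh"   # placeholder refresh endpoint

def keep_alive(interval=15):
    # hit the refresh URL a little more often than the ~20 second timeout
    while True:
        try:
            urllib2.urlopen(REFRESH_URL).read()
        except urllib2.URLError:
            pass    # one missed refresh just risks a dropped connection
        time.sleep(interval)

refresher = threading.Thread(target=keep_alive)
refresher.setDaemon(True)   # don't let this thread keep the process alive
refresher.start()

This runs alongside the blocking read loop, so the stream connection never looks idle to the server.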
A: 

One way to do it using urllib2 is (assuming this site also requires Basic Auth):

import urllib2

p_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
url = 'http://streamingsite.com'
p_mgr.add_password(None, url, 'login', 'password')

auth = urllib2.HTTPBasicAuthHandler(p_mgr)
opener = urllib2.build_opener(auth)

f = opener.open(url)

# read the stream line by line; readline() blocks until data arrives
while True:
    data = f.readline()
    if not data:
        break           # server closed the connection
    print data
rlotun
This doesn't appear to work. I dropped the auth stuff because I don't need it and just used an HTTPHandler. I also added a sleep() to the loop to stop it eating too much CPU, and print to the screen whenever data comes in. It runs through the contents of the stream as they existed when the script started, and then never gets any further data.
Sam
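
For reference, a rough sketch of the variant Sam describes (placeholder URL; as he notes, it still stalls once the initial contents are exhausted):

import time
import urllib2

opener = urllib2.build_opener(urllib2.HTTPHandler())
f = opener.open('http://streamingsite.com')   # placeholder URL

while True:
    data = f.readline()
    if data:
        print data
    time.sleep(0.5)     # avoid spinning at 100% CPU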