I have a client that connects to an HTTP stream and logs the text data it consumes.

I send the streaming server an HTTP GET request... The server replies and continuously publishes data... It will either publish text or send a ping (text) message regularly... and will never close the connection.

I need to read and log the data it consumes in a non-blocking manner.

I am doing something like this:

import urllib2

req = urllib2.urlopen(url)
for dat in req:
    with open('out.txt', 'a') as f:
        f.write(dat)

My questions are:

1. Will this ever block when the stream is continuous?
2. How much data is read in each chunk, and can it be specified/tuned?
3. Is this the best way to read and log an HTTP stream?

+2  A: 

Hey, that's three questions in one! ;-)

It could block sometimes - even if your server is generating data quite quickly, network bottlenecks could in theory cause your reads to block.

Reading the URL data using "for dat in req" will mean reading a line at a time - not really useful if you're reading binary data such as an image. You get better control if you use

chunk = req.read(size)

which can of course block.
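For example, here's a rough sketch of a chunked read loop (assuming url is defined as in the question; the chunk size and file name are arbitrary choices):

import urllib2

CHUNK = 4096    # tune to taste

req = urllib2.urlopen(url)
with open('out.txt', 'a') as f:
    while True:
        chunk = req.read(CHUNK)   # may block until CHUNK bytes arrive or the stream ends
        if not chunk:             # empty string signals end of stream
            break
        f.write(chunk)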

Whether it's the best way depends on specifics not available in your question. For example, if you need to run with no blocking calls whatsoever, you'll need to consider a framework like Twisted (a rough sketch follows after the threading example). If you don't want blocking to hold you up and don't want to use Twisted (which is a whole new paradigm compared to the blocking way of doing things), then you can spin up a thread to do the reading and writing to file, while your main thread goes on its merry way:

import threading

def func(req):
    # read from the URL stream and write to the file here
    pass

...

t = threading.Thread(target=func, args=(req,))  # pass the stream to the worker
t.start() # will execute func in a separate thread
...
t.join() # will wait for spawned thread to die

Obviously, I've omitted error checking/exception handling etc. but hopefully it's enough to give you the picture.
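For completeness, here's one possible shape of the Twisted route, using its Agent API; the URL, file name and class name are placeholders of mine, and this is a sketch rather than production code:

from twisted.internet import reactor
from twisted.internet.protocol import Protocol
from twisted.web.client import Agent

class StreamLogger(Protocol):
    # appends each chunk of the response body to the log file as it arrives
    def __init__(self):
        self.f = open('out.txt', 'a')

    def dataReceived(self, data):
        self.f.write(data)

    def connectionLost(self, reason):
        self.f.close()

def on_response(response):
    # hand the streaming body to our protocol; the reactor never blocks
    response.deliverBody(StreamLogger())

agent = Agent(reactor)
d = agent.request('GET', 'http://example.com/stream')
d.addCallback(on_response)
reactor.run()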

Vinay Sajip
+1  A: 

Yes - when you catch up with the server, it will block until the server produces more data.

Each dat will be one line, including the newline on the end.

Twisted is a good option.

I would swap the with and for around in your example - do you really want to open and close the file for every line that arrives?
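That is, something along these lines (just the question's own snippet with the two statements reordered):

with open('out.txt', 'a') as f:    # open the log file once
    for dat in req:                # then append each line as it arrives
        f.write(dat)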

gnibbler
The for/with order was intentional: it opens and closes the file handle with each write. Not efficient for a busy stream, but in my case the stream is mostly blocked/waiting and only occasionally receives data to log.
Corey Goldberg
+1  A: 

You're using too high-level an interface to have good control over such issues as blocking and buffering block sizes. If you're not willing to go all the way to an async interface (in which case Twisted, already suggested, is hard to beat!), why not httplib, which is after all in the standard library? An HTTPResponse instance's .read(amount) method is more likely to block for no longer than needed to read amount bytes than the similar method on the object returned by urlopen (although admittedly there are no documented specs about that in either module, hmmm...).
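For example, a rough sketch of a chunked read loop with httplib; the host, path and chunk size here are placeholders:

import httplib

conn = httplib.HTTPConnection('example.com')
conn.request('GET', '/stream')
resp = conn.getresponse()

with open('out.txt', 'a') as f:
    while True:
        chunk = resp.read(1024)   # blocks only while reading up to 1024 bytes
        if not chunk:             # empty string: the server closed the stream
            break
        f.write(chunk)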

Alex Martelli
+1  A: 

Another option is to use the socket module directly: establish a connection, send the HTTP request, set the socket to non-blocking mode, and then read the data with socket.recv(), handling 'Resource temporarily unavailable' errors (which simply mean there is nothing to read yet). A very rough example:

import errno, socket, time

BUFSIZE = 1024

s = socket.socket()
s.connect(('localhost', 1234))
s.send('GET /path HTTP/1.0\r\n\r\n')    # HTTP requires CRLF line endings
s.setblocking(False)

running = True

while running:
    try:
        print "Attempting to read from socket..."
        while True:
            data = s.recv(BUFSIZE)
            if len(data) == 0:      # remote end closed
                print "Remote end closed"
                running = False
                break
            print "Received %d bytes: %r" % (len(data), data)
    except socket.error, e:
        if e.args[0] != errno.EAGAIN:      # anything but 'Resource temporarily unavailable'
            print e
            raise

    # perform other program tasks
    print "Sleeping..."
    time.sleep(1)

However, urllib.urlopen() has some benefits if the web server redirects, if you need URL-based basic authentication, and so on. You could also make use of the select module, which will tell you when there is data to read.
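A rough sketch of that approach, reusing s and BUFSIZE from the example above (the one-second timeout is an arbitrary choice):

import select

while True:
    # wait up to 1 second for the socket to become readable
    readable, _, _ = select.select([s], [], [], 1.0)
    if readable:
        data = s.recv(BUFSIZE)
        if not data:               # remote end closed the connection
            break
        print "Received %d bytes" % len(data)
    else:
        # timed out with nothing to read; do other program tasks here
        print "No data yet..."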

mhawke