Let's start with a simple example. An HTTP data stream comes in the following format:

MESSAGE_LENGTH, 2 bytes
MESSAGE_BODY, MESSAGE_LENGTH bytes
REPEAT...

Currently, I use urllib2 to retrieve and process streaming data as below:

import struct

while True:
    length = response.read(2)                              # 2-byte length prefix
    if len(length) < 2:
        break                                               # end of stream
    data = response.read(struct.unpack('>H', length)[0])   # big-endian prefix assumed
    # DO DATA PROCESSING

It works, but since each message is only 50-100 bytes, this approach limits the buffer size on every read, which may hurt performance.

Is it possible to use separate threads for data retrieval and processing?

A: 

Yes, of course, and there are many different techniques for doing so. You'll typically end up having a pool of processes that only retrieve data, and you increase the number of processes in that pool until you run out of bandwidth, more or less. Those processes store the data somewhere, and then you have other processes or threads that pick the data up from wherever it's stored and process it.

So the answer to your question is "Yes"; your next question is going to be "How", and then the people who are really good at this stuff will want to know more. :-)

If you are doing this at a massive scale it can get very tricky: you don't want the processes to step all over each other, and there are modules in Python that help you do all this. The right way to do it depends a lot on what scale we are talking about, whether you want to run this over multiple processors or even over completely separate machines, and how much data is involved.
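
At the simple end of that spectrum, a minimal sketch of the retrieve-versus-process split using the standard threading and Queue modules might look like this (the URL queue, the pool size of four, and the processing step are illustrative placeholders, not a finished implementation):

import threading, urllib2, Queue

urls = Queue.Queue()       # work for the fetchers
results = Queue.Queue()    # raw data waiting to be processed

def fetch():
    while True:
        url = urls.get()
        if url is None:                # sentinel: no more work
            break
        results.put(urllib2.urlopen(url).read())

def process():
    while True:
        data = results.get()
        if data is None:
            break
        # ... do the actual data processing here ...

# grow the fetcher pool until you run out of bandwidth
workers = [threading.Thread(target=fetch) for i in range(4)]
workers.append(threading.Thread(target=process))
for t in workers:
    t.start()

Feeding the urls queue (plus one None sentinel per thread when you are done) is left out; the same structure carries over to multiprocessing if you need real parallelism across CPUs.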

I've only done it once, and not on a very massive scale, but I ended up having one process that got a long list of URLs to be processed, and another process that took that list and dispatched it to a set of separate fetcher processes, simply by putting files with URLs in them into separate directories that worked as "queues". Each fetcher process would look in its own queue directory, fetch the URL, and stick the result into an "outqueue" directory, where another process would dispatch those files into a further set of queue directories for the processing processes.

That worked fine, could be run over the network with NFS if necessary (although we never tried that), and could be scaled up to loads of processes on loads of machines if needed (although we never did that either).
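
A hypothetical sketch of one of those fetcher loops, with the directory names and the one-URL-per-file layout made up for illustration:

import os, time, urllib2

QUEUE_DIR = '/var/spool/fetch-queue-1'   # this fetcher's own queue directory
OUT_DIR = '/var/spool/outqueue'          # finished work gets dispatched from here

while True:
    for name in os.listdir(QUEUE_DIR):
        path = os.path.join(QUEUE_DIR, name)
        url = open(path).read().strip()               # each file holds one URL
        data = urllib2.urlopen(url).read()            # fetch it
        open(os.path.join(OUT_DIR, name), 'wb').write(data)
        os.remove(path)                               # done, take it off the queue
    time.sleep(1)                                     # poll for new work

Because the "queues" are just directories, moving them onto NFS or adding more machines only changes where the directories live.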

There may be more clever ways.

Lennart Regebro
+1  A: 

Yes, it can be done, and it is not that hard if your format is essentially fixed.

I used this approach with httplib in Python 2.2.3 and found it had some abysmal performance the way we hacked it together (basically monkey-patching a select()-based socket layer into httplib).

The trick is to get hold of the socket and do the buffering yourself, so you do not fight over buffering with the intermediate layers (it made for horrible performance when we had httplib buffering for the chunked HTTP decoding and the socket layer buffering for read()).

Then have a state machine that fetches new data from the socket when needed and pushes completed blocks into a Queue.Queue that feeds your processing threads.
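
A minimal sketch of that arrangement, assuming the 2-byte length prefix from the question, big-endian byte order, and a plain socket (the framing details and the worker body are illustrative):

import struct, threading, Queue

blocks = Queue.Queue(maxsize=100)   # completed messages, bounded so the reader can block

def reader(sock):
    buf = ''
    while True:
        chunk = sock.recv(65536)         # do the buffering ourselves, in big reads
        if not chunk:
            break                        # connection closed
        buf += chunk
        while len(buf) >= 2:             # state machine: header -> body -> repeat
            (length,) = struct.unpack('>H', buf[:2])
            if len(buf) < 2 + length:
                break                    # body incomplete, wait for more data
            blocks.put(buf[2:2 + length])
            buf = buf[2 + length:]

def worker():
    while True:
        data = blocks.get()
        # ... process one complete message ...

threading.Thread(target=worker).start()
# ... connect your socket and call reader(sock), here or in its own thread ...

The bounded Queue.Queue also gives you backpressure for free: if the workers fall behind, put() blocks and the reader stops pulling from the socket.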

I use it to transfer files, checksum them (zlib.adler32) in an extra thread, and write them to the filesystem in a third thread. It makes for about 40 MB/s sustained throughput on my local machine via sockets, with the HTTP chunked-encoding overhead included.

schlenk