Hi Everyone,

I am receiving tweets at an extremely fast rate from a long-lived connection to the Twitter Streaming API. I then do some heavy text processing and save the tweets to my database.

I am using PyCurl for the connection, with a callback function that takes care of the text processing and of saving to the database. My approach, which is not working properly, is below.

I am not familiar with network programming, so I would like to know: how can I use threads, Queue, or the Twisted framework to solve this problem?

import pycurl

def process_tweet(data):
    # pycurl hands each raw chunk of the response to the
    # WRITEFUNCTION callback, so it must accept an argument
    # do some heavy text processing here and save to the db
    pass


def open_stream_connection():
    connect = pycurl.Curl()
    connect.setopt(pycurl.URL, STREAMURL)
    connect.setopt(pycurl.WRITEFUNCTION, process_tweet)
    connect.setopt(pycurl.USERPWD, "%s:%s" % (TWITTER_USER, TWITTER_PASS))
    connect.perform()
A: 

I suggest this organization:

  • one process reads Twitter and stuffs tweets into the database
  • one or more processes read the database, process each tweet, and insert the results into a new database; the original tweets are either deleted or marked processed

That is, you have two or more processes/threads. The tweet database can be seen as a queue of work: multiple worker processes take jobs (tweets) off the queue and create data in the second database.
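
A minimal sketch of the database-as-queue idea, using sqlite3 for illustration (the table layout and function names are assumptions of mine, not from the answer):

import sqlite3

db = sqlite3.connect("tweets.db")          # hypothetical filename
db.execute("CREATE TABLE IF NOT EXISTS tweets "
           "(id INTEGER PRIMARY KEY, body TEXT, processed INTEGER DEFAULT 0)")

# reader process: stuff each raw tweet into the work-queue table
def enqueue(tweet_body):
    db.execute("INSERT INTO tweets (body) VALUES (?)", (tweet_body,))
    db.commit()

# worker process: take one unprocessed tweet, process it, mark it done
def take_job():
    row = db.execute("SELECT id, body FROM tweets "
                     "WHERE processed = 0 LIMIT 1").fetchone()
    if row is None:
        return False                       # queue is empty
    tweet_id, body = row
    # ... heavy text processing on body, insert result into second database ...
    db.execute("UPDATE tweets SET processed = 1 WHERE id = ?", (tweet_id,))
    db.commit()
    return True

With several workers you would also need to claim rows atomically (e.g. an UPDATE that marks a row as in-progress before processing it), so that two workers don't grab the same tweet.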

shavenwarthog
A database seems like overkill as a temporary receptacle.
Oddthinking
A: 

Here's a simple setup if you are OK with using a single machine.

1 thread accepts connections. After a connection is accepted, it is passed to another thread for processing.

You can, of course, use processes (e.g., using multiprocessing) instead of threads, but I'm not familiar enough with multiprocessing to give advice on that. The setup would be the same: 1 process accepts connections, then passes them to subprocesses.

If you need to shard the processing across multiple machines, then the simple thing to do would be to stuff the message into the database and then notify the workers about the new record (this will require some sort of coordination/locking between the workers). If you want to avoid hitting the database, then you'll have to pipe messages from your network process to the workers (and I'm not well versed enough in low-level networking to tell you how to do that :))
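
A minimal sketch of the single-machine layout with the standard library (the handler body and the host/port are placeholders of mine):

import socket
import threading

def handle_connection(conn, addr):
    # heavy processing happens here, off the accepting thread
    try:
        data = conn.recv(4096)
        # ... process data ...
    finally:
        conn.close()

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("0.0.0.0", 8000))             # hypothetical host/port
server.listen(5)

while True:
    # the 1 accepting thread ...
    conn, addr = server.accept()
    # ... hands each accepted connection to a new worker thread
    t = threading.Thread(target=handle_connection, args=(conn, addr))
    t.setDaemon(True)
    t.start()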

Richard Levasseur
A: 

You should have a number of threads receiving the messages as they come in. That number should probably be 1 if you are using pycurl, but higher if you are using httplib - the idea being that you want more than one query to the Twitter API in flight at a time, so there is a steady supply of work to process.

When each tweet arrives, it is pushed onto a Queue.Queue. The Queue makes the hand-off thread-safe: each tweet will be handled by exactly one worker thread.

A pool of worker threads is responsible for reading from the Queue and dealing with each tweet. Only the interesting tweets should be added to the database.

As the database is probably the bottleneck, there is a limit to the number of threads worth adding to the pool - more threads won't make processing faster; they will just mean more threads are waiting to access the database.

This is a fairly common Python idiom; a sketch is below. The architecture scales only to a certain degree - i.e. to what one machine can process.
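
A minimal sketch of the idiom (Python 2 module names, matching the Queue.Queue reference above; is_interesting and save_to_db are hypothetical helpers):

import threading
import Queue                                # the 'queue' module in Python 3

tweet_queue = Queue.Queue()

def worker():
    while True:
        tweet = tweet_queue.get()           # blocks until a tweet is available
        if is_interesting(tweet):           # hypothetical filter
            save_to_db(tweet)               # hypothetical insert; likely the bottleneck
        tweet_queue.task_done()

# a small pool: threads beyond what the database can service just wait on it
for _ in range(4):
    t = threading.Thread(target=worker)
    t.setDaemon(True)
    t.start()

# the receiving thread(s) then simply call:
#     tweet_queue.put(tweet)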

Oddthinking