Hi Everyone,

I am receiving tweets at an extremely fast rate from a long-lived connection to the Twitter Streaming API. I then do some heavy text processing and save the tweets to my database.

I am using PyCurl for the connection, with a callback function that takes care of the text processing and of saving to the database. My approach, which is not working properly, is below.

I am not familiar with network programming, so I would like to know: how can I use threads, Queue, or the Twisted framework to solve this problem?

import pycurl

def process_tweet(data):
    # pycurl hands each raw chunk of the response to the
    # WRITEFUNCTION callback, so it must accept an argument
    # do some heavy text processing here and save to the db
    pass


def open_stream_connection():
    connect = pycurl.Curl()
    connect.setopt(pycurl.URL, STREAMURL)
    connect.setopt(pycurl.WRITEFUNCTION, process_tweet)
    connect.setopt(pycurl.USERPWD, "%s:%s" % (TWITTER_USER, TWITTER_PASS))
    connect.perform()
A: 

I suggest this organization:

  • one process reads Twitter and stuffs tweets into the database
  • one or more processes read the database, process each tweet, and insert the results into a new database; the original tweets are either deleted or marked processed

That is, you have two or more processes/threads. The tweet database can be seen as a queue of work: multiple worker processes take jobs (tweets) off the queue and create data in the second database.
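
A minimal sketch of the database-as-queue idea, using sqlite3 for illustration (the table layout and function names are assumptions of mine, not from the answer):

import sqlite3

db = sqlite3.connect("tweets.db")          # hypothetical filename
db.execute("CREATE TABLE IF NOT EXISTS tweets "
           "(id INTEGER PRIMARY KEY, body TEXT, processed INTEGER DEFAULT 0)")

# reader process: stuff each raw tweet into the work-queue table
def enqueue(tweet_body):
    db.execute("INSERT INTO tweets (body) VALUES (?)", (tweet_body,))
    db.commit()

# worker process: take one unprocessed tweet, process it, mark it done
def take_job():
    row = db.execute("SELECT id, body FROM tweets "
                     "WHERE processed = 0 LIMIT 1").fetchone()
    if row is None:
        return False                       # queue is empty
    tweet_id, body = row
    # ... heavy text processing on body, insert result into second database ...
    db.execute("UPDATE tweets SET processed = 1 WHERE id = ?", (tweet_id,))
    db.commit()
    return True

With several workers you would also need to claim rows atomically (e.g. an UPDATE that marks a row as in-progress before processing it), so that two workers don't grab the same tweet.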

shavenwarthog
A database seems like overkill as a temporary receptacle.
Oddthinking
A: 

Here's a simple setup if you are OK with using a single machine.

1 thread accepts connections. After a connection is accepted, it is passed to another thread for processing.

You can, of course, use processes (e.g., using multiprocessing) instead of threads, but I'm not familiar enough with multiprocessing to give advice on that. The setup would be the same: 1 process accepts connections, then passes them to subprocesses.

If you need to shard the processing across multiple machines, then the simple thing to do would be to stuff the message into the database and then notify the workers about the new record (this will require some sort of coordination/locking between the workers). If you want to avoid hitting the database, then you'll have to pipe messages from your network process to the workers (and I'm not well versed enough in low-level networking to tell you how to do that :))
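
A minimal sketch of the single-machine layout with the standard library (the handler body and the host/port are placeholders of mine):

import socket
import threading

def handle_connection(conn, addr):
    # heavy processing happens here, off the accepting thread
    try:
        data = conn.recv(4096)
        # ... process data ...
    finally:
        conn.close()

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("0.0.0.0", 8000))             # hypothetical host/port
server.listen(5)

while True:
    # the 1 accepting thread ...
    conn, addr = server.accept()
    # ... hands each accepted connection to a new worker thread
    t = threading.Thread(target=handle_connection, args=(conn, addr))
    t.setDaemon(True)
    t.start()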

Richard Levasseur
A: 

You should have a number of threads receiving the messages as they come in. That number should probably be 1 if you are using pycurl, but higher if you are using httplib - the idea being that you want more than one query to the Twitter API in flight at a time, so there is a steady supply of work to process.

When each tweet arrives, it is pushed onto a Queue.Queue. The Queue makes the hand-off thread-safe: each tweet will be handled by exactly one worker thread.

A pool of worker threads is responsible for reading from the Queue and dealing with each tweet. Only the interesting tweets should be added to the database.

As the database is probably the bottleneck, there is a limit to the number of threads worth adding to the pool - more threads won't make processing faster; they will just mean more threads are waiting to access the database.

This is a fairly common Python idiom; a sketch is below. The architecture scales only to a certain degree - i.e. to what one machine can process.
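
A minimal sketch of the idiom (Python 2 module names, matching the Queue.Queue reference above; is_interesting and save_to_db are hypothetical helpers):

import threading
import Queue                                # the 'queue' module in Python 3

tweet_queue = Queue.Queue()

def worker():
    while True:
        tweet = tweet_queue.get()           # blocks until a tweet is available
        if is_interesting(tweet):           # hypothetical filter
            save_to_db(tweet)               # hypothetical insert; likely the bottleneck
        tweet_queue.task_done()

# a small pool: threads beyond what the database can service just wait on it
for _ in range(4):
    t = threading.Thread(target=worker)
    t.setDaemon(True)
    t.start()

# the receiving thread(s) then simply call:
#     tweet_queue.put(tweet)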

Oddthinking