views: 78
answers: 1
I have a simple problem. I have to fetch a url (about once a minute), check if there is any new content, and if there is, post it to another url.

I have a working system with a cronjob every minute that basically:

count = post_count = 0  # counters for the summary message below
for link in models.Link.objects.filter(enabled=True).select_related():
    # do it in two phases in case there is cross-pollination

    # get posts
    twitter_posts, meme_posts = [], []
    if link.direction == "t2m" or link.direction == "both":
        twitter_posts = utils.get_twitter_posts(link)

    if link.direction == "m2t" or link.direction == "both":
        meme_posts = utils.get_meme_posts(link)

    # process them
    if len(twitter_posts) > 0:
        post_count += views.twitter_link(link, twitter_posts)

    if len(meme_posts) > 0:
        post_count += views.meme_link(link, meme_posts)

    count += 1

msg = "%s links crawled and %s posts updated" % (count, post_count)

This works great for the 150 users I have now, but the synchronous nature of it scares me. I have URL timeouts built in, but at some point my cronjob will take more than a minute, and I'll be left with a million of them running, overwriting each other.
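
(As an aside, one common stop-gap for exactly this overlapping-cron worry is a non-blocking lockfile at the top of the script. A minimal sketch, assuming a POSIX system; the lock path is made up for illustration:)

import fcntl
import sys

# Sketch: refuse to start if a previous cron run still holds the lock.
# /tmp/crawler.lock is an arbitrary path chosen for illustration.
lock_file = open("/tmp/crawler.lock", "w")
try:
    fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
except IOError:
    sys.exit("previous run still in progress, skipping this minute")

# ... run the crawl loop above, then exit (the lock is released on exit)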

So, how should I rewrite it?

Some issues:

  • I don't want to hit the APIs too hard in case they block me. So I'd like to have at most 5 open connections to any API at any time.
  • Users keep registering in the system as this runs, so I need some way to add them.
  • I'd like this to scale as well as possible
  • I'd like to reuse as much existing code as I can

So, some thoughts I've had:

  • Spawn a thread for each link
  • Use python-twisted - keep one long-running process that the cronjob just makes sure is alive.
  • Use Stackless - I don't really know much about it.
  • Ask StackOverflow :)

How would you do this?

+3  A: 

Simplest: use a long-running process with sched (on its own thread) to handle the scheduling -- by posting requests to a Queue; have a fixed-size pool of threads (you can find a pre-made thread pool here, but it's easy to tweak it or roll your own) taking requests from the Queue (and returning results via a separate Queue). Registration and other system functions can be handled by a few more dedicated threads, if need be.
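
(A rough sketch of that shape, using present-day module names (`queue` here is Python 3's name for Python 2's Queue module); fetch_and_post, load_enabled_links, POLL_INTERVAL and POOL_SIZE are placeholders standing in for the question's code, not anything prescribed by the answer:)

# Sketch only: sched handles the scheduling, workers pull from a Queue.
import sched
import threading
import time
import queue

POLL_INTERVAL = 60      # seconds between crawls (assumed from the question)
POOL_SIZE = 5           # fixed-size worker pool

jobs = queue.Queue()
results = queue.Queue()

def load_enabled_links():
    # placeholder for models.Link.objects.filter(enabled=True)
    return []

def fetch_and_post(link):
    # placeholder for get_twitter_posts/get_meme_posts plus the posting step
    return 0

def worker():
    while True:
        link = jobs.get()
        try:
            results.put(fetch_and_post(link))
        finally:
            jobs.task_done()

def enqueue_all(scheduler):
    # re-query each cycle so newly registered users are picked up
    for link in load_enabled_links():
        jobs.put(link)
    scheduler.enter(POLL_INTERVAL, 1, enqueue_all, (scheduler,))

for _ in range(POOL_SIZE):
    threading.Thread(target=worker, daemon=True).start()

scheduler = sched.scheduler(time.time, time.sleep)
scheduler.enter(0, 1, enqueue_all, (scheduler,))
threading.Thread(target=scheduler.run, daemon=True).start()

# the main thread can now drain the results Queue, handle registration, etc.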

Threads aren't so bad, as long as (a) you never have to worry about synchronization among them (just have them communicate by intrinsically thread-safe Queue instances, never sharing access to any structure or subsystem that isn't strictly read-only), and (b) you never have too many (use a few dedicated threads for specialized functions, including scheduling, and a small thread-pool for general work -- never spawn a thread per request or anything like that, that will explode).

Twisted can be more scalable (at lower hardware cost), but if you hinge your architecture on threading (and Queues) you have a built-in way to grow the system (onto more hardware) by switching to the very similar multiprocessing module instead... almost a drop-in replacement, and a potential scaling-up of orders of magnitude!-)
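
(For illustration only, the process-based variant of the sketch above mostly changes the imports; worker functions must live at module level so they can be pickled:)

# Sketch: same Queue/worker shape as above, but with processes.
from multiprocessing import Process, Queue

def fetch_and_post(link):
    return 0    # placeholder, as in the threaded sketch

def worker(jobs, results):
    while True:
        link = jobs.get()
        results.put(fetch_and_post(link))

if __name__ == "__main__":
    jobs, results = Queue(), Queue()
    pool = [Process(target=worker, args=(jobs, results), daemon=True)
            for _ in range(5)]
    for p in pool:
        p.start()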

Alex Martelli
Ok, thank you. Having done it, it still looks like my biggest blocker is IO connections (95% of my time is in the `read()` of a socket, and 4% is in `connect()`). Would some sort of persistent connection help? If so, any implementation recommendations?
Paul Tarjan
Even if you were connecting to just one host, the best you could get with connection pools is a 4% speedup -- hardly worth the complexity (and you need to connect to many hosts anyway, right?). Just put enough threads in the pool to have enough simultaneous operations (rather than an arbitrary limit of 5, determine it empirically, start low but increase gradually if the numbers tell you that you need to). Tasks that spend that much time waiting for I/O from many sources are the most suitable ones for threading (though Twisted's even better, it's not by much!-).
Alex Martelli
Good point. It is only 2 different sources sadly, so I fear I'll get blocked by them if I add too many threads. Looks like I'm nearing my performance limitations...
Paul Tarjan
is `queue` better than `collections.deque` for this? And if my threads write state to the database (getting a new oauth token, etc) is that bad? Should I queue that operation (major refactor)?
Paul Tarjan
collections.deque is not thread-safe: Queue.Queue is a thread-safe wrapper on it. Some DB APIs are threadsafe (typically when each thread holds a separate connection or in some cases even just a separate cursor) -- you'll need to check the specific one you want to use. But putting the DB behind a dedicated thread with its own Queue is anything *but* a major refactor -- though large enough to warrant a separate question;-). But if you think THAT is major, then Twisted or anything of that ilk is **Right Out**!-). Stackless can't help much w/I-O bound tasks, btw.
Alex Martelli
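
(A sketch of that dedicated DB thread; db_writes and save_token are made-up names, and writes are assumed to be expressible as (callable, args) pairs:)

# Sketch: all DB writes funnel through one thread via its own Queue,
# so worker threads never touch the DB API directly.
import threading
import queue

db_writes = queue.Queue()

def db_writer():
    while True:
        func, args = db_writes.get()
        try:
            func(*args)
        finally:
            db_writes.task_done()

threading.Thread(target=db_writer, daemon=True).start()

# from a worker thread, instead of calling the DB directly:
# db_writes.put((save_token, (user, new_token)))
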
You've sure covered the gamut of technologies for me. Thanks! Here's a nice green checkmark.
Paul Tarjan
@Paul, tx, and always happy to be of assistance! Looking at your cmt 2 ago, with only 2 hosts it might indeed help (a little;-) to have two separate pools (each thread holding a persistent connection) of 5 threads each (for throttling), and correspondingly two separate Queues -- so at least one host being very slow does not slow down the other, while ensuring <= 5 requests "in flight" for any host at a given time.
Alex Martelli
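
(A sketch of that two-pool layout; make_twitter_session, make_meme_session and handle are invented placeholders for whatever holds the persistent connection and does the per-link work:)

# Sketch: one Queue and one 5-thread pool per host, so a slow host
# can't stall the other, and each host sees at most 5 requests in flight.
import threading
import queue

def make_twitter_session():
    return None    # placeholder: would open/hold a connection to host 1

def make_meme_session():
    return None    # placeholder: would open/hold a connection to host 2

def handle(session, link):
    pass           # placeholder for the actual fetch-and-post work

def make_worker(job_queue, session_factory):
    def worker():
        session = session_factory()        # persistent connection, one per thread
        while True:
            link = job_queue.get()
            try:
                handle(session, link)
            finally:
                job_queue.task_done()
    return worker

twitter_jobs, meme_jobs = queue.Queue(), queue.Queue()
for job_queue, factory in ((twitter_jobs, make_twitter_session),
                           (meme_jobs, make_meme_session)):
    for _ in range(5):
        threading.Thread(target=make_worker(job_queue, factory),
                         daemon=True).start()
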
@Alex Martelli - I always learn something cool reading your answers. I'd never seen sched before. Thanks!
John Paulett
@John, you're welcome! sched is a cool module, especially useful for simulations (since you can prime it with "fake" versions of time.time and time.sleep) but nice for real scheduling too (often in its own thread since, per se, it doesn't do threading;-).
Alex Martelli
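
(For the curious, "priming" sched just means passing your own clock and delay functions; a tiny sketch:)

# Sketch: sched takes timefunc and delayfunc, so a simulated clock can
# stand in for time.time / time.sleep during tests or simulations.
import sched

fake_now = [0.0]

def fake_time():
    return fake_now[0]

def fake_sleep(seconds):
    fake_now[0] += seconds      # jump the simulated clock forward

def tick():
    print("one simulated minute later")

sim = sched.scheduler(fake_time, fake_sleep)
sim.enter(60, 1, tick)
sim.run()                       # returns immediately in real time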