I have a simple problem: I have to fetch a URL about once a minute, check whether there is any new content, and if there is, post it to another URL.
I have a working system: a cronjob runs every minute and basically does:
count, post_count = 0, 0
for link in models.Link.objects.filter(enabled=True).select_related():
    # do it in two phases in case there is cross-pollination
    # get posts
    twitter_posts, meme_posts = [], []
    if link.direction in ("t2m", "both"):
        twitter_posts = utils.get_twitter_posts(link)
    if link.direction in ("m2t", "both"):
        meme_posts = utils.get_meme_posts(link)
    # process them
    if twitter_posts:
        post_count += views.twitter_link(link, twitter_posts)
    if meme_posts:
        post_count += views.meme_link(link, meme_posts)
    count += 1
msg = "%s links crawled and %s posts updated" % (count, post_count)
This works great for the 150 users I have now, but the synchronous design scares me. I have URL timeouts built in, but at some point the cronjob will take more than a minute to finish, and I'll be left with many copies of it running at once, overwriting each other.
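One cheap way to stop overlapping cron runs, independent of how the crawl itself is written, is a non-blocking file lock: if a previous run still holds the lock, the new run just exits. A minimal sketch, assuming a POSIX system (`fcntl` is Unix-only) and a made-up lock path:

```python
# Sketch: prevent overlapping cron runs with a non-blocking file lock.
# The lock path is a hypothetical choice, not from the original code.
import fcntl

LOCK_PATH = "/tmp/crawler.lock"  # assumed location

def acquire_lock():
    """Return an open lock-file handle, or None if another run holds the lock."""
    handle = open(LOCK_PATH, "w")
    try:
        # LOCK_NB makes this fail immediately instead of waiting
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle
```

The cronjob would call `acquire_lock()` first and exit immediately if it returns `None`; the lock is released automatically when the process exits, so a crashed run can't wedge the system.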
So, how should I rewrite it?
Some issues:
- I don't want to hit the APIs too hard in case they block me, so I'd like to have at most 5 open connections to any API at any time.
- Users keep registering in the system as this runs, so I need some way to add them
- I'd like this to scale as well as possible
- I'd like to reuse as much existing code as I can
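The "at most 5 connections" constraint maps naturally onto a bounded thread pool: the pool size caps concurrency, and each worker reuses the existing per-link fetch code. A rough sketch, where `fetch_link`, and the links passed in, are placeholders standing in for the real `get_twitter_posts`/`get_meme_posts` calls:

```python
# Sketch: cap concurrent API fetches at 5 using a bounded thread pool.
# fetch_link is a hypothetical stand-in for the real per-link fetch code.
from concurrent.futures import ThreadPoolExecutor

def fetch_link(link):
    # real code would call utils.get_twitter_posts / get_meme_posts here
    return "posts for %s" % link

def crawl(links, max_connections=5):
    # at most max_connections fetches run at once; map preserves input order
    with ThreadPoolExecutor(max_workers=max_connections) as pool:
        return list(pool.map(fetch_link, links))
```

To cap connections per API rather than overall, you could instead run one such pool for each API.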
So, some thoughts I've had:
- Spawn a thread for each link
- Use python-twisted: keep one long-running process, which the cronjob just makes sure is still running.
- Use Stackless Python: I don't really know much about it.
- Ask StackOverflow :)
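The "one long-running process" idea also handles the new-registrations issue for free, as long as the loop re-queries the link list on every pass instead of caching it. A schematic sketch, with `get_enabled_links` and `process_link` as placeholders for the real ORM query and per-link work (the `max_passes` parameter exists only to make the loop finite for testing):

```python
# Sketch of a single long-running crawler process: re-query links each
# pass so newly registered users are picked up, then sleep out the
# remainder of the interval. All function names here are hypothetical.
import time

def run_forever(get_enabled_links, process_link, interval=60, max_passes=None):
    passes = 0
    while max_passes is None or passes < max_passes:
        start = time.time()
        for link in get_enabled_links():  # fresh query: picks up new users
            process_link(link)
        passes += 1
        # sleep for whatever is left of the interval, if anything
        remaining = interval - (time.time() - start)
        if remaining > 0 and (max_passes is None or passes < max_passes):
            time.sleep(remaining)
```

The cronjob's only remaining job would be to check this process is alive and restart it if not, which is exactly the supervision role described in the python-twisted bullet above.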
How would you do this?