If I wanted to distribute Python work across multiple processors on multiple computers, what would my best approach be? If I have 3 eight-core servers, that would mean running 24 Python processes. I would be using the multiprocessing library, and to share objects it looks like the best idea would be to use a manager. I want all the nodes to work together as one big process, so one manager would be ideal, yet that would give me a single point of failure. Is there a better solution? Would replicating a manager's object store be a good idea?
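For concreteness, here's a minimal sketch of the manager setup I have in mind, using multiprocessing.managers.BaseManager to expose a shared queue to workers on the other machines (the port and auth key are just placeholders):

    from multiprocessing.managers import BaseManager
    from queue import Queue

    class JobManager(BaseManager):
        pass

    def run_server():
        # The single manager process: it owns the shared queue and serves it
        # to remote workers over the network.
        jobs = Queue()
        JobManager.register('get_jobs', callable=lambda: jobs)
        manager = JobManager(address=('', 50000), authkey=b'secret')  # placeholder port/key
        server = manager.get_server()
        server.serve_forever()

    def run_worker(manager_host):
        # A worker process on any machine connects to the manager and
        # pulls work from the shared queue.
        JobManager.register('get_jobs')
        manager = JobManager(address=(manager_host, 50000), authkey=b'secret')
        manager.connect()
        jobs = manager.get_jobs()
        while True:
            job = jobs.get()
            print('processing', job)  # stand-in for the real work

If run_server goes down, every worker loses its queue, which is exactly the single point of failure I'm worried about.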

Also, if the manager is going to be doing all of the database querying, would it make sense to have it on the same machine as the database?

+3  A: 

I think more information would be helpful: what sort of thing you are serving, what sort of database you'd use, what latency/throughput requirements you have, and so on. A lot depends on those requirements. For example, if your system is a typical server with many reads and relatively few writes, and you can tolerate reading slightly stale data, you could perform local reads against a cache in each process and only push the writes to the database, broadcasting the results back to the caches.
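Very roughly, that pattern might look something like this (the db_read/db_write hooks are placeholders for your actual database access):

    class CachedStore:
        """Per-process cache: reads are served locally when possible,
        writes always go to the database and refresh the local copy."""

        def __init__(self, db_read, db_write):
            self._cache = {}
            self._db_read = db_read     # e.g. a function doing a SELECT
            self._db_write = db_write   # e.g. a function doing an UPDATE

        def get(self, key):
            if key not in self._cache:
                self._cache[key] = self._db_read(key)
            return self._cache[key]

        def put(self, key, value):
            self._db_write(key, value)  # the database is the source of truth
            self._cache[key] = value    # keep the local copy fresh

        def invalidate(self, key):
            # Called when another process broadcasts that it changed `key`.
            self._cache.pop(key, None)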

For a start, I think it depends on what the manager has to do. After all, worrying about single points of failure may be pointless if your system is simple enough that nothing short of catastrophic hardware failure will bring it down. But if you do have just one manager, putting it on the same machine as the database makes sense: you reduce latency, and the system can't survive the loss of either one anyway.

Kylotan
+1: More information is needed.
S.Lott
+3  A: 

You have two main challenges in distributing the processes:

  1. Co-ordinating the work being split up, distributed and re-collected (mapped and reduced, you might say)
  2. Sharing the right live data between co-dependent processes

The answer to #1 will very much depend on what sort of processing you're doing. If it's easily horizontally partitionable (i.e. you can split the bigger task into several independent smaller tasks), a load balancer like HAProxy might be a convenient way to spread the load.
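For the easily-partitionable case, a rough sketch of the split/distribute/re-collect shape on a single box (the same idea extends across machines with whatever distribution mechanism you pick):

    from multiprocessing import Pool

    def handle_chunk(chunk):
        # Each chunk is independent, so any worker on any core can process it.
        return sum(item * item for item in chunk)  # stand-in for the real work

    def split(items, n):
        # Partition the work into n roughly equal, independent chunks.
        size = (len(items) + n - 1) // n
        return [items[i:i + size] for i in range(0, len(items), size)]

    if __name__ == '__main__':
        work = list(range(1_000_000))
        with Pool(processes=8) as pool:            # one pool per eight-core box
            partial_results = pool.map(handle_chunk, split(work, 24))
        print(sum(partial_results))                # re-collect ("reduce") the results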

If the task isn't trivially horizontally partitionable, I'd first look to see if existing tools, like Hadoop, would work for me. Distributed task management is a difficult problem to get right, and the wheel's already been invented.

As for #2, sharing state between the processes, your life will be much easier if you share an absolute minimum, and then only share it explicitly and in a well-defined way. I would personally use SQLAlchemy backed by your RDBMS of choice for even the smallest of tasks. The query interface is powerful and pain-free enough for small and large projects alike.
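For example, a rough sketch of sharing a minimal, well-defined piece of state through SQLAlchemy (the Task table and connection URL are placeholders):

    from sqlalchemy import Column, Integer, String, create_engine
    from sqlalchemy.orm import declarative_base, sessionmaker

    Base = declarative_base()

    class Task(Base):
        # The only state the processes share: rows in one well-defined table.
        __tablename__ = 'tasks'
        id = Column(Integer, primary_key=True)
        status = Column(String(20), default='pending')
        payload = Column(String)

    # Every worker process connects to the same database (URL is a placeholder).
    engine = create_engine('postgresql://user:password@dbhost/appdb')
    Base.metadata.create_all(engine)
    Session = sessionmaker(bind=engine, expire_on_commit=False)

    def claim_one_task():
        # Each process reads and writes shared state only through explicit
        # queries like this one.
        with Session() as session:
            task = session.query(Task).filter_by(status='pending').first()
            if task is not None:
                task.status = 'running'
                session.commit()
            return task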

Alabaster Codify
A: 

It seems the gist of your question is how to share objects and state. More information, particularly the size, frequency, rate of change, and source of the data, would be very helpful.

For cross-machine shared memory you probably want to look at memcached. You can store your data in it and access it quickly and easily from any of the worker processes.
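Roughly (assuming the pymemcache client library and a placeholder memcached host):

    from pymemcache.client.base import Client

    # Every worker process, on any machine, talks to the same memcached
    # instance (host/port are placeholders).
    cache = Client(('memcached-host', 11211))

    def record_result(job_id, result):
        # Values are stored as bytes/strings; serialize richer objects yourself
        # (or configure a serializer on the client).
        cache.set(f'result:{job_id}', str(result), expire=3600)

    def fetch_result(job_id):
        value = cache.get(f'result:{job_id}')
        return value.decode() if value is not None else None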

If your scenario is more of a simple job-distribution model, you might want to look at a queueing server: put your jobs and their associated data onto a queue and have the workers pick jobs up from the queue. Beanstalkd is probably a good choice for the queue, and there's a getting-started tutorial available for it.
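A rough sketch of that producer/worker split, assuming the greenstalk client library and a placeholder beanstalkd host:

    import json
    import greenstalk

    ADDRESS = ('beanstalkd-host', 11300)   # placeholder host/port

    def enqueue_jobs(jobs):
        # Producer: push each job and its associated data onto the queue.
        client = greenstalk.Client(ADDRESS)
        for job in jobs:
            client.put(json.dumps(job))
        client.close()

    def work_forever():
        # Worker: run one of these per core on each machine.
        client = greenstalk.Client(ADDRESS)
        while True:
            job = client.reserve()        # blocks until a job is available
            data = json.loads(job.body)
            print('processing', data)     # stand-in for the real work
            client.delete(job)            # remove the job once it's done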

Parand