I would like to create some sort of distributed setup for running a ton of small/simple REST web queries in a production environment. For each batch of 5-10 related queries executed from a node, I will generate a very small amount of derived data, which will need to be stored in a standard relational database (such as PostgreSQL).

What platforms are built for this type of problem? The nature, data sizes, and quantities involved seem to contradict the mindset of Hadoop. There are also more grid-based architectures, such as Condor and Sun Grid Engine, which I have seen mentioned. I'm not sure whether these platforms have any recovery from errors, though (i.e., checking whether a job succeeded).

What I would really like is a FIFO-type queue that I could add jobs to, with the end result of my database getting updated; something like the sketch below.
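To make that concrete, here is a rough single-machine sketch of the shape I'm after (plain Python plus the requests library; the URLs and store_derived() are placeholders). I'm essentially looking for this, but distributed across nodes and with real failure handling:

    # Rough single-machine sketch: stdlib FIFO queue, one worker thread.
    # The URLs and store_derived() are placeholders.
    import queue
    import threading

    import requests

    jobs = queue.Queue()  # FIFO: workers pull jobs in submission order

    def store_derived(responses):
        # Placeholder: reduce the responses to one small row and
        # INSERT it into PostgreSQL.
        print("storing derived row from", len(responses), "responses")

    def worker():
        while True:
            urls = jobs.get()  # one job = 5-10 related REST queries
            try:
                responses = [requests.get(u, timeout=10) for u in urls]
                store_derived(responses)
            except requests.RequestException:
                jobs.put(urls)  # crude recovery: requeue the failed job
            finally:
                jobs.task_done()

    threading.Thread(target=worker, daemon=True).start()
    jobs.put(["http://example.com/a", "http://example.com/b"])
    jobs.join()  # block until every queued job has been processed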

Any suggestions on the best tool for the job?

+1  A: 

Have you looked at Celery?
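It's a distributed task queue for Python: you define tasks, push them onto a broker (RabbitMQ by default), and workers pull them off roughly in FIFO order, with built-in retries on failure. Very roughly, a task for your workload might look like the following. This is an untested sketch against Celery's current standalone API, with requests and psycopg2 assumed; summarize() and the connection string are placeholders:

    # Untested sketch of a Celery task for this workload; summarize()
    # and the PostgreSQL DSN are placeholders.
    import psycopg2
    import requests
    from celery import Celery

    app = Celery("jobs", broker="amqp://localhost")

    def summarize(results):
        # Placeholder for whatever reduction produces the small derived row.
        return str(len(results))

    @app.task(bind=True, max_retries=3)
    def process_batch(self, urls):
        """Run a batch of related REST queries and store one derived row."""
        try:
            results = [requests.get(u, timeout=10).json() for u in urls]
        except requests.RequestException as exc:
            # Recovery: retry the whole job after 30s on transient failures.
            raise self.retry(exc=exc, countdown=30)

        derived = summarize(results)
        conn = psycopg2.connect("dbname=mydb")  # placeholder DSN
        with conn, conn.cursor() as cur:
            cur.execute("INSERT INTO derived_data (payload) VALUES (%s)",
                        (derived,))
        conn.close()

Enqueuing is then just process_batch.delay(list_of_urls) from anywhere that can reach the broker; any idle worker picks the job up.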

drg
The project looks interesting, although quite young. I'm also not sure about its robustness, based on the FAQ: "One reason that the queue is never emptied could be that you have a stale celery process taking the messages hostage. This could happen if celeryd wasn’t properly shut down." Also, the Django dependency is kind of annoying: "While it is possible to use Celery from outside of Django, we still need Django itself to run, this is to use the ORM and cache-framework."
EmpireJones
@EmpireJones Actually, that FAQ entry is not relevant anymore; it was about deleting currently waiting jobs in the queue. A worker may reserve some jobs in advance (because of the prefetch count); if the worker drops the broker connection, the jobs are sent elsewhere (or back to the same worker if it reconnects). The related bugs are now fixed; it turns out it was a problem with multiprocessing and forking.
asksol