Hi,

I'm building a web application (Django in my case, but I think the question is more general) that administers a cluster of workers doing queued jobs, and I need to track each job's progress.

When I've done this with database UPDATEs (PostgreSQL in this case), it severely hurts database performance, because in PostgreSQL each UPDATE creates a new row version and only vacuuming removes the obsolete ones. With 30 jobs running and each reporting progress every minute, the database needs vacuuming roughly every 10 days, and vacuuming means huge slowdowns on the front-end side for all the employees working with the system.

Because the progress information isn't critical, i.e. it doesn't have to be persistent, how would you do the progress updates from jobs without the overhead a database implies? There are 30 worker servers, each doing 1 or 2 jobs simultaneously, 1 front-end server which serves the web application to users, and 1 database server.

+1  A: 

There is a package called memcached which provides a fast in-memory server for key-value storage and retrieval. It's used by big clustered sites like Wikipedia.

It lets you share frequently changing data around your cluster without the DB overhead.
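
For example, workers could write their progress straight into memcached and the front end could read it back. A minimal sketch with the python-memcached client (the server address and key scheme here are just assumptions):

    import memcache

    # Shared memcached instance; point this at wherever memcached
    # runs in your cluster (the address is an assumption).
    mc = memcache.Client(['10.0.0.5:11211'])

    def report_progress(job_id, percent):
        # Worker side: each update simply overwrites the value in
        # memory, so nothing accumulates and nothing needs vacuuming.
        mc.set('job_progress:%d' % job_id, percent, time=300)

    def get_progress(job_id):
        # Front-end side: returns None if the job hasn't reported
        # yet or the entry has expired / been evicted.
        return mc.get('job_progress:%d' % job_id)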

amwinter
+1  A: 

If you are doing the inserts/updates/retrieves by key (for example, you access the rows by ID every time) you can use the Django caching framework with any of the cache backends that can be shared between servers. amwinter suggested memcached; there's a memcached cache backend in the Django distribution. But memcached doesn't guarantee it won't lose your data: for example, if you store large amounts of data, memcached will start evicting entries when it hits a certain memory limit. So keep that in mind. There's an extension for memcached that can make it persist data (forgot what it was called).
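
In code that could look something like this (a rough sketch; the key naming and timeout are assumptions, and it presumes a cache backend shared by the workers and the front end is configured in CACHES):

    from django.core.cache import cache

    def report_progress(job_id, percent):
        # Overwrites the previous value in place; the database is
        # never touched for progress updates.
        cache.set('job_progress:%s' % job_id, percent, timeout=300)

    def get_progress(job_id):
        # Second argument is the default returned when the key is
        # missing (never reported, expired or evicted).
        return cache.get('job_progress:%s' % job_id, 0)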

You may also consider Redis or MongoDB as a cache backend.
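
Redis fits this use case particularly well because it keeps data in memory like memcached but can optionally persist it to disk. A minimal sketch with the redis-py client (host and key scheme are assumptions):

    import redis

    # Shared Redis instance (host is an assumption).
    r = redis.Redis(host='10.0.0.5', port=6379)

    def report_progress(job_id, percent):
        # SET with an expiry; each update overwrites the old value.
        r.set('job_progress:%s' % job_id, percent, ex=300)

    def get_progress(job_id):
        # Returns None if no progress has been reported or it expired.
        value = r.get('job_progress:%s' % job_id)
        return int(value) if value is not None else None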

Vasil