Hi everyone,

I have a pretty standard Django + RabbitMQ + Celery setup with 1 Celery task and 5 workers.

The task uploads the same big file (~100 MB, simplifying a bit) asynchronously to a number of remote PCs.

Everything works fine, but at the cost of a lot of memory, since every task/worker loads that big file into memory separately.

What I would like is some kind of cache accessible to all tasks, i.e. a way to load the file only once. Django caching based on locmem would be perfect, but, as the documentation says, "each process will have its own private cache instance", and I need this cache accessible to all workers.

I tried playing with Celery signals as described in #2129820, but that's not what I need.

So the question is: is there a way to define something global in Celery (like a dict-based class where I could load the file, or something similar)? Or is there a Django trick I could use in this situation?

Thanks.

A: 

It seems to me that what you need is a memcached backend for Django. That way every Celery task will have access to it.
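A minimal sketch of that wiring, assuming a memcached backend in settings.py and the cache being read from the task (the cache key, file path and the @shared_task decorator are examples; exact backend and decorator names depend on your Django/Celery versions, and memcached's default per-item size limit still applies):

    # settings.py -- point Django's cache at memcached so all worker
    # processes talk to the same cache server
    CACHES = {
        'default': {
            'BACKEND': 'django.core.cache.backends.memcached.MemcachedCache',
            'LOCATION': '127.0.0.1:11211',
        }
    }

    # tasks.py
    from celery import shared_task
    from django.core.cache import cache

    FILE_PATH = '/path/to/big_file'   # example path

    @shared_task
    def upload_to_host(host):
        # First task to run populates the cache; the rest reuse it.
        data = cache.get('big_file')
        if data is None:
            with open(FILE_PATH, 'rb') as f:
                data = f.read()
            cache.set('big_file', data, 3600)
        # ... push `data` to `host` ...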

Łukasz
I thought about it; however, the biggest value one can store in memcached is 1 MB.
Lauris
Why not partition the file? If every task requires access to every bit of the file, then there's no way to avoid loading it every time.
Łukasz
Well, I'm hoping it is possible :). Partitioning would increase the complexity, and I think there should be a simpler way to tackle this.
Lauris
Shared memory across different processes? If all tasks run on the same machine (i.e. you're using a single Celery server), you can try http://pypi.python.org/pypi/posix_ipc
Łukasz
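For reference, a rough sketch of the posix_ipc idea: one process writes the file into a named shared-memory segment, and every worker maps the same segment instead of loading its own copy. The segment name is made up, error handling is minimal, and the segment should be unlinked with posix_ipc.unlink_shared_memory() when no longer needed.

    import mmap
    import posix_ipc

    SHM_NAME = '/big_file_shm'   # hypothetical segment name

    def write_shared(data):
        # Create the segment sized to the file and copy the bytes in once.
        shm = posix_ipc.SharedMemory(SHM_NAME, posix_ipc.O_CREX, size=len(data))
        m = mmap.mmap(shm.fd, shm.size)
        shm.close_fd()           # the mapping stays valid after the fd is closed
        m.write(data)
        m.close()

    def read_shared():
        # Each worker attaches to the existing segment and maps it read-only-ish.
        shm = posix_ipc.SharedMemory(SHM_NAME)
        m = mmap.mmap(shm.fd, shm.size)
        shm.close_fd()
        data = m[:]              # copies out; slice the mmap lazily to avoid the copy
        m.close()
        return data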
I'm using a single Celery server, yes. posix_ipc is certainly interesting, but I feel it's too low-level to solve this problem. I believe the solution lies somewhere in Django caching, a custom Celery loader, or something similar.
Lauris
Perhaps use a combination of Amazon S3 (or *some* file store) + memcached: memcached can simply store the file's location in S3 for all the other tasks to download and work on.
rlotun
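A sketch of that approach, assuming boto3 for the S3 calls (the library, bucket and key names are not part of the original setup): only the tiny location string lives in memcached, and each worker resolves it and streams the file to disk rather than holding it in memory.

    import boto3
    from django.core.cache import cache

    BUCKET = 'my-uploads'        # hypothetical bucket
    KEY = 'big_file.bin'         # hypothetical key

    def publish_file(local_path):
        # Upload the file once, then share only its location via the cache.
        boto3.client('s3').upload_file(local_path, BUCKET, KEY)
        cache.set('big_file_location', (BUCKET, KEY), None)

    def fetch_file(dest_path):
        # Each worker looks up the location and downloads its own copy to disk.
        bucket, key = cache.get('big_file_location')
        boto3.client('s3').download_file(bucket, key, dest_path)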
Thanks everyone for your ideas. To keep it simple, I'll probably end up uploading the files in chunks of a few MB.
Lauris
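A rough sketch of that chunked approach: stream the file a few MB at a time so no task ever holds the whole 100 MB in memory. The chunk size is an example, and send_chunk stands in for whatever call actually pushes bytes to a remote PC.

    CHUNK_SIZE = 4 * 1024 * 1024   # a few MB per chunk

    def upload_in_chunks(path, remote_host, send_chunk):
        with open(path, 'rb') as f:
            offset = 0
            while True:
                chunk = f.read(CHUNK_SIZE)
                if not chunk:
                    break
                send_chunk(remote_host, offset, chunk)   # hypothetical transport call
                offset += len(chunk)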
A: 

Maybe you can use threads instead of processes for this particular task. Since threads all share the same memory, you only need one copy of the data in memory, yet you still get parallel execution. (This means not using Celery for this task.)
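A minimal sketch of that idea: one process loads the file exactly once, and a thread pool shares that single copy while uploading to many hosts in parallel. upload_to_host here is a placeholder for whatever does the actual transfer.

    from concurrent.futures import ThreadPoolExecutor

    def upload_to_all(path, hosts, upload_to_host, workers=5):
        with open(path, 'rb') as f:
            data = f.read()                 # loaded once, shared by all threads
        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = [pool.submit(upload_to_host, host, data) for host in hosts]
            for future in futures:
                future.result()             # propagate any upload errors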

Nick Perkins