views:

286

answers:

6

Hi all,

I need to optimize the RAM usage of my application.
PLEASE spare me the lectures telling me I shouldn't care about memory when coding Python. I have a memory problem because I use very large default dictionaries (yes, I also want to be fast). My current memory consumption is 350MB and growing. I already cannot use shared hosting, and if Apache spawns more processes the memory usage doubles and triples... and that is expensive.
I have done extensive profiling and I know exactly where my problems are.
I have several large (>100K entries) dictionaries with Unicode keys. A dictionary starts at 140 bytes and grows fast, but the bigger problem is the keys. Python interns (byte) strings in memory (or so I've read) so that lookups can be done by identity comparison. I'm not sure this is also true for unicode strings; I was not able to intern() them.
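To illustrate what I mean: the built-in intern() rejects unicode in Python 2, so the closest substitute I can think of is a hand-rolled pool like the sketch below (uintern is just a made-up name). It would only pay off if the same strings also turn up in other dicts or lists, since a single dict already stores each key once.

    # Keep one canonical copy of each distinct unicode string.
    _pool = {}

    def uintern(u):
        return _pool.setdefault(u, u)

    my_big_dict[uintern(some_unicode_string)].append((my_object, an_int, another_int))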
The objects stored in the dictionary are lists of tuples (an_object, an int, an int).

my_big_dict[some_unicode_string].append((my_object, an_int, another_int))

I have already found that it is worthwhile to split things across several dictionaries, because the tuples take a lot of space...
I found that I could save RAM by hashing the strings before using them as keys! But then, sadly, I ran into birthday collisions on my 32-bit system. (Side question: is there a 64-bit-key dictionary I can use on a 32-bit system?)
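To make the idea concrete, this is roughly what widening the hash to 64 bits could look like (key64 is a made-up helper). Python integers are arbitrary precision, so a 64-bit value works as a dict key even on a 32-bit build, and at a few hundred thousand keys the collision probability becomes negligible.

    import hashlib
    import struct

    def key64(u):
        # Any well-mixed digest will do; take the first 8 bytes as an unsigned 64-bit int.
        return struct.unpack('<Q', hashlib.md5(u.encode('utf-8')).digest()[:8])[0]

    my_big_dict[key64(some_unicode_string)].append((my_object, an_int, another_int))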

Python 2.6.5 on both Linux (production) and Windows. Any tips on optimizing the memory usage of dictionaries / lists / tuples? I have even thought of using C - I don't care if this very small piece of code is ugly, it is just one isolated spot.

Thanks in advance!

+1  A: 

Use shelve or a database to store the data instead of an in-memory dict.
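A minimal sketch with shelve (the filename is just a placeholder, keys must be byte strings, and the stored objects have to be picklable):

    import shelve

    db = shelve.open('bigdict.db')                      # placeholder path
    key = some_unicode_string.encode('utf-8')           # shelve keys must be str
    db[key] = db.get(key, []) + [(my_object, an_int, another_int)]
    db.close()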

Ignacio Vazquez-Abrams
+1  A: 

I've been in situations where I had a collection of large objects that I needed to sort and filter by several metadata properties. I didn't need the bulky parts of them, so I dumped those to disk.

As your data is so simple in type, a quick SQLite database might solve all your problems and even speed things up a little.
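Something along these lines, as an untested sketch (table and column names are made up, and the object itself would need to be stored as an id or a pickled blob rather than directly):

    import sqlite3

    conn = sqlite3.connect('data.db')                   # or ':memory:' while experimenting
    conn.execute('CREATE TABLE IF NOT EXISTS entries '
                 '(key TEXT, obj_id INTEGER, a INTEGER, b INTEGER)')

    conn.execute('INSERT INTO entries VALUES (?, ?, ?, ?)',
                 (some_unicode_string, id_of_my_object, an_int, another_int))  # id_of_my_object: however you reference my_object
    conn.commit()

    rows = conn.execute('SELECT obj_id, a, b FROM entries WHERE key = ?',
                        (some_unicode_string,)).fetchall()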

Oli
+2  A: 

For a web application you should use a database. The way you're doing it now, you create one copy of your dict in each Apache process, which is extremely wasteful. If the server has enough memory, the database table will be cached in memory (and if there isn't enough for one copy of your table, put more RAM in the server). Just remember to put the correct indices on your database table, or you will get bad performance.
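For example, with the stdlib sqlite3 module (table and column names are just placeholders):

    import sqlite3

    conn = sqlite3.connect('data.db')
    conn.execute('CREATE INDEX IF NOT EXISTS idx_entries_key ON entries (key)')
    conn.commit()
    # The query plan should mention the index rather than a full table scan.
    print conn.execute('EXPLAIN QUERY PLAN SELECT * FROM entries WHERE key = ?',
                       (u'some key',)).fetchall()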

Fabian
A: 

If you want to do extensive optimization and have full control over memory usage, you could also write a C/C++ module. With SWIG, wrapping the code for Python is easy, with only a small performance overhead compared to a pure C Python module.

zoli2k
+1  A: 

If you want to stay with the in-memory data store, you could try something like memcached.

That way, you can use a single in-memory key/value store from all the Python processes.

There are several Python memcached client libraries.
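With the python-memcached client, a rough sketch (the server address is a placeholder; keys must be short byte strings, and values are pickled, so my_object has to be picklable):

    import memcache

    mc = memcache.Client(['127.0.0.1:11211'])           # address of your memcached server
    key = some_unicode_string.encode('utf-8')           # keys must be byte strings
    entries = mc.get(key) or []
    entries.append((my_object, an_int, another_int))
    mc.set(key, entries)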

codeape
memcached is lossy, and so is unsuitable for a datastore.
Ignacio Vazquez-Abrams
+1  A: 

Redis would be a great fit here if you have the option to use it on a shared host - similar to memcached, but optimised for data structures. Redis also has Python bindings.

I use it on a day-to-day basis for number crunching, but also in production systems as a datastore, and cannot recommend it highly enough.
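A rough sketch with the redis-py client (host and port are whatever your host provides; the tuples have to be serialized, here with pickle):

    import pickle
    import redis

    r = redis.Redis(host='localhost', port=6379, db=0)
    key = some_unicode_string.encode('utf-8')
    r.rpush(key, pickle.dumps((my_object, an_int, another_int)))   # append to a Redis list
    entries = [pickle.loads(s) for s in r.lrange(key, 0, -1)]      # read the list back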

Also, do you have the option to proxy your app behind nginx instead of using Apache? You might find (if allowed) this proxy/webapp arrangement less hungry on resources.

Good luck.

Ben Hughes