views: 690
answers: 3

hey all, I have a file on disk that's only 168MB. It's just a comma-separated list of id,word pairs, where a "word" can actually be a phrase 1-5 words long. There are 6.5 million lines. I created a dictionary in Python to load this up into memory so I can search incoming text against that list of words. When Python loads it up into memory it shows 1.3 GB of RAM used. Any idea why that is?

so let's say my word file looks like this...

1,word1
2,word2
3,word3

and so on, up to 6.5 million lines. I then loop through that file and create a dictionary (Python 2.6.1):

  import csv
  import os

  cached_terms = {}

  def load_term_cache():
      """Load the term cache from our cached file instead of hitting MySQL.
      If it didn't preload into memory it would be 20+ million queries per process."""
      global cached_terms
      dumpfile = os.path.join(os.getenv("MY_PATH"), 'datafiles', 'baseterms.txt')
      f = open(dumpfile)
      cache = csv.reader(f)
      for term_id, term in cache:       # each row on disk is id,word
          cached_terms[term] = term_id  # key by word so incoming text can be matched
      f.close()

Just doing that blows up the memory. Activity Monitor shows the memory pegged at everything available, up to around 1.5GB of RAM, and on my laptop it just starts to swap. Any ideas how to most efficiently store key/value pairs in memory with Python?

thanks

UPDATE: I tried the anydbm module, and after 4.4 million records it just dies. Each row below is the insert count followed by the elapsed seconds since I started loading:

…         56.95
3400018   60.12
3600019   63.27
3800020   66.43
4000021   69.59
4200022   72.75
4400023   83.42
4600024   168.61
4800025   338.57

You can see it was running great, inserting 200,000 rows every few seconds, until I hit a wall and the time per batch doubled.

  import anydbm
  import os
  import time

  i = 0
  mark = 0
  starttime = time.time()
  dbfile = os.path.join(os.getenv("MY_PATH"), 'datafiles', 'baseterms')
  db = anydbm.open(dbfile, 'c')
  # load from the existing baseterms file
  termfile = os.path.join(os.getenv("MY_PATH"), 'datafiles', 'baseterms.txt.LARGE')
  for line in open(termfile):
    i += 1
    pieces = line.rstrip('\n').split(',')  # rows are id,word
    db[pieces[1]] = pieces[0]              # key by word, store the id
    if i > mark:
      print i
      print round(time.time() - starttime, 2)
      mark = i + 200000
  db.close()
+2  A: 

Convert your data into a dbm (import anydbm, or use Berkeley DB via import bsddb ...), and then use the dbm API to access it.

The reason it explodes is that Python keeps extra metadata for every object, and the dict has to build a hash table (which requires even more memory). You created so many objects (6.5M) that the metadata alone becomes huge.

import bsddb
a = bsddb.btopen('a.bdb')  # you can also try bsddb.hashopen
for x in xrange(10500):
  a['word%d' % x] = '%d' % x
a.close()

This code takes only 1 second to run, so I think the speed is OK (since you said 10,500 lines per second). btopen creates a db file 499,712 bytes long, and hashopen creates one of 319,488 bytes.

With the xrange input bumped to 6.5M and using btopen, I got a 417,080KB output file and insertion took around 1 or 2 minutes to complete. So I think it's totally suitable for you.
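
For concreteness, here is a rough sketch of pointing the loader from the question at btopen instead of a dict (paths reused from the question; the read-back lines at the end are just illustrative):

  import os
  import bsddb

  def load_term_db():
      """Build the same word -> id mapping in an on-disk B-tree instead of a dict."""
      dbfile = os.path.join(os.getenv("MY_PATH"), 'datafiles', 'baseterms.bdb')
      termfile = os.path.join(os.getenv("MY_PATH"), 'datafiles', 'baseterms.txt')
      db = bsddb.btopen(dbfile, 'c')
      for line in open(termfile):
          term_id, term = line.rstrip('\n').split(',', 1)  # rows are id,word
          db[term] = term_id
      db.sync()
      return db

  # lookups stay dict-like:
  #   db = bsddb.btopen(dbfile, 'r')
  #   if db.has_key(incoming_word):
  #       term_id = db[incoming_word]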

Francis
The reason I moved from the database to in-memory is that I had it in a MySQL table with the word as the primary key (there can only be one unique word or phrase). I'm doing around 20 million queries to process one hour of data, and moving it to memory shaved around 10 minutes off the processing time. Did you have a particular idea in mind to help with that situation?
You don't need a SQL database (they are powerful but slower). DBMs are key-value style databases, so they should be fast (very similar to dict objects) and need a smaller memory footprint than SQL. Berkeley DB would be my best choice (import bsddb).
Francis
@beagleguy: dbm-style databases are key-value oriented, not SQL. Modifying your program to work with one of them, now that you've got it working on a dictionary, won't take much time (vs. going from MySQL to dict), and then you can compare with MySQL. It will always be better than having 1.3 GB of RAM taken over by your data :)
Heim
Also, if you plan to analyze various datasets concurrently from a number of different computers, you may want to have a look at key-value databases like Redis.
Heim
hi Heim, I actually tried using Redis tonight for just that purpose. It took 10 minutes to insert the data. I was averaging 10,500 inserts per second, a tenth of the Redis benchmarks. I was using the Python redis client module. Also, Redis took up around 1GB of memory for the same 168MB file.
Mmh... Well, Redis was just an example. You mention millions of keys, and Redis keeps all the keys in memory, so you'll see a problem similar to the one you have in Python. Anyway, a networked db only makes sense in this case if you really need it. I'd say just go for dbm if you only need local access.
Heim
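(A side note on the Redis rate mentioned above: issuing one SET per round trip is probably what caps it near 10K inserts/sec; batching through a redis-py pipeline removes most of the round trips. A rough sketch, with connection details and batch size assumed:)

  import redis  # redis-py client

  r = redis.Redis(host='localhost', port=6379, db=0)  # connection details assumed
  pipe = r.pipeline()
  count = 0
  for line in open(termfile):               # termfile as in the question's loader
      term_id, term = line.rstrip('\n').split(',', 1)
      pipe.set(term, term_id)
      count += 1
      if count % 10000 == 0:                # send 10,000 commands per round trip
          pipe.execute()
  pipe.execute()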
See my edits - using btopen creates a ~400MB db file and takes only 1 or 2 minutes to complete on my computer.
Francis
hi Francis... I looked at the docs and saw this: "Deprecated since version 2.6: The bsddb module has been deprecated for removal in Python 3.0." Does shelve do the same thing?
Well, I think you don't need to worry about 3.0 unless you are determined to use 3.0 now. shelve is more like Python-native storage, so it should work similarly, but there's no promise on performance.
Francis
One more note: they removed it from the standard distribution but still allow it as an external module. See http://mail.python.org/pipermail/python-dev/2008-July/081379.html and pybsddb: http://www.jcea.es/programacion/pybsddb.htm, which is compatible with Python 3.1.
Francis
P.S.: I've just tried shelve on Python 2.6. In brief, it behaves more like bsddb.hashopen, and insertion gets slow once the data grows really huge, compared to btopen in bsddb.
Francis
yea shelve is killin me right now... it's been running for 20 minutes... the original file is 6.5 million lines and the shelve file is over 35 million lines so far. Wonder if sqlite might fare any better
why not pybsddb?
Francis
+9  A: 

Take a look (Python 2.6, 32-bit version)...:

>>> sys.getsizeof('word,1')
30
>>> sys.getsizeof(('word', '1'))
36
>>> sys.getsizeof(dict(word='1'))
140

The string (taking 6 bytes on disk, clearly) gets an overhead of 24 bytes (no matter how long it is, add 24 to its length to find how much memory it takes). When you split it into a tuple, that's a little bit more. But the dict is what really blows things up: even an empty dict takes 140 bytes -- pure overhead of maintaining a blazingly-fast hash-based lookup table. To be fast, a hash table must have low density -- and Python ensures a dict always has low density (by taking up a lot of extra memory for it).

The most memory-efficient way to store key/value pairs is as a list of tuples, but lookup of course will be very slow (even if you sort the list and use bisect for the lookup, it's still going to be much slower than a dict).
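
For illustration, a bare-bones version of that sorted-list approach (names invented; dumpfile as in the question):

  import bisect
  import csv

  # build once: two parallel lists sorted by word (no per-entry dict slots)
  pairs = sorted((term, term_id) for term_id, term in csv.reader(open(dumpfile)))
  words = [w for w, _ in pairs]
  ids = [i for _, i in pairs]

  def lookup(word):
      """Binary search: much smaller than a dict, but noticeably slower per lookup."""
      pos = bisect.bisect_left(words, word)
      if pos < len(words) and words[pos] == word:
          return ids[pos]
      return None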

Consider using shelve instead -- that will use little memory (since the data reside on disk) and still offer pretty spiffy lookup performance (not as fast as an in-memory dict, of course, but for a large amount of data it will be much faster than lookup on a list of tuples, even a sorted one, can ever be!-).
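
A minimal sketch of the shelve route (filename assumed; protocol=-1 picks the fastest pickle format, as also suggested in the comments below):

  import os
  import shelve

  shelf_path = os.path.join(os.getenv("MY_PATH"), 'datafiles', 'baseterms.shelf')
  shelf = shelve.open(shelf_path, flag='c', protocol=-1, writeback=False)
  for line in open(termfile):                     # termfile as in the question
      term_id, term = line.rstrip('\n').split(',', 1)
      shelf[term] = term_id
  shelf.close()

  # per worker process, later:
  #   shelf = shelve.open(shelf_path, flag='r')
  #   term_id = shelf.get(incoming_word)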

Alex Martelli
thanks alex for some great info, I'm trying out shelve right now and will let you know how it goes
hrm, shelve was a no-go; it just started dying after 5 million rows and I wound up killing the insert after 20 minutes
@beagleguy: Did you try shelve with protocol=-1?
John Machin
`protocol=-1` is a good suggestion, but I don't know what "started dying" means -- probably just the thrashing behavior causing slow operation once you exhaust physical memory. So the next thing to try is a real database, whether relational or not -- `bsddb` is one (non-relational), and implicitly used by `anydbm` (and, with one indirection, `shelve`), but the version that comes with Python is not the best or fastest. I'd try the `sqlite` that does come with Python, first, to avoid installing any further pieces; if still not enough, third-party DBs are next.
Alex Martelli
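
A rough sketch of that sqlite3 route (table and file names are arbitrary); the per-word lookup then becomes an indexed primary-key query:

  import sqlite3

  conn = sqlite3.connect('baseterms.sqlite')   # file name assumed
  conn.execute('CREATE TABLE IF NOT EXISTS terms (word TEXT PRIMARY KEY, term_id INTEGER)')

  def rows(path):
      for line in open(path):
          term_id, term = line.rstrip('\n').split(',', 1)
          yield term, int(term_id)

  conn.executemany('INSERT OR REPLACE INTO terms VALUES (?, ?)', rows(termfile))
  conn.commit()

  # lookup:
  #   row = conn.execute('SELECT term_id FROM terms WHERE word = ?', (incoming_word,)).fetchone()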
+1  A: 
John Machin
hey John, ok I made the edit. hope that helps show whatever dumb thing I'm doing :)
doh, yea I wrote it wrong; I edited it to show how the data is stored on disk: 1,word
yes it's all stripped before it gets saved using wordstr.strip()
@beagleguy: Please post the code that you are ACTUALLY RUNNING, exactly, i.e. prefer copy/paste. And answer the other questions.
John Machin
that code is the code that's actually running to load the term file
@beagleguy: Please answer questions by editing your question, not in a comment.
John Machin