views: 690
answers: 3

hey all, I have a file on disk that's only 168MB. It's just a comma-separated list of id,word pairs, where a "word" can actually be a phrase 1-5 words long. There are 6.5 million lines. I created a dictionary in Python to load this up into memory so I can search incoming text against that list of words. When Python loads it up into memory it shows 1.3 GB of RAM used. Any idea why that is?

so let's say my word file looks like this...

1,word1
2,word2
3,word3

and so on, up to 6.5 million lines. I then loop through that file and create a dictionary (Python 2.6.1):

  import csv
  import os

  cached_terms = {}

  def load_term_cache():
      """Load the term cache from our cached file instead of hitting MySQL.
      If it didn't preload into memory it would be 20+ million queries per process."""
      global cached_terms
      dumpfile = os.path.join(os.getenv("MY_PATH"), 'datafiles', 'baseterms.txt')
      f = open(dumpfile)
      cache = csv.reader(f)
      for term_id, term in cache:       # each row on disk is id,word
          cached_terms[term] = term_id  # key by word so incoming text can be matched
      f.close()

Just doing that blows up the memory. Activity Monitor shows the memory pegged at everything available, up to around 1.5GB of RAM, and on my laptop it just starts to swap. Any ideas how to most efficiently store key/value pairs in memory with Python?

thanks

UPDATE: I tried the anydbm module, and after 4.4 million records it just dies. Each row below is the insert count followed by the elapsed seconds since I started loading:

…         56.95
3400018   60.12
3600019   63.27
3800020   66.43
4000021   69.59
4200022   72.75
4400023   83.42
4600024   168.61
4800025   338.57

You can see it was running great, inserting 200,000 rows every few seconds, until I hit a wall and the time per batch doubled.

  import anydbm
  import os
  import time

  i = 0
  mark = 0
  starttime = time.time()
  dbfile = os.path.join(os.getenv("MY_PATH"), 'datafiles', 'baseterms')
  db = anydbm.open(dbfile, 'c')
  # load from the existing baseterms file
  termfile = os.path.join(os.getenv("MY_PATH"), 'datafiles', 'baseterms.txt.LARGE')
  for line in open(termfile):
    i += 1
    pieces = line.rstrip('\n').split(',')  # rows are id,word
    db[pieces[1]] = pieces[0]              # key by word, store the id
    if i > mark:
      print i
      print round(time.time() - starttime, 2)
      mark = i + 200000
  db.close()
+2  A: 

Convert your data into a dbm (import anydbm, or use Berkeley DB via import bsddb ...), and then use the dbm API to access it.

The reason it explodes is that Python keeps extra metadata for every object, and the dict has to build a hash table (which requires even more memory). You created so many objects (6.5M) that the metadata alone becomes huge.

import bsddb
a = bsddb.btopen('a.bdb')  # you can also try bsddb.hashopen
for x in xrange(10500):
  a['word%d' % x] = '%d' % x
a.close()

This code takes only 1 second to run, so I think the speed is OK (since you said 10,500 lines per second). btopen creates a db file 499,712 bytes long, and hashopen creates one of 319,488 bytes.

With the xrange input bumped to 6.5M and using btopen, I got a 417,080KB output file and insertion took around 1 or 2 minutes to complete. So I think it's totally suitable for you.
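
For concreteness, here is a rough sketch of pointing the loader from the question at btopen instead of a dict (paths reused from the question; the read-back lines at the end are just illustrative):

  import os
  import bsddb

  def load_term_db():
      """Build the same word -> id mapping in an on-disk B-tree instead of a dict."""
      dbfile = os.path.join(os.getenv("MY_PATH"), 'datafiles', 'baseterms.bdb')
      termfile = os.path.join(os.getenv("MY_PATH"), 'datafiles', 'baseterms.txt')
      db = bsddb.btopen(dbfile, 'c')
      for line in open(termfile):
          term_id, term = line.rstrip('\n').split(',', 1)  # rows are id,word
          db[term] = term_id
      db.sync()
      return db

  # lookups stay dict-like:
  #   db = bsddb.btopen(dbfile, 'r')
  #   if db.has_key(incoming_word):
  #       term_id = db[incoming_word]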

Francis
The reason I moved from the database to in-memory is that I had it in a MySQL table with the word as the primary key (there can only be one unique word or phrase). I'm doing around 20 million queries to process one hour of data, and moving it to memory shaved around 10 minutes off the processing time. Did you have a particular idea in mind to help with that situation?
You don't need a SQL database (they are powerful but slower). DBMs are key-value style databases, so they should be fast (very similar to dict objects) and need a smaller memory footprint than SQL. Berkeley DB would be my best choice (import bsddb).
Francis
@beagleguy: dbm-style databases are key-value oriented, not SQL. Modifying your program to work with one of them, now that you've got it working on a dictionary, won't take much time (vs. going from MySQL to dict), and then you can compare with MySQL. It will always be better than having 1.3 GB of RAM taken over by your data :)
Heim
Also, if you plan to analyze various datasets concurrently from a number of different computers, you may want to have a look at key-value databases like Redis.
Heim
hi Heim, I actually tried using Redis tonight for just that purpose. It took 10 minutes to insert the data. I was averaging 10,500 inserts per second, a tenth of the Redis benchmarks. I was using the Python redis client module. Also, Redis took up around 1GB of memory for the same 168MB file.
Mmh... Well, Redis was just an example. You mention millions of keys, and Redis keeps all the keys in memory, so you'll see a problem similar to the one you have in Python. Anyway, a networked db only makes sense in this case if you really need it. I'd say just go for dbm if you only need local access.
Heim
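(A side note on the Redis rate mentioned above: issuing one SET per round trip is probably what caps it near 10K inserts/sec; batching through a redis-py pipeline removes most of the round trips. A rough sketch, with connection details and batch size assumed:)

  import redis  # redis-py client

  r = redis.Redis(host='localhost', port=6379, db=0)  # connection details assumed
  pipe = r.pipeline()
  count = 0
  for line in open(termfile):               # termfile as in the question's loader
      term_id, term = line.rstrip('\n').split(',', 1)
      pipe.set(term, term_id)
      count += 1
      if count % 10000 == 0:                # send 10,000 commands per round trip
          pipe.execute()
  pipe.execute()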
See my edits - using btopen creates a ~400MB db file and takes only 1 or 2 minutes to complete on my computer.
Francis
hi Francis... I looked at the docs and saw this: "Deprecated since version 2.6: The bsddb module has been deprecated for removal in Python 3.0." Does shelve do the same thing?
Well, I think you don't need to worry about 3.0 unless you are determined to use 3.0 now. shelve is more like Python-native storage, so it should work similarly, but there's no promise on performance.
Francis
One more note: they removed it from the standard distribution but still allow it as an external module. See http://mail.python.org/pipermail/python-dev/2008-July/081379.html and pybsddb: http://www.jcea.es/programacion/pybsddb.htm, which is compatible with Python 3.1.
Francis
P.S.: I've just tried shelve on Python 2.6. In brief, it behaves more like bsddb.hashopen, and insertion gets slow once the data grows really huge, compared to btopen in bsddb.
Francis
yea shelve is killin me right now... it's been running for 20 minutes... the original file is 6.5 million lines and the shelve file is over 35 million lines so far. Wonder if sqlite might fare any better
why not pybsddb?
Francis
+9  A: 

Take a look (Python 2.6, 32-bit version)...:

>>> sys.getsizeof('word,1')
30
>>> sys.getsizeof(('word', '1'))
36
>>> sys.getsizeof(dict(word='1'))
140

The string (taking 6 bytes on disk, clearly) gets an overhead of 24 bytes (no matter how long it is, add 24 to its length to find how much memory it takes). When you split it into a tuple, that's a little bit more. But the dict is what really blows things up: even an empty dict takes 140 bytes -- pure overhead of maintaining a blazingly-fast hash-based lookup table. To be fast, a hash table must have low density -- and Python ensures a dict always has low density (by taking up a lot of extra memory for it).

The most memory-efficient way to store key/value pairs is as a list of tuples, but lookup of course will be very slow (even if you sort the list and use bisect for the lookup, it's still going to be much slower than a dict).
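
For illustration, a bare-bones version of that sorted-list approach (names invented; dumpfile as in the question):

  import bisect
  import csv

  # build once: two parallel lists sorted by word (no per-entry dict slots)
  pairs = sorted((term, term_id) for term_id, term in csv.reader(open(dumpfile)))
  words = [w for w, _ in pairs]
  ids = [i for _, i in pairs]

  def lookup(word):
      """Binary search: much smaller than a dict, but noticeably slower per lookup."""
      pos = bisect.bisect_left(words, word)
      if pos < len(words) and words[pos] == word:
          return ids[pos]
      return None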

Consider using shelve instead -- that will use little memory (since the data reside on disk) and still offer pretty spiffy lookup performance (not as fast as an in-memory dict, of course, but for a large amount of data it will be much faster than lookup on a list of tuples, even a sorted one, can ever be!-).
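
A minimal sketch of the shelve route (filename assumed; protocol=-1 picks the fastest pickle format, as also suggested in the comments below):

  import os
  import shelve

  shelf_path = os.path.join(os.getenv("MY_PATH"), 'datafiles', 'baseterms.shelf')
  shelf = shelve.open(shelf_path, flag='c', protocol=-1, writeback=False)
  for line in open(termfile):                     # termfile as in the question
      term_id, term = line.rstrip('\n').split(',', 1)
      shelf[term] = term_id
  shelf.close()

  # per worker process, later:
  #   shelf = shelve.open(shelf_path, flag='r')
  #   term_id = shelf.get(incoming_word)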

Alex Martelli
thanks alex for some great info, I'm trying out shelve right now and will let you know how it goes
hrm, shelve was a no-go; it just started dying after 5 million rows and I wound up killing the insert after 20 minutes
@beagleguy: Did you try shelve with protocol=-1?
John Machin
`protocol=-1` is a good suggestion, but I don't know what "started dying" means -- probably just the thrashing behavior causing slow operation once you exhaust physical memory. So the next thing to try is a real database, whether relational or not -- `bsddb` is one (non-relational), and implicitly used by `anydbm` (and, with one indirection, `shelve`), but the version that comes with Python is not the best or fastest. I'd try the `sqlite` that does come with Python, first, to avoid installing any further pieces; if still not enough, third-party DBs are next.
Alex Martelli
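
A rough sketch of that sqlite3 route (table and file names are arbitrary); the per-word lookup then becomes an indexed primary-key query:

  import sqlite3

  conn = sqlite3.connect('baseterms.sqlite')   # file name assumed
  conn.execute('CREATE TABLE IF NOT EXISTS terms (word TEXT PRIMARY KEY, term_id INTEGER)')

  def rows(path):
      for line in open(path):
          term_id, term = line.rstrip('\n').split(',', 1)
          yield term, int(term_id)

  conn.executemany('INSERT OR REPLACE INTO terms VALUES (?, ?)', rows(termfile))
  conn.commit()

  # lookup:
  #   row = conn.execute('SELECT term_id FROM terms WHERE word = ?', (incoming_word,)).fetchone()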
+1  A: 
John Machin
hey John, ok I made the edit. hope that helps show whatever dumb thing I'm doing :)
doh, yea I wrote it wrong; I edited it to show how the data is stored on disk: 1,word
yes it's all stripped before it gets saved using wordstr.strip()
@beagleguy: Please post the code that you are ACTUALLY RUNNING, exactly, i.e. prefer copy/paste. And answer the other questions.
John Machin
that code is the code that's actually running to load the term file
@beagleguy: Please answer questions by editing your question, not in a comment.
John Machin