ansaurus

Question

Answer 1

A:

Something like this, if I understand your question correctly:

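from collections import defaultdict
import pickle

# filenames: list of paths to the pickled dicts (assumed to be defined elsewhere)
result = defaultdict(int)
for fn in filenames:
    # pickle files must be opened in binary mode; "with" closes them promptly
    with open(fn, "rb") as f:
        data_dict = pickle.load(f)
    for k, count in data_dict.items():
        # k is a (word, corpus) tuple; use k[0] as the key to total per word instead
        result[k] += count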

gnibbler 2010-02-17 02:25:48

Answer 2

+2 A:

A disk-based, dictionary-like object exists: see the shelve module. Keys into a shelf must be strings, but you could simply use str on your tuples to obtain equivalent string keys. Besides, I read your Q as meaning that you want only word as the key, so that's even easier (either str or, for vocabularies < 4GB, a struct.pack will be fine).

A good relational engine (especially PostgreSQL) would serve you well, but processing one dictionary at a time to aggregate each word's occurrences over all corpora into a shelf object should also be OK. It is not quite as fast, but simpler to code, since a shelf is so similar to a dict except for the type constraint on keys (and a caveat for mutable values, but as your values are ints that need not concern you).
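
To make the shelf idea concrete, here is a minimal sketch. It assumes each pickle holds a dict of the form {(word, corpus): count} and that you want totals per word; the function name merge_into_shelf, the file names, and the sample data are made up for illustration.

```python
import os
import pickle
import shelve
import tempfile


def merge_into_shelf(filenames, shelf_path):
    """Sum per-word counts from pickled {(word, corpus): count} dicts into a shelf."""
    with shelve.open(shelf_path) as shelf:
        for fn in filenames:
            # Only one dictionary is held in memory at a time.
            with open(fn, "rb") as f:
                data_dict = pickle.load(f)
            for (word, corpus), count in data_dict.items():
                key = str(word)  # shelf keys must be strings
                shelf[key] = shelf.get(key, 0) + count


# Tiny demonstration with two made-up dictionaries.
tmpdir = tempfile.mkdtemp()
samples = [
    {("apple", "c1"): 2, ("pear", "c1"): 1},
    {("apple", "c2"): 3},
]
filenames = []
for i, d in enumerate(samples):
    path = os.path.join(tmpdir, "part%d.pkl" % i)
    with open(path, "wb") as f:
        pickle.dump(d, f)
    filenames.append(path)

shelf_path = os.path.join(tmpdir, "totals")
merge_into_shelf(filenames, shelf_path)

# Read the totals back; for a real vocabulary you would iterate, not copy.
with shelve.open(shelf_path) as shelf:
    totals = dict(shelf)
```

Because the values are plain ints and every update is a fresh assignment to shelf[key], the mutable-value caveat does not come into play here.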

Alex Martelli 2010-02-17 02:29:11

Answer 3

A:

If I understood your question correctly and you have integer ids for the words and corpora, then you can gain some performance by switching from a dict to a list, or even better, a numpy array. This may be annoying to set up!

Basically, you need to replace the tuple with a single integer, which we can call the newid. You want all the newids to correspond to a word,corpus pair, so I would count the words in each corpus, and then have, for each corpus, a starting newid. The newid of (word,corpus) will then be word + start_newid[corpus].
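
A rough sketch of that mapping, assuming word ids run from 0 to n-1 within each corpus and corpus ids are small integers. The sizes in words_per_corpus and the sample dictionaries are invented for illustration, and a plain list stands in for the numpy array so the snippet has no dependencies.

```python
from itertools import accumulate

# Hypothetical sizes: number of distinct word ids (0..n-1) in each corpus.
words_per_corpus = [3, 2, 4]

# start_newid[c] is the first slot reserved for corpus c: [0, 3, 5] here.
start_newid = [0] + list(accumulate(words_per_corpus))[:-1]


def newid(word, corpus):
    """Map a (word, corpus) pair of integer ids to one flat index."""
    return word + start_newid[corpus]


# A flat list replaces the dict keyed by tuples.
counts = [0] * sum(words_per_corpus)

# Made-up partial dictionaries to merge.
partials = [
    {(0, 0): 2, (1, 1): 5},
    {(0, 0): 1, (3, 2): 7},
]
for data_dict in partials:
    for (word, corpus), count in data_dict.items():
        counts[newid(word, corpus)] += count
```

Swapping the list for a numpy integer array of the same length should only change the line that creates counts.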

If I misunderstood you and you don't have such ids, then I think this advice might still be useful, but you will have to manipulate your data to get it into the tuple of ints format.

Another thing you could try is rechunking the data.

Let's say that you can only hold 1.1 of these monsters in memory. Then, you can load one, and create a smaller dict or array that only corresponds to the first 10% of (word,corpus) pairs. You can scan through the loaded dict, and deal with any of the ones that are in the first 10%. When you are done, you can write the result back to disk, and do another pass for the second 10%. This will require 10 passes, but that might be OK for you.
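
Here is one way the multi-pass idea could look. Note the assumption: instead of literally taking "the first 10%" of pairs, this sketch assigns each key to one of n_passes slices with a CRC32 hash of its repr, which is just a convenient deterministic way to split the key space. The names bucket and merge_in_passes, the output file names, and the sample data are all hypothetical.

```python
import os
import pickle
import tempfile
import zlib


def bucket(key, n_passes):
    """Deterministically assign a key to one of n_passes slices."""
    return zlib.crc32(repr(key).encode("utf-8")) % n_passes


def merge_in_passes(filenames, out_dir, n_passes=10):
    """Write one merged dict per pass; each holds roughly 1/n_passes of the keys."""
    out_paths = []
    for p in range(n_passes):
        partial = {}
        for fn in filenames:
            # One big dict is loaded at a time, next to the small partial result.
            with open(fn, "rb") as f:
                data_dict = pickle.load(f)
            for k, count in data_dict.items():
                if bucket(k, n_passes) == p:
                    partial[k] = partial.get(k, 0) + count
        out_path = os.path.join(out_dir, "merged_%02d.pkl" % p)
        with open(out_path, "wb") as f:
            pickle.dump(partial, f)
        out_paths.append(out_path)
    return out_paths


# Tiny demonstration with two made-up dictionaries.
tmpdir = tempfile.mkdtemp()
samples = [
    {("apple", "c1"): 2, ("pear", "c1"): 1},
    {("apple", "c1"): 3, ("apple", "c2"): 3},
]
filenames = []
for i, d in enumerate(samples):
    path = os.path.join(tmpdir, "part%d.pkl" % i)
    with open(path, "wb") as f:
        pickle.dump(d, f)
    filenames.append(path)

out_paths = merge_in_passes(filenames, tmpdir, n_passes=10)

# The slices are disjoint, so stitching them together gives the full merge.
combined = {}
total_keys = 0
for path in out_paths:
    with open(path, "rb") as f:
        part = pickle.load(f)
    total_keys += len(part)
    combined.update(part)
```

The cost is that every input file is read n_passes times, which is the trade-off described above.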

If you chose your previous chunking based on what would fit in memory, then you will have to arbitrarily break your old dicts in half so that you can hold one in memory while also holding the result dict/array.

forefinger 2010-02-17 03:42:42


merging dictionaries in python
