views:

114

answers:

4

I have a very long list of dictionaries with string keys and integer values. Many of the keys are the same across the dictionaries, though not all. I want to generate one dictionary in which the keys are the union of the keys in the separate dictionaries and the value for each key is the sum of all the values for that key across the dictionaries. (For example, the value for the key 'apple' in the combined dictionary will be the value of 'apple' in the first dictionary, plus the value of 'apple' in the second, and so on.)

I have the following, but it's rather cumbersome and takes ages to execute. Is there a simpler way to achieve the same result?

def combine_dicts(list_dictionaries):
    comb_dict = {}
    for dictionary in list_dictionaries:
        for key in dictionary:
            comb_dict.setdefault(key, 0)
            comb_dict[key] += dictionary[key]
    return comb_dict
+1  A: 

Use collections.defaultdict instead.

http://docs.python.org/library/collections.html#defaultdict-objects

Slightly simpler.
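
A minimal sketch of how that might look here (reusing the question's list_dictionaries; the code is mine, not from this answer):

import collections

comb_dict = collections.defaultdict(int)
for dictionary in list_dictionaries:
    for key, value in dictionary.items():
        # missing keys start at 0, so we can add unconditionally
        comb_dict[key] += value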

S.Lott
`update` overwrites values
katrielalex
+1  A: 

This could be fast too, but it really depends on your data. It avoids mutating dicts or building extra lists - just one set of all the keys and lots of reads :-)

from itertools import chain

def union(dict_list):
    all_keys = set(chain.from_iterable(dict_list))
    for key in all_keys:
        yield key, sum(d.get(key, 0) for d in dict_list)

combined = dict(union(dict_list))
THC4k
Although this uses more sophisticated functions, I cannot imagine that this will be faster (but I could be wrong). In the OP's code, the list of dictionaries is only traversed once, as is every dictionary. In your code, every dictionary gets traversed once to create the set of keys, and then the list of dicts gets traversed `len(all_keys)` times.
Felix Kling
Felix Kling: Well, I just tried it; when I add an iterator (see revisions ;-) to traverse only once, it gets even slower. I guess that extra hashing from the set is the problem.
THC4k
A: 

You could take some inspiration from Google's MapReduce. From what I understand, it was designed to solve just this type of problem.
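
One way the map/shuffle/reduce phases might look for this problem (the function name and the use of itertools.groupby are my own choices, not something given in this answer):

from itertools import groupby
from operator import itemgetter

def map_reduce_sum(list_dictionaries):
    # "map": emit a (key, value) pair for every item in every dictionary
    pairs = [(key, value) for d in list_dictionaries for key, value in d.items()]
    # "shuffle": bring pairs with the same key together (sorting stands in for the shuffle phase)
    pairs.sort(key=itemgetter(0))
    # "reduce": sum the values within each key's group
    return {key: sum(value for _, value in group)
            for key, group in groupby(pairs, key=itemgetter(0))}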

Swizec Teller
+9  A: 

Here are some microbenchmarks which suggest f2 (see below) might be an improvement. f2 uses iteritems, which lets you avoid an extra dict lookup in the inner loop:

import collections
import string
import random

def random_dict():
    n=random.randint(1,26)
    keys=list(string.letters)
    random.shuffle(keys)
    keys=keys[:n]
    values=[random.randint(1,100) for _ in range(n)]    
    return dict(zip(keys,values))

list_dictionaries=[random_dict() for x in xrange(100)]

def f1(list_dictionaries):
    comb_dict = {}  
    for dictionary in list_dictionaries:  
        for key in dictionary:  
            comb_dict.setdefault(key, 0)  
            comb_dict[key] += dictionary[key]  
    return comb_dict

def f2(list_dictionaries):    
    comb_dict = collections.defaultdict(int)
    for dictionary in list_dictionaries:  
        for key,value in dictionary.iteritems():  
            comb_dict[key] += value
    return comb_dict

def union( dict_list ):
    all_keys = set()
    for d in dict_list:
        for k in d:
            all_keys.add( k )
    for key in all_keys:
        yield key, sum( d.get(key,0) for d in dict_list)

def f3(list_dictionaries):
    return dict(union( list_dictionaries ))

Here are the results:

% python -mtimeit -s"import test" "test.f1(test.list_dictionaries)"
1000 loops, best of 3: 776 usec per loop
% python -mtimeit -s"import test" "test.f2(test.list_dictionaries)"
1000 loops, best of 3: 432 usec per loop    
% python -mtimeit -s"import test" "test.f3(test.list_dictionaries)"
100 loops, best of 3: 2.19 msec per loop
unutbu
Thanks! f2() actually cut about 80% of the time out for my particular application. YRMV, obviously.
thebackhand