ansaurus

Question

Fastest ways to key-wise add a list of dicts together in python

Answer 1

+2 A:

It may be proven through profiling that this isn't quite the fastest but...

import collections

a = {'x': 1.0, 'y': 0.5, 'z': 0.25 }
b = {'w': 0.5, 'x': 0.2 }
dicts = [a,b]

totals = collections.defaultdict(list)
avg = {}

for D in dicts:
    for key,value in D.iteritems():
        totals[key].append(value)

for key,values in totals.iteritems():
    avg[key] = sum(values) / len(values)

I'm guessing that allowing Python to use the built-ins sum() and len() is going to gain some performance over calculating the mean as you see new values, but I could sure be wrong about that.
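One rough way to check that guess is to time both variants with timeit. The sketch below is written for Python 3 (so items() rather than iteritems()); the function names and the repeated sample data are made up for illustration, and the timings will depend on your data and machine.

```python
import timeit
from collections import defaultdict


def mean_via_lists(dicts):
    """Collect every value per key, then average with sum() and len()."""
    collected = defaultdict(list)
    for d in dicts:
        for key, value in d.items():
            collected[key].append(value)
    return {key: sum(values) / len(values) for key, values in collected.items()}


def mean_via_running_totals(dicts):
    """Keep only a running sum and a count per key."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for d in dicts:
        for key, value in d.items():
            totals[key] += value
            counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


# Hypothetical test data: the two example dicts repeated many times.
sample = [{'x': 1.0, 'y': 0.5, 'z': 0.25}, {'w': 0.5, 'x': 0.2}] * 1000

for func in (mean_via_lists, mean_via_running_totals):
    seconds = timeit.timeit(lambda: func(sample), number=20)
    print("%s: %.4f s for 20 runs" % (func.__name__, seconds))
```

Both functions average only over the dicts that actually contain a key, as the code above does.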

Triptych 2009-08-19 16:48:07

Answer 2

+2 A:

This works:

import collections

data= [
    {'x': 1.0, 'y': 0.5, 'z': 0.25 },
    {'w': 0.5, 'x': 0.2 }
    ]

tally = collections.defaultdict(lambda: (0.0, 0))

for d in data:
    for k,v in d.items():
        total, count = tally[k]   # named total to avoid shadowing the built-in sum()
        tally[k] = (total+v, count+1)

results = {}
for k, (total, count) in tally.items():
    results[k] = total/count

print results

{'y': 0.5, 'x': 0.59999999999999998, 'z': 0.25, 'w': 0.5}

I don't know if it's faster than yours, since you haven't posted your code.

I tried in tally to avoid storing all the values again, simply accumulating the sum and count I'd need to compute the average at the end. Often, the time bottleneck in a Python program is in the memory allocator, and using less memory can help a lot with speed.

Ned Batchelder 2009-08-19 16:48:34

I've edited the question to clarify what the result *should* be

Andrew Ingram 2009-08-19 17:58:52

Answer 3

+1 A:

>>> def avg(items):
...     return sum(items) / len(items)
... 
>>> hashes = [a, b]
>>> dict([(k, avg([h.get(k) or 0 for h in hashes])) for k in set(sum((h.keys() for h in hashes), []))])
{'y': 0.25, 'x': 0.59999999999999998, 'z': 0.125, 'w': 0.25}

Explanation:

The set of keys in all of the hashes, no repeats.
```
set(sum((h.keys() for h in hashes), []))
```
The average value for each key in the above set, using 0 if the value doesn't exist in a particular hash.
```
(k, avg([h.get(k) or 0 for h in hashes]))
```
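For Python 3 the one-liner needs a small change, because h.keys() returns a view there and sum(..., []) can no longer concatenate the keys. A minimal sketch of the same idea, with a made-up function name, assuming a missing key should count as 0:

```python
def average_over_all(dicts):
    """Average each key across all dicts, counting a missing key as 0."""
    # Iterating a dict yields its keys, so union() collects every key once.
    all_keys = set().union(*dicts)
    n = len(dicts)
    return {key: sum(d.get(key, 0) for d in dicts) / n for key in all_keys}


a = {'x': 1.0, 'y': 0.5, 'z': 0.25}
b = {'w': 0.5, 'x': 0.2}
print(average_over_all([a, b]))
```

Using d.get(key, 0) rather than d.get(key) or 0 gives the same result here and skips the extra truthiness check.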

John Kugelman 2009-08-19 16:49:44

Use `item is not None` instead of `item != None`, that's a whole lot faster.

balpha 2009-08-19 17:47:28

I've edited the question to clarify what the result *should* be

Andrew Ingram 2009-08-19 17:54:29

Edited answer to match.

John Kugelman 2009-08-19 18:32:37

@balpha: can you explain the difference?

Otto Allmendinger 2009-08-19 18:50:07

@Otto Allmendinger: See these two questions: http://stackoverflow.com/questions/132988/ and http://stackoverflow.com/questions/26595/

balpha 2009-08-20 06:33:03

Answer 4

A:

Your bottleneck may be excessive memory use. Consider using iteritems, which returns an iterator instead of building a full list of key-value pairs.

Since you say your data is sparse, that will probably not be the most efficient. Consider this alternate usage of iterators:

dicts = ... #Assume this is your dataset
totals = {}
lengths = {}
means = {}
for d in dicts:
    for key,value in d.iteritems():
        totals.setdefault(key,0)
        lengths.setdefault(key,0)
        totals[key] += value
        lengths[key] += 1
for key,value in totals.iteritems():
    means[key] = value / lengths[key]

Here totals, lengths, and means are the only data structures you create. This ought to be fairly speedy, since it avoids creating auxiliary lists and visits each key-value pair exactly once.

Here's a second approach that I doubt will be an improvement in performance over the first, but it theoretically could, depending on your data and machine, since it will require less memory allocation:

dicts = ... #Assume this is your dataset
key_set = set()
for d in dicts: key_set.update(d.keys())
means = {}
def get_total(dicts, key):
    vals = (d[key] for d in dicts if key in d)
    return sum(vals)
def get_length(dicts, key):
    vals = (1 for d in dicts if key in d)
    return sum(vals)
def get_mean(dicts,key):
    return get_total(dicts,key)/get_length(dicts,key)
for key in key_set:
    means[key] = get_mean(dicts,key)

You do end up looping through all dictionaries twice for each key, but need no intermediate data structures other than the key_set.

David Berger 2009-08-19 17:45:58

Answer 5

A:

scipy.sparse supports sparse matrices -- the dok_matrix form seems reasonably suited to your needs (you'll have to use integer coordinates, though, so a separate pass will be needed to collect and put in any arbitrary but definite order the string keys you currently have). If you have a huge number of very large and sparse "arrays", the performance gains might possibly be worth the complications.

Alex Martelli 2009-08-19 18:05:44

Answer 6

A:

It's simple but this could work:

a = { 'x': 1.0, 'y': 0.5, 'z': 0.25 }
b = { 'w': 0.5, 'x': 0.2 }

ds = [a, b]
result = {}

for d in ds:
    for k, v in d.iteritems():
        result[k] = v + result.get(k, 0)

n = len(ds)
result = dict((k, amt/n) for k, amt in result.iteritems())

print result

I have no idea how it compares to your method since you didn't post any code.
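On newer Pythons (2.7 and 3.x) the key-wise sum can also be sketched with collections.Counter, whose update() adds values rather than replacing them. The function name below is made up for illustration, and like the loop above it divides by the number of dicts, so a missing key counts as 0:

```python
from collections import Counter


def keywise_mean(dicts):
    """Sum values key by key, then divide by the number of dicts."""
    totals = Counter()
    for d in dicts:
        # Counter.update() with a mapping adds to the stored values.
        totals.update(d)
    n = len(dicts)
    return {key: total / n for key, total in totals.items()}


a = {'x': 1.0, 'y': 0.5, 'z': 0.25}
b = {'w': 0.5, 'x': 0.2}
print(keywise_mean([a, b]))
```

Whether this is any faster than the plain loop would need profiling on real data.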

Steve Losh 2009-08-19 19:07:34
