Hello,

in my code I obtain two lists from different sources, and I know they are in the same order. The first list ("names") contains key strings, while the second ("result_values") contains floats. I need to make the keys unique, but I can't simply build a dictionary, as only the last value inserted for each key would be kept. Instead, I need to take the average (arithmetic mean) of the values that share a duplicate key.

Example of the wanted results:

names = ["pears", "apples", "pears", "bananas", "pears"]
result_values = [2, 1, 4, 8, 6] # ints here but it's the same conceptually

combined_result = average_duplicates(names, result_values)

print combined_result

{"pears": 4, "apples": 1, "bananas": 8}

My only ideas involve multiple iterations and so far have been ugly... is there an elegant solution to this problem?

+3  A: 
from collections import defaultdict
def averages(names, values):
    # Group the items by name.
    value_lists = defaultdict(list)
    for name, value in zip(names, values):
        value_lists[name].append(value)

    # Take the average of each list.
    result = {}
    for name, group in value_lists.iteritems():
        result[name] = sum(group) / float(len(group))
    return result

names = ["pears", "apples", "pears", "bananas", "pears"]
result_values = [2, 1, 4, 8, 6]
print averages(names, result_values)
Glenn Maynard
Exactly what I was typing :)
larsmans
Thanks, I'll give it a go.
Einar
mine's better :P
aaronasterling
@aaronasterling: Yours doesn't work D:
Glenn Maynard
Ok, two fixed typos and it works. Now mine's better ;)
aaronasterling
+3  A: 

I would use a dictionary anyways

averages = {}
counts = {}
for name, value in zip(names, result_values):
    if name in averages:
        averages[name] += value
        counts[name] += 1
    else:
        averages[name] = value
        counts[name] = 1
for name in averages:
    averages[name] = averages[name]/float(counts[name]) 

If you're concerned with large lists, then I would replace zip with izip from itertools.
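
A minimal sketch of that substitution, assuming the same running-sum approach as above. The names `lazy_zip` and `average_by_name` are made up for illustration; the `try`/`except` is only there so the snippet also runs on Python 3, where `izip` no longer exists because the built-in `zip` is already lazy.

```python
# Sketch: iterate over the pairs lazily instead of building a full list first.
try:
    from itertools import izip as lazy_zip  # Python 2: zip() builds a whole list
except ImportError:
    lazy_zip = zip  # Python 3: the built-in zip() is already lazy


def average_by_name(names, values):
    """Hypothetical helper: running sums and counts, one pass over the pairs."""
    sums = {}
    counts = {}
    for name, value in lazy_zip(names, values):
        sums[name] = sums.get(name, 0.0) + value  # 0.0 forces float arithmetic
        counts[name] = counts.get(name, 0) + 1
    return dict((name, sums[name] / counts[name]) for name in sums)


names = ["pears", "apples", "pears", "bananas", "pears"]
result_values = [2, 1, 4, 8, 6]
print(average_by_name(names, result_values))
```

This only avoids the intermediate list of pairs; the two dictionaries still grow with the number of distinct names.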

aaronasterling
You have strange indentation in the last lines.
Constantin
@Constantin, That was one more typo than I thought I had. Good looking out.
aaronasterling
Finally I can +1 this answer :)
Constantin
+1  A: 

I think what you're looking for is itertools.groupby:

import itertools

def average_duplicates(names, values):
  pairs = sorted(zip(names, values))
  result = {}
  for key, group in itertools.groupby(pairs, key=lambda p: p[0]):
    group_values = [value for (_, value) in group]
    result[key] = sum(group_values) / float(len(group_values))
  return result

See also zip and sorted.

Constantin
How would it fare, performance-wise, compared to the other solutions? I'm interested as I may have long lists, so performance may be an issue.
Einar
@Einar, it could be faster, because `groupby` does not create a copy of data, and it could be slower because of `sorted`. I'll have to measure.
Constantin
More precisely, none of these copy the data--they only create new containers holding them. The data itself just gets new references taken. `groupby` doesn't create a new list, but note that `zip`, `sorted` and the list comprehension for `group_values` do.
Glenn Maynard
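The point in the comment above can be checked with a small sketch. It uses one-element lists as values purely so that object identity is easy to observe; that choice is an assumption for the demonstration, not part of the original problem.

```python
# Sketch: sorted() and zip() build new containers, but the elements inside
# are the same objects as in the inputs (only references are taken).
names = ["pears", "apples", "pears"]
values = [[2.0], [1.0], [4.0]]  # lists as values, so identity is visible

pairs = sorted(zip(names, values))

# The outer list is a new container...
print(pairs is values)  # False
# ...but every value in it is the very same object that `values` holds.
print(all(any(v is original for original in values) for _, v in pairs))  # True
```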
Also note that this one will probably be affected by the order of `names`: `sorted` will be much faster if it's already partially sorted than if not.
Glenn Maynard
@Einar, aaronasterling's solution is fastest when the number of distinct names is large. When there are only a few distinct names, Glenn Maynard's solution is fastest. My solution is at least 3x slower than either on large lists. This is for `Python 2.6.5 (r265:79096, Mar 19 2010, 21:48:26) [MSC v.1500 32 bit (Intel)] on win32`.
Constantin
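For anyone who wants to repeat this kind of measurement, here is a rough sketch using `timeit`. The three functions are re-typed versions of the approaches in this thread, and `make_data`, the list size and the distinct-name counts are arbitrary assumptions, not the setup used for the numbers above. Timings depend on the machine and Python version, so none are shown.

```python
import itertools
import random
import timeit
from collections import defaultdict


def by_lists(names, values):
    # Group the values per name, then average each list.
    groups = defaultdict(list)
    for name, value in zip(names, values):
        groups[name].append(value)
    return dict((n, sum(g) / float(len(g))) for n, g in groups.items())


def by_sums(names, values):
    # Keep a running sum and a count per name.
    sums, counts = {}, {}
    for name, value in zip(names, values):
        sums[name] = sums.get(name, 0.0) + value
        counts[name] = counts.get(name, 0) + 1
    return dict((n, sums[n] / counts[n]) for n in sums)


def by_groupby(names, values):
    # Sort the pairs so equal names are adjacent, then group them.
    result = {}
    pairs = sorted(zip(names, values))
    for key, group in itertools.groupby(pairs, key=lambda p: p[0]):
        vals = [v for _, v in group]
        result[key] = sum(vals) / float(len(vals))
    return result


def make_data(size, distinct, seed=0):
    # Hypothetical test data: `size` floats spread over up to `distinct` names.
    rng = random.Random(seed)
    names = ["name%d" % rng.randrange(distinct) for _ in range(size)]
    values = [rng.random() for _ in range(size)]
    return names, values


if __name__ == "__main__":
    for distinct in (10, 10000):
        names, values = make_data(20000, distinct)
        for func in (by_lists, by_sums, by_groupby):
            seconds = timeit.timeit(lambda: func(names, values), number=5)
            print("%-10s distinct=%-6d %.4f s" % (func.__name__, distinct, seconds))
```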
Thanks, will update with aaronasterling's solution then, as I may have a large number of distinct names and few duplicates.
Einar
+2  A: 

You could calculate the mean using a Cumulative moving average to only iterate through the lists once:

from collections import defaultdict
averages = defaultdict(float)
count = defaultdict(int)

for name,result in zip(names,result_values):
    count[name] += 1
    averages[name] += (result - averages[name]) / count[name]
Dave Webb
Interesting tip, I'll use it for larger data sets.
Einar
The "Cumulative moving average" will give you the same result as the standard mean so you could use it for all your data sets.
Dave Webb
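The equivalence mentioned in the comment above follows from the update rule new_mean = old_mean + (x - old_mean) / n. A small sketch comparing it with the ordinary mean on the question's data (the function names `cumulative_means` and `plain_means` are invented for this illustration):

```python
from collections import defaultdict


def cumulative_means(names, values):
    # One pass: nudge each running mean toward the new value by 1/count.
    means = defaultdict(float)
    counts = defaultdict(int)
    for name, value in zip(names, values):
        counts[name] += 1
        means[name] += (value - means[name]) / counts[name]
    return dict(means)


def plain_means(names, values):
    # Reference: collect every value per name, then divide sum by count.
    groups = defaultdict(list)
    for name, value in zip(names, values):
        groups[name].append(value)
    return dict((n, sum(g) / float(len(g))) for n, g in groups.items())


names = ["pears", "apples", "pears", "bananas", "pears"]
result_values = [2, 1, 4, 8, 6]

# For "pears" the running mean goes 2.0, then 3.0, then 4.0.
print(cumulative_means(names, result_values) == plain_means(names, result_values))  # True
```

With arbitrary floats the two can differ in the last few bits because the rounding happens in a different order, so comparing with a small tolerance is safer than `==` in general.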
A: 
>>> def avg_list(keys, values):
...     def avg(series):
...             return sum(series) / len(series)
...     from collections import defaultdict
...     d = defaultdict(list)
...     for k, v in zip(keys, values):
...             d[k].append(v)
...     return dict((k, avg(v)) for k, v in d.iteritems())
... 
>>> if __name__ == '__main__':
...     names = ["pears", "apples", "pears", "bananas", "pears"]
...     result_values = [2, 1, 4, 8, 6]
...     print avg_list(names, result_values)
... 
{'apples': 1, 'pears': 4, 'bananas': 8}

You can have avg() divide by float(len(series)) instead of len(series) if you want a floating-point average.

hughdbrown