ansaurus

Question

Summarizing a dictionary of arrays in Python

Answer 1

+6 A:

It's easy to do with a sort:

sorted(mydict.iteritems(), key=lambda tup: sum(tup[1]), reverse=True)[:3]

This is reasonable if the ratio is similar to this one (3 / 5). If it's larger, you'll want to avoid the sort (O(n log n)), since top 3 can be done in O(n). For instance, using heapq, the heap module:

heapq.nlargest(3, mydict.iteritems(), key=lambda tup: sum(tup[1]))

This is O(n + 3 log n), since assembly the initial heap is O(n) and re-heapifying is O(log n).

EDIT: If you're using Python 2.7 or later, you can easily convert to a OrderedDict (equivalent version for Python 2.4+):

OrderedDict(heapq.nlargest(3, mydict.iteritems(), key=lambda tup: sum(tup[1])))

OrderedDict has the same API as dict, but remembers insertion order.

Matthew Flaschen 2010-08-05 01:04:44

How do you get O(n + 3 log n), it should be O(n log k), or when k = 3 the constant cancels out and you'll get O(n)

Ants Aasma 2010-08-05 14:18:09

In my real world example, it's rather the top 100 out of a couple hundred thousand, therefore the heapq example is probably to be preferred. Thanks.

poezn 2010-08-05 17:43:08

Just realized that this doesn't give me a dictionary but an array of arrays. Any ideas?

poezn 2010-08-05 20:52:02

@Ants, can you explain how you calculate that? Wikipedia [gives](http://en.wikipedia.org/wiki/Selection_algorithm#Language_support) O(n + k log n), and I think that checks out, based on the reason I explained in my answer.

Matthew Flaschen 2010-08-06 00:07:42

Answer 2

+2 A:

For such a small slice it's not worth using islice

sorted(mydict.iteritems(), key=lambda (k,v): sum(v), reverse=True)[:3]

gnibbler 2010-08-05 01:14:29

Answer 3

+1 A:

The best that I can come up with is:

  decorated = ((sum(mydict[key]), key) for key in mydict)  
  topn = heapq.nlargest(n, decorated)
  result = dict([(key, mydict[key]) for s, key in topn])

Using the decorate-sort-undecorate pattern eliminates the need to pass in a lambda so that as much code as possible can run in C. (I don't know if this applies to the constructor function of the heap but it's a good habit to get into anyways)

another advantage of this pattern in this case is that it lets us recover the keys after the sort to build the result dictionary

aaronasterling 2010-08-06 00:45:25

ansaurus

tags:

views:

answers:

Summarizing a dictionary of arrays in Python

related questions