views:

122

answers:

3

I got the following dictionary:

mydict = {
  'foo': [1,19,2,3,24,52,2,6],          # sum: 109
  'bar': [50,5,9,7,66,3,2,44],          # sum: 186
  'another': [1,2,3,4,5,6,7,8],         # sum:  36
  'entry': [0,0,0,2,99,4,33,55],        # sum: 193
  'onemore': [21,22,23,24,25,26,27,28]  # sum: 196
}

I need to efficiently filter out and sort the top x entries by the sum of the array.

For example, the Top 3 sorted and filtered list for the example above would be

sorted_filtered_dict = {
  'onemore': [21,22,23,24,25,26,27,28], # sum: 196
  'entry': [0,0,0,2,99,4,33,55],        # sum: 193
  'bar': [50,5,9,7,66,3,2,44]           # sum: 186
}

I'm fairly new to Python, and tried it myself with chaining a sum and filter function on a lambda function, but struggled with the actual syntax.

+6  A: 

It's easy to do with a sort:

sorted(mydict.iteritems(), key=lambda tup: sum(tup[1]), reverse=True)[:3]

This is reasonable if the ratio is similar to this one (3 / 5). If it's larger, you'll want to avoid the sort (O(n log n)), since top 3 can be done in O(n). For instance, using heapq, the heap module:

heapq.nlargest(3, mydict.iteritems(), key=lambda tup: sum(tup[1]))

This is O(n + 3 log n), since assembly the initial heap is O(n) and re-heapifying is O(log n).

EDIT: If you're using Python 2.7 or later, you can easily convert to a OrderedDict (equivalent version for Python 2.4+):

OrderedDict(heapq.nlargest(3, mydict.iteritems(), key=lambda tup: sum(tup[1])))

OrderedDict has the same API as dict, but remembers insertion order.

Matthew Flaschen
How do you get O(n + 3 log n), it should be O(n log k), or when k = 3 the constant cancels out and you'll get O(n)
Ants Aasma
In my real world example, it's rather the top 100 out of a couple hundred thousand, therefore the heapq example is probably to be preferred. Thanks.
poezn
Just realized that this doesn't give me a dictionary but an array of arrays. Any ideas?
poezn
@Ants, can you explain how you calculate that? Wikipedia [gives](http://en.wikipedia.org/wiki/Selection_algorithm#Language_support) O(n + k log n), and I think that checks out, based on the reason I explained in my answer.
Matthew Flaschen
+2  A: 

For such a small slice it's not worth using islice

sorted(mydict.iteritems(), key=lambda (k,v): sum(v), reverse=True)[:3]
gnibbler
+1  A: 

The best that I can come up with is:

  decorated = ((sum(mydict[key]), key) for key in mydict)  
  topn = heapq.nlargest(n, decorated)
  result = dict([(key, mydict[key]) for s, key in topn])

Using the decorate-sort-undecorate pattern eliminates the need to pass in a lambda so that as much code as possible can run in C. (I don't know if this applies to the constructor function of the heap but it's a good habit to get into anyways)

another advantage of this pattern in this case is that it lets us recover the keys after the sort to build the result dictionary

aaronasterling