tags:

views:

597

answers:

7

I have a collections.defaultdict(int) that I'm building to keep count of how many times a key shows up in a set of data. I later want to be able to sort it (obviously by turning it into a list first) in a descending fashion, ordered with the highest values first. I created my dictionary like the following:

adict = defaultdict(int)

later on I do a bunch of:

adict['someval'] += 1
adict['anotherval'] +=1
adict['someval'] += 1

Ideally after that I'd like to get a print out of:

someval => 2
anotherval => 1
+3  A: 

Just sort the resulting dict by values:

for k, v in sorted(adict.items(), key=lambda kv: kv[1], reverse=True):
    print("%s => %s" % (k,v))
Ants Aasma
+2  A: 
from collections import defaultdict
adict = defaultdict(int)

adict['a'] += 1
adict['b'] += 3
adict['c'] += 5
adict['d'] += 2

for key, value in sorted(adict.items(), lambda a, b: cmp(a[1], b[1]), reverse=True):
    print "%r => %r" % (key, value)

>>> 
'c' => 5
'b' => 3
'd' => 2
'a' => 1

 

truppo
+14  A: 

To get the dictionary sorted:

from operator import itemgetter

sorted(adict.iteritems(), key=itemgetter(1), reverse=True)
Nadia Alramli
+1 This is a better answer than the accepted one for the reason J.F. Sebastian cites: more direct, only one use of key lookup per item.
hughdbrown
+7  A: 

A dict's keys, reverse-sorted by the corresponding values, can best be gotten as

sorted(adict, key=adict.get, reverse=True)

since you want key/value pairs, you could work on the items as all other answers suggest, or (to use the nifty adict.get bound method instead of itemgetters or weird lambdas;-),

[(k, adict[k]) for k in sorted(adict, key=adict.get, reverse=True)]

Edit: in terms of performance, there isn't much into it either way:

$ python -mtimeit -s'adict=dict((x,x**2) for x in range(-5,6))' '[(k, adict[k]) for k in sorted(adict, key=adict.get, reverse=True)]'
100000 loops, best of 3: 10.8 usec per loop
$ python -mtimeit -s'adict=dict((x,x**2) for x in range(-5,6)); from operator import itemgetter' 'sorted(adict.iteritems(), key=itemgetter(1), reverse=True)'
100000 loops, best of 3: 9.66 usec per loop
$ python -mtimeit -s'adict=dict((x,x**2) for x in range(-5,6))' 'sorted(adict.iteritems(), key=lambda (k,v): v, reverse=True)'
100000 loops, best of 3: 11.5 usec per loop

So, the .get-based solution is smack midway in performance between the two items-based ones -- slightly slower than the itemgetter, slightly faster than the lambda. In "bottleneck" cases, where those microsecond fractions are crucial to you, by all means do focus on that. In normal cases, where this operation is only one step within some bigger task and a microsecond more or less matters little, focusing on the simplicity of the get idiom is, however, also a reasonable alternative.

Alex Martelli
`adict.get` variant does key lookup twice for each item the dict. `[(k, v) for k, v in sorted(adict.iteritems(), key=itemgetter(1), reverse=True)]` does it once.
J.F. Sebastian
If you wanted keys with the same value sorted as well, is there something better than a "wierd lambda"?
tgray
@J.F. Sebastian: for that matter, you can drop the list comprehension stuff and just use: `sorted(adict.iteritems(), key=itemgetter(1), reverse=True)`. Like Nadia does.
hughdbrown
Found it, using itemgetter(1, 0) does the same as lambda (k, v): (v, k) and an order of magnitude faster.
tgray
A: 

"Invert" a dictionary.

from collections import defaultdict
inv_dict = defaultdict( list )
for key, value in adict:
    inv_dict[value].append( key )
max_value= max( inv_dict.keys() )

The set of keys with the maximum occurrence --

inv_dict[max_value]

The set of keys in descending order by occurrence --

for value, key_list in sorted( inv_dict ):
    print key_list, value
S.Lott
+1  A: 

If you're using the newest python 2.7 alpha, then you can use the Counter class in collections module:

c = Counter()

c['someval'] += 1
c['anotherval'] += 1
c['someval'] += 1

print c.most_common()

prints in the correct order:

[('someval', 2), ('anotherval', 1)]

The code used on 2.7 is available already and there's a version adapted to 2.5. Perhaps you want to use it to stay forward compatible with the native stdlib version that is about to be released.

nosklo
A: 

Note: I'm putting this in as an answer so that it gets seen. I don't want upvotes. If you want to upvote anyone, upvote Nadia.

The currently accepted answer gives timing results which are based on a trivially small dataset (size == 6 - (-5) == 11). The differences in cost of the various methods are masked by the overhead. A use case like what are the most frequent words in a text or most frequent names in a membership list or census involves much larger datasets.

Repeating the experiment with range(-n,n+1) (Windows box, Python 2.6.4, all times in microseconds):

n=5: 11.5, 9.34, 11.3
n=50: 65.5, 46.2, 68.1
n=500: 612, 423, 614

These results are NOT "slightly" different. The itemgetter answer is a clear winner on speed.

There was also mention of "the simplicity of the get idiom". Putting them close together for ease of comparison:

[(k, adict[k]) for k in sorted(adict, key=adict.get, reverse=True)] sorted(adict.iteritems(), key=itemgetter(1), reverse=True)

The get idiom not only looks up the dict twice (as J. F. Sebastian has pointed out), it makes one list (result of sorted()) then iterates over that list to create a result list. I'd call that baroque, not simple. YMMV.

John Machin