views:

1087

answers:

6

Hi, I'm a python newbie, so maybe my question is very noob. Assume I have a list of words, and I want to find the number of times each word appears in that list. Obvious way to do this is:

words = "apple banana apple strawberry banana lemon"
uniques = set(words.split())
freqs = [(item, words.split.count(item)) for item in uniques]
print(freqs)

But I find this code not very good, because this way program runs through words list twice, once to build the set, and second time counting the number of appearances. Of course, I could write a function to run through list and do the counting, but that wouldn't be so pythonic. So, is there a more efficient and pythonic way?

+3  A: 

If you don't want to use the standard dictionary method (looping through the list incrementing the proper dict. key), you can try this:

>>> from itertools import groupby
>>> myList = words.split() # ['apple', 'banana', 'apple', 'strawberry', 'banana', 'lemon']
>>> [(k, len(list(g))) for k, g in groupby(sorted(myList))]
[('apple', 2), ('banana', 2), ('lemon', 1), ('strawberry', 1)]

It runs in O(n log n) time.

Nick Presta
+13  A: 

defaultdict to the rescue!

from collections import defaultdict

words = "apple banana apple strawberry banana lemon"

d = defaultdict(int)
for word in words.split():
    d[word] += 1

This runs in O(n).

Triptych
+1, collections.defaultdict is one of my favorite containers!
Alex Martelli
I'd say O(NlogN) if collection is a tree, or O(N) in average if it's a hash
Drakosha
dict is a hash.
S.Lott
or can skip the import statement and use a hash "d = {}"
Demi
"d = {}" doesn't work, or I'm too noob to do it right.
Daniyar
no, d={} won't work. The whole point of the defaultdict is that it will automatically initialize values to zero. A standard dict won't do that (you'll get a KeyError exception)
Triptych
+4  A: 

Standard approach:

from collections import defaultdict

words = "apple banana apple strawberry banana lemon"
words = words.split()
result = collections.defaultdict(int)
for word in words:
    result[word] += 1

print result

Groupby oneliner:

from itertools import groupby

words = "apple banana apple strawberry banana lemon"
words = words.split()

result = dict((key, len(list(group))) for key, group in groupby(sorted(words)))
print result
nosklo
Is there a difference in complexity? Does groupby use sorting? Then it seems to need O(nlogn) time?
Daniyar
Oops, it seems Nick Presta below has pointed out that the groupby approach uses O(nlogn).
Daniyar
+13  A: 

If you are using python 2.7+/3.1+, there is a Counter Class in the collections module which is purpose built to solve this type of problem:

>>> from collections import Counter
>>> words = "apple banana apple strawberry banana lemon"
>>> freqs = Counter(words.split())
>>> print(freqs)
Counter({'apple': 2, 'banana': 2, 'strawberry': 1, 'lemon': 1})
>>>

Since both 2.7 and 3.1 are still in beta it's unlikely you're using it, so just keep in mind that a standard way of doing this kind of work will soon be readily available.

sykora
Wow! This is the pythonic way. Thanks for sharing this.
Daniyar
pretty cool! * Goes back to Python 2.5 :( *
Triptych
It is also in python 2.7.
nosklo
A: 

Without defaultdict:

words = "apple banana apple strawberry banana lemon"
my_count = {}
for word in words.split():
    try: my_count[word] += 1
    except KeyError: my_count[word] = 1
Thomas Weigel
Seems slower than defaultdict in my tests
nosklo
splitting by a space is redundant. Also, you should use the dict.set_default method instead of the try/except.
Triptych
It's a lot slower because you are using Exceptions. Exceptions are very costly in almost any language. Avoid using them for logic branches.Look at my solution for an almost identical method, but without using Exceptions: http://stackoverflow.com/questions/893417/item-frequency-count-in-python/983434#983434
hopla
+1  A: 
freqs = {}
for word in words:
    freqs[word] = freqs.get(word, 0) + 1 # fetch and increment OR initialize

I think this results to the same as Triptych's solution, but without importing collections. Also a bit like Selinap's solution, but more readable imho. Almost identical to Thomas Weigel's solution, but without using Exceptions.

This could be slower than using defaultdict() from the collections library however. Since the value is fetched, incremented and then assigned again. Instead of just incremented. However using += might do just the same internally.

hopla