views:

100

answers:

5

I am processing a CSV file and counting the unique values of column 4. So far I have coded this three ways. One uses "if key in dictionary", the second traps the KeyError and the third uses "DefaultDictionary". For example (where x[3] is the value from the file and "a" is a dictionary):

First way:

if x[3] in a:
    a[x[3]] += 1
else:
    a[x[3]] = 1

Second way:

try:
    b[x[3]] += 1
except KeyError:
    b[x[3]] = 1

Third way:

from collections import defaultdict
c = defaultdict(int)
c[x[3]] += 1

My question is: which way is more efficient... cleaner... better... etc. Or is there a better way. Both ways work and give the same answer, but I thought I would tap the hive mind as a learning case.

Thanks -

+1  A: 
from collections import Counter
Counter(a)
SilentGhost
"a" is my dictionary. Is this "Counter(x[3])?
Count Boxer
Isn't "counter" new in 3.1? I'm running 2.7.
Count Boxer
@Count: [it exist in 2.7](http://docs.python.org/library/collections.html#collections.Counter)
SilentGhost
In Count Boxer's example, `a` is his count dictionary, not his source data. Implicit in his 3 examples is that they are in a body of a for-loop iterating over his csv file. All that to say, `a` shouldn't be passed to the constructor, because it's his result, not his source.
Steven Rumbalski
Dang it! Lied to by my IDE! It didn't show up in the drop down list, so I just assumed...
Count Boxer
A: 

Since you don't have access to Counter, your best bet is your third approach. It's much cleaner and easier to read. In addition, it doesn't have the perpetual testing (and branching) that the first two approaches have, which makes it more efficient.

chrisaycock
+5  A: 

Use collections.Counter. Counter is syntactic sugar for defaultdict(int), but what's cool about it is that it accepts an iterable in the constructor, thus saving an extra step (I assume all of your examples above are wrapped in a for-loop.)

from collections import Counter
count = Counter(x[3] for x in my_csv_reader)

Prior to the introduction of collections.Counter, collections.defaultdict was the most idiomatic for this task, so for users < 2.7, use defaultdict.

from collections import defaultdict
count = defaultdict(int)
for x in my_csv_reader:
    count[x[3]] += 1
Steven Rumbalski
I'm going to claim Newbie status on this clarification. Does the "x" in the "for x" portion refer to the individual line returned by the csv.reader? I "think" I implemented the csv reader:my_csv_reader = csv.reader(open('c:/Test.csv', 'rb'), delimiter=',', quotechar='|')d = Counter(x[3] for x in my_csv_reader)but d is more than twice as large as the other three methods.
Count Boxer
Yes, `x` refers to the line returned by the csv reader. What do you mean by d being more than twice as large? Is it giving the wrong counts? Please clarify.
Steven Rumbalski
Clarification: My code is incorrect and my question was not complete! There is an if statement surrounding the first three methods that removed duplicate values by matching the current line against the previous line before the value is added to the dictionary. The new method is not in that condition, so I got a total of all lines, not the unique ones!
Count Boxer
@Count Boxer I see, you'll surely need to code around that. One way would be to use a `set`. Something like `my_data = set(my_csv_reader)`, and then `d = Counter(x[3] for x in my_data)`.
Steven Rumbalski
@everybody: Counter is slower. See my answer.
John Machin
+1  A: 

You asked which was more efficient. Assuming that you are talking about execution speed: If your data is small, it doesn't matter. If it is large and typical, the "already exists" case will happen much more often than the "not in dict" case. This observation explains some of the results.

Below is some code which can be used with the timeit module to explore speed without file-reading overhead. I have taken the liberty of adding a 5th method, which is not uncompetetive and will run on any Python from at least 1.5.2 [tested] onwards.

from collections import defaultdict, Counter

def tally0(iterable):
    # DOESN'T WORK -- common base case for timing
    d = {}
    for item in iterable:
        d[item] = 1
    return d

def tally1(iterable):
    d = {}
    for item in iterable:
        if item in d:
            d[item] += 1
        else:
            d[item] = 1
    return d

def tally2(iterable):
    d = {}
    for item in iterable:
        try:
            d[item] += 1
        except KeyError:
            d[item] = 1
    return d

def tally3(iterable):
    d = defaultdict(int)
    for item in iterable:
        d[item] += 1

def tally4(iterable):
    d = Counter()
    for item in iterable:
        d[item] += 1

def tally5(iterable):
    d = {}
    dg = d.get
    for item in iterable:
        d[item] = dg(item, 0) + 1
    return d

Typical run (in Windows XP "Command Prompt" window):

prompt>\python27\python -mtimeit -s"t=1000*'now is the winter of our discontent made glorious summer by this son of york';import tally_bench as tb" "tb.tally1(t)"
10 loops, best of 3: 29.5 msec per loop

Here are the results (msec per loop):

0 base case   13.6
1 if k in d   29.5
2 try/except  26.1
3 defaultdict 23.4
4 Counter     79.4
5 d.get(k, 0) 29.2

Another timing trial:

prompt>\python27\python -mtimeit -s"from collections import defaultdict;d=defaultdict(int)" "d[1]+=1"
1000000 loops, best of 3: 0.309 usec per loop

prompt>\python27\python -mtimeit -s"from collections import Counter;d=Counter()" "d[1]+=1"
1000000 loops, best of 3: 1.02 usec per loop

The speed of Counter is possibly due to it being implemented partly in Python code whereas defaultdict is entirely in C (in 2.7, at least).

Note that Counter() is NOT just "syntactic sugar" for defaultdict(int) -- it implements a full bag aka multiset object -- see the docs for details; they may save you from reinventing the wheel if you need some fancy post-processing. If all you want to do is count things, use defaultdict.

Update in response to a question from @Steven Rumbalski: """ I'm curious, what happens if you move the iterable into the Counter constructor: d = Counter(iterable)? (I have python 2.6 and cannot test it.) """

tally6: just does d = Count(iterable); return d, takes 60.0 msecs

You could look at the source (collections.py in the SVN repository) ... here's what my Python27\Lib\collections.py does when iterable is not a Mapping instance:

            self_get = self.get
            for elem in iterable:
                self[elem] = self_get(elem, 0) + 1

Seen that code anywhere before? There's a whole lot of carry-on just to call code that's runnable in Python 1.5.2 :-O

John Machin
Nice timings. I'm curious, what happens if you move the iterable into the Counter constructor: `d = Counter(iterable)`? (I have python 2.6 and cannot test it.)
Steven Rumbalski
@Steven Rumbalski: see my updated answer.
John Machin
A: 

Use setdefault.

a[x[3]] = a.setdefault(x[3], 0) + 1

setdefault gets the value of the specified key (x[3] in this case), or if it does not exist, the specified value (0 in this case).

kindall
@kindall: Please explain why you think that `d[k] = d.setdefault(k, 0) + 1` would be better that `d[k] = d.get(k, 0) + 1`
John Machin
Yes, you are right. I was thinking of it because it would set the value, but you're doing that anyway.
kindall