views:

395

answers:

8

This is my first python program -

Requirement: Read a file consisting of {adId UserId} in each line. For each adId, print the number of unique userIds.

Here is my code, put together from reading the python docs. Could you give me feedback on how I can write this in more python-ish way?

CODE :

import csv

adDict = {}
reader = csv.reader(open("some.csv"), delimiter=' ')
for row in reader:
    adId = row[0]
    userId = row[1]
    if ( adId in adDict ):
        adDict[adId].add(userId)
    else:
        adDict[adId] = set(userId)

for key, value in adDict.items():
    print (key, ',' , len(value))

Thanks.

+7  A: 

You could shorten the for-loop to this:

for row in reader:
  adDict.setdefault(row[0], set()).add(row[1])
sth
+1 - nice! I wish I'd known about the setDefault method earlier!
Smashery
Cool - so setdefault(row[0], set()) returns a set instance and then we do an add on that.
Schitti
+1  A: 

The only changes I'd make are extracting multiple elements from the reader at once, and using string formatting for print statements.

import csv

adDict = {}
reader = csv.reader(open("some.csv"), delimiter=' ')
# Can extract multiple elements from a list in the iteration statement:
for adId, userId in reader: 
    if ( adId in adDict ):
        adDict[adId].add(userId)
    else:
        adDict[adId] = set(userId)

for key, value in adDict.items():
    # I believe this gives you more control over how things are formatted:
    print ("%s, %d" % (key, len(value)))
Smashery
+1 - though note that Python will throw an exception if the CSV file passed does not have exactly 2 columns in every row.
Tom Leys
+15  A: 

Congratulations, your code is very nice. There are a few little tricks you could use to make it shorter/simpler.

There is a nifty object type called defaultdict which is provided by the collections module. Instead of having to check if adDict has an adId key, you can set up a defaultdict which acts like a regular dict, except that it automatically provides you with an empty set() when there is no key. So you can change

if ( adId in adDict ):
    adDict[adId].add(userId)
else:
    adDict[adId] = set(userId)

to simply

adDict[adId].add(userId)

Also, instead of

for row in reader:
    adId = row[0]
    userId = row[1]

you could shorten that to

for adId,userId in reader:

Edit: As Parker kindly points out in the comments,

for key, value in adDict.iteritems():

is the most efficient way to iterate over a dict, if you are going to use both the key and value in the loop. In Python3, you can use

for key, value in adDict.items():

since items() returns an iterator.

#!/usr/bin/env python
import csv
from collections import defaultdict

adDict = defaultdict(set)
reader = csv.reader(open("some.csv"), delimiter=' ')
for adId,userId in reader:
    adDict[adId].add(userId)
for key,value in adDict.iteritems():
    print (key, ',' , len(value))
unutbu
Cool - I'm going to read up on collections for more nifty tricks. Also, why would we say "import csv" but have to say "from collections import defaultdict"?
Schitti
@Schitti - using from is a shortcut. It means you can type defaultdict rather than collections.defaultdict . It is totally optional, though by importing the symbol defaultdict into the module's namespace has some other useful effects that an advanced programmer might like.
Tom Leys
@Schitti - Either form will work. If you just want to use the `import` statement, you can say `import collections.defaultdict` - but then, every time you use it, you have to refer to it as `collections.defaultdict`. Instead, as you can see in ~unutbu's answer, you now only have to refer to it as `defaultdict`. You can also change the name of a module as you import it; for instance, `import collections.defaultdict as defDi` or `from collections import defaultdict as defDi` - you can then use `defDi` instead of `defaultdict`
Smashery
`for key, value in d.items()` is much better practice than `for key in d : value = d[key]`. The first version visits each element exactly once. The second version visits each element once to get the list of keys and then must then lookup each key in the dict. It's O(n) versus O(n * log n).I think in this particular case, the readability argument could go either way, but the performance difference is significant enough (especially for large dicts) that the first version is the definite winner.
Parker
@Parker, thanks for your interesting comment! You had me convinced, but then I tried to write some demonstration code. To my surprise, it seems `key in d` might be faster than `key,value in d.items()`. The is the code I tested: http://paste.ubuntu.com/341327/.`python -mtimeit -s"import test" "test.key_in_d()"` took 409 ms per loop, `python -mtimeit -s"import test" "test.key_in_d_items()"` took 631 ms per loop. What do you think?
unutbu
@~unutbu, a couple of things. I'm guessing you're on Python2 where items() returns a list. I was thinking of Python3 where it returns a generator. To get the same behaviour in Python2 you need to use the somewhat uglier iteritems().Secondly, you aren't doing anything with value inside the loop. It might be that Python was smart enough to just optimise parts of it away. Maybe try creating a list of key, value tuples just to ensure that both parts are used.
Parker
@~unutbu, I just tested it here. "for key in d" was 335ms, "for key, value in d.items()" was indeed higher at 561ms, but "for key, value in d.iteritems()" was the definite winner at 206ms. Obviously for such a large dict, creating a list of key value pairs took quite some time and memory. :)
Parker
@Parker: Wonderful. Thanks for pointing this out. I'm editing my answer accordingly.
unutbu
+1  A: 

Just a few bits and pieces:

For extracting the row list into variables:

adId, userId = row

The if statement does not need braces:

if adId in adDict:

You could use exceptions to handle a missing Key in the dict, but both ways work well, e.g.:

try:
    adDict[adId].add(userId)
except KeyError:
    adDict[adId] = set(userId)
Gerald Kaszuba
Yes, although perhaps the downside is that exception based logic is more difficult to understand.
Schitti
+2  A: 

Instead of:

for row in reader:
    adId = row[0]
    userId = row[1]

Use automatic sequence unpacking:

for (adId, userId) in reader:

In:

if ( adId in adDict ):

You don't need parentheses.

Instead of:

if ( adId in adDict ):
    adDict[adId].add(userId)
else:
    adDict[adId] = set(userId)

Use defaultdict:

from collections import defaultdict
adDict = defaultDict(set)

# ...

adDict[adId].add(userId)

Or, if you're not allowed to use other modules by your professor, use setdefault():

adDict.setdefault(adId, set()).add(userId)

When printing:

for key, value in adDict.items():
    print (key, ',' , len(value))

Using string formatting might be easier to format:

print "%s,%s" % (key, len(value))

Or, if you're using Python 3:

print ("{0},{1}".format (key, len(value)))
John Millikin
Thanks John! I'm in the industry coding in C for the past 5 years, so no professor
Schitti
Ah, sorry; this sort of problem is often assigned as homework to beginning students.
John Millikin
+2  A: 

Since you only have a space-delimited file, I'd do:

from __future__ import with_statement
from collections import defaultdict

ads = defaultdict(set)
with open("some.csv") as f:
    for ad, user in (line.split(" ") for line in f):
        ads[ad].add(user)

for ad in ads:
    print "%s, %s" % (ad, len(ads[ad]))
Daniel Pryden
Thanks. This avoids creating a new reader class I presume. Could not upvote you since my reputation < 15.
Schitti
I like this. I was just going to recommend using `defaultdict(set)`. However, this code decomposes each line into a tuple, assuming that each line has only two elements in it. The OP's code did not make this assumption.
hughdbrown
@hughdbrown: You're right about the assumption that each line only has two elements. However, the OP's question does explicitly state that his file only has two elements on each line. In this case, I think it's acceptable to simply fail (with a tuple unpacking error) if there are ever a different number of elements on a line. Obviously, in a more production-hardened script, you'd want to code more defensively.
Daniel Pryden
+7  A: 

the line of code:

adDict[adId] = set(userId)

is unlikely to do what you want -- it will treat string userId as a sequence of letters, so for example if userId was aleax you'd get a set with four items, just like, say, set(['a', 'l', 'e', 'x']). Later, an .add(userId) when userId is aleax again will add a fifth item, the string 'aleax', because .add (differently from the set initializer, which takes an iterable as its argument) takes a single item as its argument.

To make a set with a single item, use set([userId]) instead.

This is a reasonably frequent bug so I wanted to explain it clearly. That being said, defaultdict as suggested in other answers is clearly the right approach (avoid setdefault, that was never a good design and doesn't have good performance either, as well as being pretty murky).

I would also avoid the kinda-overkill of csv in favor of a simple loop with a .split and .strip on each line...

Alex Martelli
Yes to `collections.defaultdict(set)`.
hughdbrown
+3  A: 

There are some great answers in here.

One trick I particularly like is to make my code easier to reuse in future like so

import csv

def parse_my_file(file_name):
     # some existing code goes here
     return aDict

if __name__ == "__main__":
     #this gets executed if this .py file is run directly, rather than imported
     aDict = parse_my_file("some.csv")
     for key, value in adDict.items():
         print (key, ',' , len(value))

Now you can import your csv parser from another module and get programmatic access to aDict.

Tom Leys