views: 133
answers: 1
In a Python program I'm writing, I've compared two ways of counting entries in dictionaries that are stored in a list: a for loop with increment variables, and a list comprehension combined with map(itemgetter) and len(). Both methods take the same amount of time. Am I doing something wrong, or is there a better approach?

Here is a greatly simplified and shortened data structure:

list = [
  {'key1': True, 'dontcare': False, 'ignoreme': False, 'key2': True, 'filenotfound': 'biscuits and gravy'},
  {'key1': False, 'dontcare': False, 'ignoreme': False, 'key2': True, 'filenotfound': 'peaches and cream'},
  {'key1': True, 'dontcare': False, 'ignoreme': False, 'key2': False, 'filenotfound': 'Abbott and Costello'},
  {'key1': False, 'dontcare': False, 'ignoreme': True, 'key2': False, 'filenotfound': 'over and under'},
  {'key1': True, 'dontcare': True, 'ignoreme': False, 'key2': True, 'filenotfound': 'Scotch and... well... neat, thanks'}
]

Here is the for loop version:

#!/usr/bin/env python
# Python 2.6
# count the entries where key1 is True
# keep a separate count for the subset that also have key2 True

key1 = key2 = 0
for dictionary in list:
    if dictionary["key1"]:
        key1 += 1
        if dictionary["key2"]:
            key2 += 1
print "Counts: key1: " + str(key1) + ", subset key2: " + str(key2)

Output for the data above:

Counts: key1: 3, subset key2: 2

Here is the other, perhaps more Pythonic, version:

#!/usr/bin/env python
# Python 2.6
# count the entries where key1 is True
# keep a separate count for the subset that also have key2 True
from operator import itemgetter
KEY1 = 0
KEY2 = 1
getentries = itemgetter("key1", "key2")
entries = map(getentries, list)
key1 = len([x for x in entries if x[KEY1]])
key2 = len([x for x in entries if x[KEY1] and x[KEY2]])
print "Counts: key1: " + str(key1) + ", subset key2: " + str(key2)

Output for the data above (same as before):

Counts: key1: 3, subset key2: 2

I'm a tiny bit surprised these take the same amount of time. I wonder if there's something faster. I'm sure I'm overlooking something simple.

One alternative I've considered is loading the data into a database and doing SQL queries, but the data doesn't need to persist and I'd have to profile the overhead of the data transfer, etc., and a database may not always be available.
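For reference, a minimal sketch of the in-memory SQLite route (no persistence needed; the table and column names are made up for illustration, and list is the structure shown above):

import sqlite3

conn = sqlite3.connect(':memory:')   # throwaway, in-memory database
conn.execute('CREATE TABLE entries (key1 INTEGER, key2 INTEGER)')
conn.executemany('INSERT INTO entries VALUES (?, ?)',
                 [(int(d['key1']), int(d['key2'])) for d in list])
# COUNT(*) gives the key1 count, SUM(key2) the subset where key2 is also True
key1, key2 = conn.execute(
    'SELECT COUNT(*), SUM(key2) FROM entries WHERE key1 = 1').fetchone()
print "Counts: key1: " + str(key1) + ", subset key2: " + str(key2)

The loading step here is exactly the transfer overhead that would need profiling.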

I have no control over the original form of the data.

The code above is not going for style points.

+7  A: 

I think you're measuring incorrectly by swamping the code to be measured in a lot of overhead (running at module top level instead of in a function, doing output). Putting the two snippets into functions named forloop and withmap, and adding a * 100 to the list's definition (after the closing ]) to make the measurement a little more substantial, I see, on my slow laptop:

$ py26 -mtimeit -s'import co' 'co.forloop()'
10000 loops, best of 3: 202 usec per loop
$ py26 -mtimeit -s'import co' 'co.withmap()'
10 loops, best of 3: 601 usec per loop

i.e., the allegedly "more pythonic" approach with map is three times slower than the plain for approach -- which tells you that it's not really "more pythonic";-).
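For concreteness, the module those timeit commands import (co.py) might look roughly like this; it is a sketch reconstructing the setup described above, with the question's list repeated via * 100 and kept at module level under the name data to avoid shadowing the builtin:

# co.py -- sketch of the measurement setup described above
from operator import itemgetter

data = [
  {'key1': True, 'dontcare': False, 'ignoreme': False, 'key2': True, 'filenotfound': 'biscuits and gravy'},
  {'key1': False, 'dontcare': False, 'ignoreme': False, 'key2': True, 'filenotfound': 'peaches and cream'},
  {'key1': True, 'dontcare': False, 'ignoreme': False, 'key2': False, 'filenotfound': 'Abbott and Costello'},
  {'key1': False, 'dontcare': False, 'ignoreme': True, 'key2': False, 'filenotfound': 'over and under'},
  {'key1': True, 'dontcare': True, 'ignoreme': False, 'key2': True, 'filenotfound': 'Scotch and... well... neat, thanks'}
] * 100

def forloop():
    # the question's explicit-loop version
    key1 = key2 = 0
    for dictionary in data:
        if dictionary["key1"]:
            key1 += 1
            if dictionary["key2"]:
                key2 += 1
    return key1, key2

def withmap():
    # the question's map/itemgetter version
    KEY1, KEY2 = 0, 1
    getentries = itemgetter("key1", "key2")
    entries = map(getentries, data)
    key1 = len([x for x in entries if x[KEY1]])
    key2 = len([x for x in entries if x[KEY1] and x[KEY2]])
    return key1, key2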

The mark of good Python is simplicity, which, to me, recommends what I hubris-ly named...:

def thebest():
    # count entries with key1 True; summing the key2 booleans counts the subset
    entries = [d['key2'] for d in list if d['key1']]
    return len(entries), sum(entries)

which, on measurement, saves between 10% and 20% in time compared with the forloop approach.
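For a quick sanity check against the question's five-entry list (the same list variable), it produces the same counts as the two snippets in the question:

key1, key2 = thebest()
print "Counts: key1: " + str(key1) + ", subset key2: " + str(key2)
# Counts: key1: 3, subset key2: 2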

Alex Martelli
As far as overhead swamping the measurement goes, as I said in my question, it's a "greatly simplified and shortened data structure".
Dennis Williamson
@Dennis, putting code at module top level doesn't simplify things -- it just burdens the code with overhead that distorts the speed. Always keep meaningful code in functions -- whether it be for real work, or for measurement, that will always be best.
Alex Martelli
It was the data that I was referring to as "simplified and shortened".
Dennis Williamson
@Dennis, right, but that doesn't change the "overhead swamping the measurement" issue: top-module-level code pays overhead (wrt the same code in a function) every time it sets or gets a variable, and so in direct proportion to e.g. the number of legs through a loop. Also, I did use a `* 100` on the data, as I mentioned in my A, to get a reasonably sizable chunk of data for measurement, so the "shortening" was hopefully countermanded (I can't say about the "simplifying" -- whether it would or wouldn't affect relative performance depends on exactly what was simplified;-).
Alex Martelli
@Alex: Wow! `for i in range(10000000): a = i` takes 50% longer at top level than in a function. Thanks! Normally I would use functions anyway, but I just posted what I thought was straightforward test code in my question (and hand-waved it as "not for style points"). As they say, "you learn something new every day".
Dennis Williamson
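(A rough way to reproduce that check, as a sketch; the exact numbers, including the 50% figure, will vary by machine and Python version:)

# globals_vs_locals.py -- rough check of the module-level overhead discussed in these comments
import time

t0 = time.time()
for i in range(10000000):    # at module level, i and a are global-name stores/loads
    a = i
print "top level:   %.2f sec" % (time.time() - t0)

def in_function():
    for i in range(10000000):    # inside a function, i and a are fast locals
        a = i

t0 = time.time()
in_function()
print "in function: %.2f sec" % (time.time() - t0)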
@Dennis, you're welcome -- I'm glad we continued the conversation in the comments so you ended up learning this!
Alex Martelli