views:

610

answers:

12

I have an iterable of entries on which I would like to gather some simple statistics, say the count of all numbers divisible by two and the count of all numbers divisible by three.

My first alternative, While only iterating through the list once and avoiding the list expansion (and keeping the split loop refactoring in mind), looks rather bloated:

(alt 1)

r = xrange(1, 10)

twos = 0
threes = 0

for v in r:
  if v % 2 == 0:
    twos+=1
  if v % 3 == 0:
    threes+=1

print twos
print threes

This looks rather nice, but has the drawback of expanding the expression to a list:

(alt 2)

r = xrange(1, 10)

print len([1 for v in r if v % 2 == 0])
print len([1 for v in r if v % 3 == 0])

What I would really like is something like a function like this:

(alt 3)

def count(iterable):
  n = 0
  for i in iterable:
    n += 1
  return n

r = xrange(1, 10)

print count(1 for v in r if v % 2 == 0)
print count(1 for v in r if v % 3 == 0)

But this looks a lot like something that could be done without a function. The final variant is this:

(alt 4)

r = xrange(1, 10)

print sum(1 for v in r if v % 2 == 0)
print sum(1 for v in r if v % 3 == 0)

and while the smallest (and in my book probably the most elegant) it doesn't feel like it expresses the intent very well.

So, my question to you is:

Which alternative do you like best to gather these types of stats? Feel free to supply your own alternative if you have something better.

To clear up some confusion below:

  • In reality my filter predicates are more complex than just this simple test.
  • The objects I iterate over are larger and more complex than just numbers
  • My filter functions are more different and hard to parameterize into one predicate
+1  A: 

You could use the filter function.

It filters a list (or strictly an iterable) producing a new list containing only the items for which the specified function evaluates to true.

r = xrange(1, 10)

def is_div_two(n):
    return n % 2 == 0

def is_div_three(n):
    return n % 3 == 0

print len(filter(is_div_two,r))
print len(filter(is_div_three,r))

This is good as it allows you keep your statistics logic contained in a function and the intent of the filter should be pretty clear.

Dave Webb
won't the second print be consuming a used up iterator and therefore will print 0?
John Montgomery
A: 

I would definitely be looking at a numpy array instead of an iterable list if you just have numbers. You will almost certainly be able to do what you want with some terse arithmetic on the array.

Simon
Unfortunately it's actually a long iterable of rather large objects; the number things are just for easy reading :)
Henrik Gustafsson
+1  A: 

Well you could do one list comprehension/expression to get a set of tuples with that stat test in them and then reduce that down to get the sums.


r=xrange(10)
s=( (v % 2 == 0, v % 3 == 0) for v in r )
def add_tuples(t1,t2):
     return tuple(x+y for x,y in zip(t1, t2))
sums=reduce(add_tuples, s, (0,0)) # (0,0) is starting amount

print sums[0] # sum of numbers divisible by 2
print sums[1] # sum of numbers divisible by 3

Using generator expression etc should mean you'll only run through the iterator once (unless reduce does anything odd?). Basically you'd be doing map/reduce...

John Montgomery
Hah! I knew there was a way of using reduce for this :)
Henrik Gustafsson
A: 

Not as terse as you are looking for, but more efficient, it actually works with any iterable, not just iterables you can loop over multiple times, and you can expand the things to check for without complicating it further:

r = xrange(1, 10)

counts = {
   2: 0,
   3: 0,
}

for v in r:
    for q in counts:
        if not v % q:
            counts[q] += 1
        # Or, more obscure:
        #counts[q] += not v % q

for q in counts:
    print "%s's: %s" % (q, counts[q])
Thomas Wouters
+3  A: 

Alt 4! But maybe you should refactor the code to a function that takes an argument which should contain the divisible number (two and three). And then you could have a better functionname.

def methodName(divNumber, r):
  return sum(1 for v in r if v % divNumber == 0)


print methodName(2, xrange(1, 10))
print methodName(3, xrange(1, 10))
The 'real' tests are unfortunately slightly more different than that. Parameterizing them would only give me a headache :)
Henrik Gustafsson
+7  A: 

Having to iterate over the list multiple times isn't elegant IMHO.

I'd probably create a function that allows doing:

twos, threes = countmatching(xrange(1,10),
                             lambda a: a % 2 == 0,
                             lambda a: a % 3 == 0)

A starting point would be something like this:

def countmatching(iterable, *predicates):
    v = [0] * len(predicates)
    for e in iterable:
        for i,p in enumerate(predicates):
            if p(e):
                v[i] += 1
    return tuple(v)

Btw, "itertools recipes" has a recipe for doing much like your alt4.

def quantify(seq, pred=None):
    "Count how many times the predicate is true in the sequence"
    return sum(imap(pred, seq))
Anders Waldenborg
Re. iterating twice, it has some things going for it wrt clarity, but other than that, won't the decreased overhead from iterating once be eaten up by the amount of non-C-code executed?
Henrik Gustafsson
Of course, if I only iterate once it works on one-shot iterables too, duh :)Didn't think that far.
Henrik Gustafsson
I like your solution. But you say: "having to iterate over the list multiple times isn't elegant". If there is n values and m predicates, you iterate n times on the m predicates, so what's the advantage over iterating m times on n values? Note: I'm a Python newbie. Cheers!
Sébastien RoccaSerra
I think the reason for iterating once is that not all iterables can be iterated over several times. Mine can, fortunately :)
Henrik Gustafsson
A: 
from itertools import groupby
from collections import defaultdict

def multiples(v):
    return 2 if v%2==0 else 3 if v%3==0 else None
d = defaultdict(list)

for k, values in groupby(range(10), multiples):
    if k is not None:
        d[k].extend(values)
ironfroggy
Cool solution, although the stats are not updated correctly when an item is divisible by both two and three.
Henrik Gustafsson
A: 

The idea here is to use reduction to avoid repeated iterations. Also, this does not create any extra data structures, if memory is an issue for you. You start with a dictionary with your counters ({'div2': 0, 'div3': 0}) and increment them along the iteration.

def increment_stats(stats, n):
    if n % 2 == 0: stats['div2'] += 1
    if n % 3 == 0: stats['div3'] += 1
    return stats

r = xrange(1, 10)
stats = reduce(increment_stats, r, {'div2': 0, 'div3': 0})
print stats

If you want to count anything more complicated than divisors, it would be appropriate to use a more object-oriented approach (with the same advantages), encapsulating the logic for stats extraction.

class Stats:

    def __init__(self, div2=0, div3=0):
        self.div2 = div2
        self.div3 = div3

    def increment(self, n):
        if n % 2 == 0: self.div2 += 1
        if n % 3 == 0: self.div3 += 1
        return self

    def __repr__(self):
        return 'Stats(%d, %d)' % (self.div2, self.div3)

r = xrange(1, 10)
stats = reduce(lambda stats, n: stats.increment(n), r, Stats())
print stats

Please point out any mistakes.

@Henrik: I think the first approach is less maintainable since you have to control initialization of the dictionary in one place and update in another, as well as having to use strings to refer to each stat (instead of having attributes). And I do not think OO is overkill in this case, for you said the predicates and objects will be complex in your application. In fact if the predicates were really simple, I wouldn't even bother to use a dictionary, a single fixed size list would be just fine. Cheers :)

seuvitor
Strange and intriguing use of reduce :)And yes, for more complex scenarios a slightly more OO approach would be preferred, but I don't quite see how your version scales better (maintenance/reuse-wise) than the original ones.
Henrik Gustafsson
A: 

Inspired by the OO-stab above, I had to try my hands on one as well (although this is way overkill for the problem I'm trying to solve :)

class Stat(object):
  def update(self, n):
    raise NotImplementedError

  def get(self):
    raise NotImplementedError


class TwoStat(Stat):
  def __init__(self):
    self._twos = 0

  def update(self, n):
    if n % 2 == 0: self._twos += 1

  def get(self):
    return self._twos


class ThreeStat(Stat):
  def __init__(self):
    self._threes = 0

  def update(self, n):
    if n % 3 == 0: self._threes += 1

  def get(self):
    return self._threes


class StatCalculator(object):
  def __init__(self, stats):
    self._stats = stats

  def calculate(self, r):
    for v in r:
      for stat in self._stats:
        stat.update(v)
    return tuple(stat.get() for stat in self._stats)


s = StatCalculator([TwoStat(), ThreeStat()])

r = xrange(1, 10)
print s.calculate(r)
Henrik Gustafsson
+1  A: 

True booleans are coerced to unit integers, and false booleans to zero integers. So if you're happy to use scipy or numpy, make an array of integers for each element of your sequence, each array containing one element for each of your tests, and sum over the arrays. E.g.

>>> sum(scipy.array([c % 2 == 0, c % 3 == 0]) for c in xrange(10))
array([5, 4])
fivebells
A: 

Alt 3, for the reason that it doesn't use memory proportional to the number of "hits". Given a pathological case like xrange(one_trillion), many of the other offered solutions would fail badly.

Just Some Guy
I think alt 4 has the same properties
Henrik Gustafsson
+1  A: 

I would choose a small variant of your (alt 4):

def count(predicate, list):
    print sum(1 for x in list if predicate(x))

r = xrange(1, 10)

count(lambda x: x % 2 == 0, r)
count(lambda x: x % 3 == 0, r)
# ...

If you want to change what count does, change its implementation in one place.

Note: since your predicates are complex, you'll probably want to define them in functions instead of lambdas. And so you'll probably want to put all this in a class rather than the global namespace.

Sébastien RoccaSerra
Changing what count does wouldn't be very common, but creating a function named count helps showing the intent in a nice way.Re. your note; certainly, but that would be outside the scope of the question :)
Henrik Gustafsson