ansaurus

Question

Most pythonic way of counting matching elements in something iterable

Answer 1

+1 A:

You could use the filter function.

It filters a list (or strictly an iterable) producing a new list containing only the items for which the specified function evaluates to true.

r = xrange(1, 10)

def is_div_two(n):
    return n % 2 == 0

def is_div_three(n):
    return n % 3 == 0

print len(filter(is_div_two,r))
print len(filter(is_div_three,r))

This is good as it allows you keep your statistics logic contained in a function and the intent of the filter should be pretty clear.

Dave Webb 2008-10-01 10:51:36

won't the second print be consuming a used up iterator and therefore will print 0?

John Montgomery 2008-10-01 11:14:27

Answer 2

A:

I would definitely be looking at a numpy array instead of an iterable list if you just have numbers. You will almost certainly be able to do what you want with some terse arithmetic on the array.

Simon 2008-10-01 10:53:12

Unfortunately it's actually a long iterable of rather large objects; the number things are just for easy reading :)

Henrik Gustafsson 2008-10-01 10:54:37

Answer 3

+1 A:

Well you could do one list comprehension/expression to get a set of tuples with that stat test in them and then reduce that down to get the sums.


r=xrange(10)
s=( (v % 2 == 0, v % 3 == 0) for v in r )
def add_tuples(t1,t2):
     return tuple(x+y for x,y in zip(t1, t2))
sums=reduce(add_tuples, s, (0,0)) # (0,0) is starting amount

print sums[0] # sum of numbers divisible by 2
print sums[1] # sum of numbers divisible by 3

Using generator expression etc should mean you'll only run through the iterator once (unless reduce does anything odd?). Basically you'd be doing map/reduce...

John Montgomery 2008-10-01 10:53:51

Hah! I knew there was a way of using reduce for this :)

Henrik Gustafsson 2008-10-01 12:15:04

Answer 4

A:

Not as terse as you are looking for, but more efficient, it actually works with any iterable, not just iterables you can loop over multiple times, and you can expand the things to check for without complicating it further:

r = xrange(1, 10)

counts = {
   2: 0,
   3: 0,
}

for v in r:
    for q in counts:
        if not v % q:
            counts[q] += 1
        # Or, more obscure:
        #counts[q] += not v % q

for q in counts:
    print "%s's: %s" % (q, counts[q])

Thomas Wouters 2008-10-01 10:56:46

Answer 5

+3 A:

Alt 4! But maybe you should refactor the code to a function that takes an argument which should contain the divisible number (two and three). And then you could have a better functionname.

def methodName(divNumber, r):
  return sum(1 for v in r if v % divNumber == 0)


print methodName(2, xrange(1, 10))
print methodName(3, xrange(1, 10))

2008-10-01 11:04:46

The 'real' tests are unfortunately slightly more different than that. Parameterizing them would only give me a headache :)

Henrik Gustafsson 2008-10-01 12:14:14

Answer 6

+7 A:

Having to iterate over the list multiple times isn't elegant IMHO.

I'd probably create a function that allows doing:

twos, threes = countmatching(xrange(1,10),
                             lambda a: a % 2 == 0,
                             lambda a: a % 3 == 0)

A starting point would be something like this:

def countmatching(iterable, *predicates):
    v = [0] * len(predicates)
    for e in iterable:
        for i,p in enumerate(predicates):
            if p(e):
                v[i] += 1
    return tuple(v)

Btw, "itertools recipes" has a recipe for doing much like your alt4.

def quantify(seq, pred=None):
    "Count how many times the predicate is true in the sequence"
    return sum(imap(pred, seq))

Anders Waldenborg 2008-10-01 11:09:14

Re. iterating twice, it has some things going for it wrt clarity, but other than that, won't the decreased overhead from iterating once be eaten up by the amount of non-C-code executed?

Henrik Gustafsson 2008-10-01 12:13:26

Of course, if I only iterate once it works on one-shot iterables too, duh :)Didn't think that far.

Henrik Gustafsson 2008-10-01 15:30:00

I like your solution. But you say: "having to iterate over the list multiple times isn't elegant". If there is n values and m predicates, you iterate n times on the m predicates, so what's the advantage over iterating m times on n values? Note: I'm a Python newbie. Cheers!

Sébastien RoccaSerra 2008-10-02 15:38:32

I think the reason for iterating once is that not all iterables can be iterated over several times. Mine can, fortunately :)

Henrik Gustafsson 2008-10-02 16:51:05

Answer 7

A:

from itertools import groupby
from collections import defaultdict

def multiples(v):
    return 2 if v%2==0 else 3 if v%3==0 else None
d = defaultdict(list)

for k, values in groupby(range(10), multiples):
    if k is not None:
        d[k].extend(values)

ironfroggy 2008-10-01 11:26:34

Cool solution, although the stats are not updated correctly when an item is divisible by both two and three.

Henrik Gustafsson 2008-10-01 15:26:24

Answer 8

A:

The idea here is to use reduction to avoid repeated iterations. Also, this does not create any extra data structures, if memory is an issue for you. You start with a dictionary with your counters ({'div2': 0, 'div3': 0}) and increment them along the iteration.

def increment_stats(stats, n):
    if n % 2 == 0: stats['div2'] += 1
    if n % 3 == 0: stats['div3'] += 1
    return stats

r = xrange(1, 10)
stats = reduce(increment_stats, r, {'div2': 0, 'div3': 0})
print stats

If you want to count anything more complicated than divisors, it would be appropriate to use a more object-oriented approach (with the same advantages), encapsulating the logic for stats extraction.

class Stats:

    def __init__(self, div2=0, div3=0):
        self.div2 = div2
        self.div3 = div3

    def increment(self, n):
        if n % 2 == 0: self.div2 += 1
        if n % 3 == 0: self.div3 += 1
        return self

    def __repr__(self):
        return 'Stats(%d, %d)' % (self.div2, self.div3)

r = xrange(1, 10)
stats = reduce(lambda stats, n: stats.increment(n), r, Stats())
print stats

Please point out any mistakes.

@Henrik: I think the first approach is less maintainable since you have to control initialization of the dictionary in one place and update in another, as well as having to use strings to refer to each stat (instead of having attributes). And I do not think OO is overkill in this case, for you said the predicates and objects will be complex in your application. In fact if the predicates were really simple, I wouldn't even bother to use a dictionary, a single fixed size list would be just fine. Cheers :)

seuvitor 2008-10-01 13:32:41

Strange and intriguing use of reduce :)And yes, for more complex scenarios a slightly more OO approach would be preferred, but I don't quite see how your version scales better (maintenance/reuse-wise) than the original ones.

Henrik Gustafsson 2008-10-01 15:18:28

Answer 9

A:

Inspired by the OO-stab above, I had to try my hands on one as well (although this is way overkill for the problem I'm trying to solve :)

class Stat(object):
  def update(self, n):
    raise NotImplementedError

  def get(self):
    raise NotImplementedError


class TwoStat(Stat):
  def __init__(self):
    self._twos = 0

  def update(self, n):
    if n % 2 == 0: self._twos += 1

  def get(self):
    return self._twos


class ThreeStat(Stat):
  def __init__(self):
    self._threes = 0

  def update(self, n):
    if n % 3 == 0: self._threes += 1

  def get(self):
    return self._threes


class StatCalculator(object):
  def __init__(self, stats):
    self._stats = stats

  def calculate(self, r):
    for v in r:
      for stat in self._stats:
        stat.update(v)
    return tuple(stat.get() for stat in self._stats)


s = StatCalculator([TwoStat(), ThreeStat()])

r = xrange(1, 10)
print s.calculate(r)

Henrik Gustafsson 2008-10-01 15:35:20

Answer 10

+1 A:

True booleans are coerced to unit integers, and false booleans to zero integers. So if you're happy to use scipy or numpy, make an array of integers for each element of your sequence, each array containing one element for each of your tests, and sum over the arrays. E.g.

>>> sum(scipy.array([c % 2 == 0, c % 3 == 0]) for c in xrange(10))
array([5, 4])

fivebells 2008-10-01 16:47:13

Answer 11

A:

Alt 3, for the reason that it doesn't use memory proportional to the number of "hits". Given a pathological case like xrange(one_trillion), many of the other offered solutions would fail badly.

Just Some Guy 2008-10-01 16:59:00

I think alt 4 has the same properties

Henrik Gustafsson 2008-10-01 17:27:15

Answer 12

+1 A:

I would choose a small variant of your (alt 4):

def count(predicate, list):
    print sum(1 for x in list if predicate(x))

r = xrange(1, 10)

count(lambda x: x % 2 == 0, r)
count(lambda x: x % 3 == 0, r)
# ...

If you want to change what count does, change its implementation in one place.

Note: since your predicates are complex, you'll probably want to define them in functions instead of lambdas. And so you'll probably want to put all this in a class rather than the global namespace.

Sébastien RoccaSerra 2008-10-02 16:16:40

Changing what count does wouldn't be very common, but creating a function named count helps showing the intent in a nice way.Re. your note; certainly, but that would be outside the scope of the question :)

Henrik Gustafsson 2008-10-02 16:49:29

ansaurus

tags:

views:

answers:

Most pythonic way of counting matching elements in something iterable

related questions