views:

333

answers:

4

I think list comprehensions may give me this, but I'm not sure: any elegant solutions in Python (2.6) in general for selecting unique objects in a list and providing a count?

(I've defined an __eq__ to define uniqueness on my object definition).

So in RDBMS-land, something like this:

CREATE TABLE x(n NUMBER(1));
INSERT INTO x VALUES(1);
INSERT INTO x VALUES(1);
INSERT INTO x VALUES(1);
INSERT INTO x VALUES(2);

SELECT COUNT(*), n FROM x
GROUP BY n;

Which gives:

COUNT(*) n
==========
3        1
1        2

So , here's my equivalent list in Python:

[1,1,1,2]

And I want the same output as the SQL SELECT gives above.

EDIT: The example I gave here was simplified, I'm actually processing lists of user-defined object-instances: just for completeness I include the extra code I needed to get the whole thing to work:

import hashlib

def __hash__(self):
    md5=hashlib.md5()
    [md5.update(i) for i in self.my_list_of_stuff]
    return int(md5.hexdigest(),16)

The __hash__ method was needed to get the set conversion to work (I opted for the list-comprehension idea that works in 2.6 [despite the fact that I learnt that involves an inefficiency (see comments) - my data set is small enough for that not be an issue]). my_list_of_stuff above is a list of (Strings) on my object definition.

+5  A: 

Not easily doable as a list comprehension.

from collections import defaultdict
def group_by( someList ):
    counts = defaultdict(int)
    for value in someList:
        counts[value.aKey] += 1
    return counts

This is a very Pythonic solution. But not a list comprehension.

S.Lott
why the downvote?
Adam Bernier
This way works, even if it's not a list comprehension at all.
Broam
Very 'awk' like solution as it goes as well!Small problem when running it on my system:...counts = defaultdict(int)NameError: global name 'defaultdict' is not defined...( Python 2.6.2 (r262:71605, Apr 14 2009, 22:40:02) [MSC v.1500 32 bit (Intel)] onwin32 )
monojohnny
@monojohnny: `>>> from collections import defaultdict`
SilentGhost
@Adam: Poster asked for list comprehension method, as you provided. However, even without that, the most "Pythonic" solution would be to use the built-in Counter class rather than rolling out your own.
BlueRaja - Danny Pflughoeft
@BlueRaja: Counter is not available in py2.6, and OP is clearly misunderstands the list comprehensions. Because accepted solution has a piss-poor performance.
SilentGhost
@SilentGhost; thanks I should have worked that out really !! Doh!@BlueRaja - I have clarified the title - I was aiming for List Comprehension (cos they look nice), but was accepting other ways as well.
monojohnny
@BlueRaja: I agree; the built-in solution is always preferable.
Adam Bernier
@SilentGhosht: re:"OP is clearly misunderstands the list comprehensions" - in fact as far OP is concerned I think they'd be better being referred to as 'list-UNcomprehensions'..yuk yuk :)
monojohnny
@monojohnny: they come directly from the set theory http://en.wikipedia.org/wiki/List_comprehension#Overview
SilentGhost
Yeah, understood - was just trying a bit of self-deprecating humour :-)Actually (I commented above on this also) - 'piss poor performance' might be a bit strong : I believe (somebody correct me if I'm wrong), that the accepted answer is 2*n as opposed to the optimum 1*n - not such a big deal: especially since I read that list-comps are faster in general than python loops.
monojohnny
+6  A: 

Lennart Regebro provided a nice one-liner that does what you want:

>>> values = [1,1,1,2]
>>> print [(x,values.count(x)) for x in set(values)]
[(1, 3), (2, 1)]

As S.Lott mentions, a defaultdict can do the same thing.

Adam Bernier
nice...cheers - works fine on my 2.6 install.
monojohnny
The defaultdict solution will do a lot better on larger `values` lists, because it will only loop the values once, instead of once plus once for each unique value.
Thomas Wouters
@Thomas: thanks for adding that.
Adam Bernier
Actually (although very elegant code: so the tick-stays) , this is a little more tricky for instances of my user-defined objects, as it appears I have to define a '__hash__' method that returns an 'int' , which (in my particular case) is non-trivial (its a big lump of text) : I might try an MD5 checksum or something(suggestions?), but for now, I think I'll go with the (not-unelegant) 'def group_by(someList)' given by S.Lott.
monojohnny
@Thomas - understood, good point; in my case I have a few hundred items at most.
monojohnny
A few hundred is quite a few :) A few hundred loops through a few hundred items will take a few hundred times longer than one loop through a few hundred items.
Thomas Wouters
I appreciate everybody's concern about performance - but I'm not designing enterprise software here - I'm running a text filter against a few meg of logfiles.At is happens , it runs just great for my needs. I might swap out the algorithm at some point, if the script has more than a one-off use.In fact, the script completes in about a second, including the additional overhead of text-processing to generate my objects and to format the final report.I believe off the top of my head, that the algorithm is 2*n, as opposed to 1*n: so it should scale in a linear fashion as it happens.
monojohnny
+8  A: 
>>> from collections import Counter
>>> Counter([1,1,1,2])
Counter({1: 3, 2: 1})

Counter only available in py3.1, inherits from the dict.

SilentGhost
Why the downvote?
SilentGhost
+1 for the easiest way to do this in 3.x
Adam Bernier
+4  A: 

You can use groupby from the itertools module:

Make an iterator that returns consecutive keys and groups from the iterable. The key is a function computing a key value for each element. If not specified or is None, key defaults to an identity function and returns the element unchanged. Generally, the iterable needs to already be sorted on the same key function.

>>> a = [1,1,1,2]
>>> [(len(list(v)), key) for (key, v) in itertools.groupby(sorted(a))]
[(3, 1), (1, 2)]

I would assume its runtime is worse than the dict-based solutions by SilentGhost or S.Lott since it has to sort the input sequence, but you should time that yourself. It is a list comprehension, though. It should be faster than Adam Bernier's solution, since it doesn't have to do repeated linear scans of the input sequence. If needed, the sorted call can be avoided by sorting the input sequence in-line.

Torsten Marek
+1 for the effort, and I agree that either S.Lott or SilentGhost's solutions are superior.
Adam Bernier
Don't agree before you've profiled them;)
Torsten Marek
@Torsten: good point :-) To clarify, I agree based on the fact that those are built-in solutions; already tested and debugged for us.
Adam Bernier
@Adam: Sure, for 3.1, Counter is the way to go.
Torsten Marek