ansaurus

Question

filtering lists in python

Answer 1

+9 A:

Cast foo to a set, if you don't care about element order.

fatcat1111 2009-10-20 18:14:26

sri has beaten you by 2 seconds :p

Hellnar 2009-10-20 18:15:55

@Hellnar: that's not true, fatcat1111 was faster by 1 second

SilentGhost 2009-10-20 18:16:32

Yep, I think fatcat1111 was a second faster than me, so you should accept the above answer if it helps you as much as mine :-)

sri 2009-10-20 18:25:53

@sri - Your answer offers more information - it shows a code example (however brief) and mentions the 2.5+ requirement.

Chris Lutz 2009-10-21 00:37:45

There is no "cast". Perhaps you mean "coerce".

pst 2009-10-21 04:41:30

@pst - Thanks, I've never been completely clear on the difference there. IIRC, values (not symbols/variables) have types, and are cast; one coerces variables. Or is it the other way around? Looks like I have a SO question here!

fatcat1111 2009-10-21 16:44:05

Answer 2

+17 A:

list(set(foo)) if you are using Python 2.5 or greater, but that doesn't maintain order.

sri 2009-10-20 18:14:27
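
As a small illustration of the set-based approach in the two answers above: this is only a sketch, assuming Python 3, and the sample list `foo` is borrowed from a later answer in this thread because the question's own data is not shown here. Converting to a set drops duplicates but makes no promise about the order of the result:

```python
# Sample data borrowed from a later answer in this thread (an assumption,
# not the asker's actual input).
foo = ['a', 'b', 'c', 'a', 'b', 'd', 'a', 'd']

# set() keeps one copy of each hashable element; list() turns it back into a list.
deduped = list(set(foo))

# The order of `deduped` is arbitrary, so inspect it without relying on order.
print(len(deduped))     # 4
print(sorted(deduped))  # ['a', 'b', 'c', 'd']
```

Note that this only works when every element is hashable; a list of lists, for example, would raise a TypeError.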

Answer 3

+2 A:

If you care about order, a readable way is the following:

def filter_unique(a_list):
    characters = set()
    result = []
    for c in a_list:
        if c not in characters:
            characters.add(c)
            result.append(c)
    return result

Depending on your requirements for speed, maintainability, and space consumption, you may find the above unsuitable. In that case, specify your requirements and we can try to do better :-)

Francesco 2009-10-20 18:21:41

+1 Your answer inspired me to create a class that allows using the built-in `filter()` to do the same thing. So thanks for the inspiration.

Chris Lutz 2009-10-21 01:01:14

@Chris: My pleasure :-) I thought that using filter would have been slightly more advanced, so I went for a very simple solution. If you like filter, consider using the (excellent) itertools module, and in particular itertools.ifilter and itertools.ifilterfalse.

Francesco 2009-10-21 07:49:39
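
For readers who want to try the itertools suggestion in the comment above, here is one possible sketch. It assumes Python 3, where `itertools.ifilterfalse` is spelled `itertools.filterfalse` (and the built-in `filter` already behaves like `ifilter`). The helper name `unique_in_order` is made up for this example and is not from the thread:

```python
from itertools import filterfalse  # named ifilterfalse in Python 2


def unique_in_order(iterable):
    """Yield each element the first time it appears (hypothetical helper)."""
    seen = set()
    # filterfalse lets through only the elements for which the predicate is
    # false, i.e. those that are not yet in `seen`. It is lazy, so each
    # membership check sees the additions made for earlier elements.
    for item in filterfalse(seen.__contains__, iterable):
        seen.add(item)
        yield item


print(list(unique_in_order(['a', 'b', 'c', 'a', 'b', 'd', 'a', 'd'])))
# ['a', 'b', 'c', 'd']
```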

Answer 4

+2 A:

>>> bar = []
>>> for i in foo:
...     if i not in bar:
...         bar.append(i)
...
>>> bar
['a', 'b', 'c', 'd']

This would be the most straightforward way of removing duplicates from the list while preserving the order as much as possible (even though "order" here is an inherently wrong concept).

SilentGhost 2009-10-20 18:29:51

Should at least mention the O(n^2) performance characteristic, no?

Triptych 2009-10-20 18:33:21

it's `O(n*k)`, isn't it?

SilentGhost 2009-10-20 18:40:00

@SilentGhost - What's `k` there? Is it a constant (i.e. not part of Big-O notation) or is it some other factor?

Chris Lutz 2009-10-21 00:36:30

Clearly O(n**2)

hughdbrown 2009-10-21 04:07:32

I don't think it's O(n^2). It may be O(not very good), but it's not _that_ bad.

Chris Lutz 2009-10-21 04:49:12

@Chris - I believe Triptych and hughdbrown are pointing out the append operation on bar is likely not constant time (a quick but unmotivated docs.python.org check didn't find much). And I do think SilentGhost meant a constant.

pbh101 2009-10-21 04:50:50

Never mind. I totally missed the membership test in the if statement the first seven or so times I read the function. Yep, O(n^2).

pbh101 2009-10-21 04:55:24

what pbh said. Membership test in a list is O(n). That * the loop = O(n^2). This page is helpful: http://wiki.python.org/moin/TimeComplexity

Triptych 2009-10-22 15:18:26

What I was referring to is that `bar` and `foo` have different sizes. It would be `n**2` only if `foo` doesn't have any duplicates; it's better otherwise.

SilentGhost 2009-10-22 15:36:19
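
To make the complexity discussion above concrete, here is a rough sketch (assuming Python 3) that re-implements the list-based loop with an explicit scan, so the number of equality comparisons can be counted. The function name `dedupe_with_list` and the step counter are inventions for this illustration. Counting comparisons is only a proxy for running time, but it shows both points raised in the comments: the cost is quadratic when there are no duplicates, and much lower when the number of distinct items `k` is small.

```python
def dedupe_with_list(items):
    """Order-preserving de-duplication using a list, counting comparisons."""
    result = []
    steps = 0
    for item in items:
        # This inner loop spells out what `item in result` does: a linear
        # scan that stops at the first match.
        found = False
        for seen in result:
            steps += 1
            if seen == item:
                found = True
                break
        if not found:
            result.append(item)
    return result, steps


# No duplicates: item number i is compared against all i earlier items.
_, steps_distinct = dedupe_with_list(list(range(100)))
print(steps_distinct)   # 4950, i.e. n*(n-1)/2 for n = 100

# Only one distinct value: every item after the first needs one comparison.
_, steps_repeated = dedupe_with_list([0] * 100)
print(steps_repeated)   # 99
```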

Answer 5

+1 A:

If you write a function to do this, I would use a generator; it just wants to be used in this case.

def unique(iterable):
    yielded = set()
    for item in iterable:
        if item not in yielded:
            yield item
            yielded.add(item)

DasIch 2009-10-21 00:33:40
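
A possible usage sketch for the generator above, assuming Python 3. The function is repeated so the example runs on its own; the sample data and the endless stream are assumptions made for illustration. Because the generator is lazy, it can be wrapped in `list()` when a list is needed, or consumed partially, even from an unbounded source:

```python
from itertools import count, islice


def unique(iterable):
    # Same logic as the generator in the answer above.
    yielded = set()
    for item in iterable:
        if item not in yielded:
            yield item
            yielded.add(item)


# Materialize the whole result as a list.
print(list(unique(['a', 'b', 'c', 'a', 'b', 'd', 'a', 'd'])))
# ['a', 'b', 'c', 'd']

# Take only the first three unique values from an endless stream
# 0, 1, 2, 3, 4, 0, 1, ... without ever exhausting it.
endless = (n % 5 for n in count())
print(list(islice(unique(endless), 3)))
# [0, 1, 2]
```

Asking for more unique values than the stream contains (six or more here) would never return, so a bound like this only makes sense when enough distinct items are known to exist.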

Answer 6

+2 A:

Since there isn't an order-preserving answer with a list comprehension, I propose the following:

>>> temp = set()
>>> [c for c in foo if c not in temp and (temp.add(c) or True)]
['a', 'b', 'c', 'd']

which could also be written as

>>> temp = set()
>>> filter(lambda c: c not in temp and (temp.add(c) or True), foo)
['a', 'b', 'c', 'd']

Depending on how many elements are in foo, you might have faster results through repeated hash lookups instead of repeated iterative searches through a temporary list.

c not in temp verifies that temp does not have an item c; and because temp.add(c) returns None, the or True part forces c to be emitted to the output list when the item is added to the set.

Mark Rushakoff 2009-10-21 00:47:04

Why store items that have been found already in a hash of `None`s instead of a `set`?

Chris Lutz 2009-10-21 00:57:39

Because I didn't think that one through all the way.

Mark Rushakoff 2009-10-21 01:12:39
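
The `or True` trick in the answer above can be unpacked step by step. This is only an explanatory sketch, assuming Python 3 and the same sample `foo` used elsewhere in the thread:

```python
foo = ['a', 'b', 'c', 'a', 'b', 'd', 'a', 'd']

temp = set()

# set.add() mutates the set in place and returns None, which is falsy.
assert temp.add('z') is None
temp.clear()

# For a new item: `c not in temp` is True, then `(None or True)` is True,
# so c is kept and has been recorded in temp as a side effect.
# For a repeat: `c not in temp` is False, and short-circuiting of `and`
# means temp.add(c) is never called.
result = [c for c in foo if c not in temp and (temp.add(c) or True)]
print(result)  # ['a', 'b', 'c', 'd']
```

Relying on a side effect inside a comprehension is compact but easy to misread, which is presumably why other answers here prefer an explicit loop or a generator.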

Answer 7

+1 A:

Inspired by Francesco's answer, rather than making our own filter()-type function, let's make the builtin do some work for us:

def unique(a, s=set()):
    if a not in s:
        s.add(a)
        return True
    return False

Usage:

uniq = filter(unique, orig)

This may or may not perform faster or slower than an answer that implements all of the work in pure Python. Benchmark and see. Of course, this only works once, but it demonstrates the concept. The ideal solution is, of course, to use a class:

class Unique(set):
    def __call__(self, a):
        if a not in self:
            self.add(a)
            return True
        return False

Now we can use it as much as we want:

uniq = filter(Unique(), orig)

Once again, we may (or may not) have thrown performance out the window - the gains of using a built-in function may be offset by the overhead of a class. I just thought it was an interesting idea.

Chris Lutz 2009-10-21 00:55:38

What happens if you run this twice? `uniq = filter(Unique(), range(10)); print uniq`

hughdbrown 2009-10-21 05:40:49

Actually, I meant if you run this twice: `uniq = filter(unique, range(10)); print uniq`

hughdbrown 2009-10-21 05:56:51

The `unique` version only works once. Running it a second time on the same data will produce no data, because the function only has one set (the second argument). Running it twice on different data can produce unexpected results, as it will weed out the overlap of the two data sets as well as the duplicates of the second set. The function was my first version, and its limitations led me to create the class version, which suffers no such problems (and is also more generally useful). The function version was shown as a thought-process thing, nothing more.

Chris Lutz 2009-10-21 06:10:35
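
The difference described in the comment above can be reproduced with a short sketch. It assumes Python 3, where `filter()` returns an iterator, hence the `list()` calls; the sample `data` is made up for illustration. The function and class are copied from the answer:

```python
def unique(a, s=set()):
    # The default set is created once, when the function is defined,
    # and is then shared by every call that omits `s`.
    if a not in s:
        s.add(a)
        return True
    return False


class Unique(set):
    # Each instance is its own set, so state is not shared between uses.
    def __call__(self, a):
        if a not in self:
            self.add(a)
            return True
        return False


data = [1, 2, 2, 3]

first = list(filter(unique, data))
second = list(filter(unique, data))   # the shared set already holds 1, 2, 3
print(first, second)                  # [1, 2, 3] []

fresh_a = list(filter(Unique(), data))
fresh_b = list(filter(Unique(), data))
print(fresh_a, fresh_b)               # [1, 2, 3] [1, 2, 3]
```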

Answer 8

+1 A:

This is what you want if you need a sorted list at the end:

>>> foo = ['a','b','c','a','b','d','a','d']
>>> bar = sorted(set(foo))
>>> bar
['a', 'b', 'c', 'd']

hughdbrown 2009-10-21 04:06:16

list comprehension is redundant. could just say: bar = sorted(set(foo))

recursive 2009-10-21 04:20:34

Nice answer -- answers the question directly (although not entirely honestly IMOHO) which has output in natural ordering. +1.

pst 2009-10-21 04:46:51

@pst: Not entirely honestly? Like I am trying to pull something over on you? Or...what? I don't get where the OP asked for a stable sort, so I just sorted them because it looked like that *was* a requirement. @recursive: good call. I'll edit that.

hughdbrown 2009-10-21 05:32:06
