views:

228

answers:

8

hello, I want to filter repeated elements in my list for instance

foo = ['a','b','c','a','b','d','a','d']

I am only interested with:

['a','b','c','d']

What would be the efficient way to do achieve this ? Cheers

+9  A: 

Cast foo to a set, if you don't care about element order.

fatcat1111
sri has beaten you by 2 seconds :p
Hellnar
@Helinar: that's not true, fatcat1111 was faster by 1 second
SilentGhost
yep, i think fatcat1111 was a second faster than me, so you should accept the above answer if helps you equally as mine :-)
sri
@sri - Your answer offers more information - it shows a code example (however brief) and mentions the 2.5+ requirement.
Chris Lutz
There is no "cast". Perhaps you mean "coerce".
pst
@pst - Thanks, I've never been completely clear on the difference there. IIRC, values (not symbols/variables) have types, and are cast; one coerces variables. Or is it the other way around? Looks like I have a SO question here!
fatcat1111
+17  A: 

list(set(foo)) if you are using Python 2.5 or greater, but that doesn't maintain order.

sri
+2  A: 

If you care about order a readable way is the following

def filter_unique(a_list):
    characters = set()
    result = []
    for c in a_list:
        if not c in characters:
            characters.add(c)
            result.append(c)
    return result

Depending on your requirements of speed, maintanability, space consumption, you could find the above unfitting. In that case, specify your requirements and we can try to do better :-)

Francesco
+1 Your answer inspired me to create a class that allows using the built-in `filter()` to do the same thing. So thanks for the inspiration.
Chris Lutz
@Chris: My pleasure :-)I thought that using filter would have been slightly more advanced and so went for a very simple solution. If you like filter consider using the (excellent) module itertools and in particular itertools.ifilter and itertools.ifilterfalse
Francesco
+2  A: 
>>> bar = []
>>> for i in foo:
    if i not in bar:
     bar.append(i)

>>> bar
['a', 'b', 'c', 'd']

this would be the most straightforward way of removing duplicates from the list and preserving the order as much as possible (even though "order" here is inherently wrong concept).

SilentGhost
Should at least mention the O(n^2) performance characteristic, no?
Triptych
it's `O(n*k)`, isn't?
SilentGhost
@SilentGhost - What's `k` there? Is it a constant (i.e. not part of Big-O notation) or is it some other factor?
Chris Lutz
Clearly O(n**2)
hughdbrown
I don't think it's O(n^2). It may be O(not very good), but it's not _that_ bad.
Chris Lutz
@Chris - I believe Triptych and hughdbrown are pointing out the append operation on bar is likely not constant time (a quick but unmotivated docs.python.org check didn't find much). And I do think SilentGhost meant a constant.
pbh101
Never mind. I totally missed the membership test in the if statement the first seven or so times I read the function. Yep, O(n^2).
pbh101
what pbh said. Membership test in a list is O(n). That * the loop = O(n^2). This page is helpful: http://wiki.python.org/moin/TimeComplexity
Triptych
what I was referring to is that `bar` and `foo` have different size. it would be `n**2` only if `foo` doesn't have any duplicates, it's better otherwise.
SilentGhost
+1  A: 

If you write a function to do this i would use a generator, it just wants to be used in this case.

def unique(iterable):
    yielded = set()
    for item in iterable:
        if item not in yielded:
            yield item
            yielded.add(item)
DasIch
+2  A: 

Since there isn't an order-preserving answer with a list comprehension, I propose the following:

>>> temp = set()
>>> [c for c in foo if c not in temp and (temp.add(c) or True)]
['a', 'b', 'c', 'd']

which could also be written as

>>> temp = set()
>>> filter(lambda c: c not in temp and (temp.add(c) or True), foo)
['a', 'b', 'c', 'd']

Depending on how many elements are in foo, you might have faster results through repeated hash lookups instead of repeated iterative searches through a temporary list.

c not in temp verifies that temp does not have an item c; and the or True part forces c to be emitted to the output list when the item is added to the set.

Mark Rushakoff
Why store items that have been found already in a hash of `None`s instead of a `set`?
Chris Lutz
Because I didn't think that one through all the way.
Mark Rushakoff
+1  A: 

Inspired by Francesco's answer, rather than making our own filter()-type function, let's make the builtin do some work for us:

def unique(a, s=set()):
    if a not in s:
        s.add(a)
        return True
    return False

Usage:

uniq = filter(unique, orig)

This may or may not perform faster or slower than an answer that implements all of the work in pure Python. Benchmark and see. Of course, this only works once, but it demonstrates the concept. The ideal solution is, of course, to use a class:

class Unique(set):
    def __call__(self, a):
        if a not in self:
            self.add(a)
            return True
        return False

Now we can use it as much as we want:

uniq = filter(Unique(), orig)

Once again, we may (or may not) have thrown performance out the window - the gains of using a built-in function may be offset by the overhead of a class. I just though it was an interesting idea.

Chris Lutz
What happens if you run this twice? `uniq = filter(Unique(), range(10)); print uniq`
hughdbrown
Actually, I meant if you run this twice: `uniq = filter(unique, range(10)); print uniq`
hughdbrown
The `unique` version only works once. Running it a second time on the same data will produce no data, because the function only has one set (the second argument). Running it twice on different data can produce unexpected results, as it will weed out the overlap of the two data sets as well as the duplicates of the second set. The function was my first version, and its limitations led me to create the class version, which suffers no such problems (and is also more generally useful). The function version was shown as a thought-process thing, nothing more.
Chris Lutz
+1  A: 

This is what you want if you need a sorted list at the end:

>>> foo = ['a','b','c','a','b','d','a','d']
>>> bar = sorted(set(foo))
>>> bar
['a', 'b', 'c', 'd']
hughdbrown
list comprehension is redundant. could just say: bar = sorted(set(foo))
recursive
Nice answer -- answers the question directly (although not entirely honestly IMOHO) which has output in natural ordering. +1.
pst
@pst: Not entirely honestly? Like I am trying to pull something over on you? Or...what? I don't get where the OP asked for a stable sort, so I just sorted them because it looked like that *was* a requirement. @recursive: good call. I'll edit that.
hughdbrown