ansaurus

Question

In Python, what is the fastest algorithm for removing duplicates from a list so that all elements are unique *while preserving order*?

Answer 1

A:

I have no experience with Python, but one algorithm would be to sort the list, remove duplicates (by comparing each item to the previous one), and finally restore the original order by looking up each remaining item's position in the old list.

Longer answer: http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52560

2008-09-18 01:29:06
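
The sort-based idea can be made order-preserving by remembering each element's original position. A minimal sketch, assuming the elements can be compared with one another (a mixed list such as [1, 'a'] will not sort in Python 3); the name unique_by_sorting is made up for illustration. It runs in O(n log n).

```python
def unique_by_sorting(items):
    # Pair each value with its original position, then sort by value only.
    # sorted() is stable, so equal values stay in their original order.
    decorated = sorted(enumerate(items), key=lambda pair: pair[1])
    kept = []
    for index, value in decorated:
        # Keep only the first element of each run of equal values.
        if not kept or kept[-1][1] != value:
            kept.append((index, value))
    # Restore the original order using the saved positions.
    kept.sort()
    return [value for index, value in kept]

print(unique_by_sorting([3, 1, 3, 2, 1]))  # [3, 1, 2]
```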

Needs to preserve order

Flame 2008-09-18 04:19:29

Answer 2

A:

I haven't done any tests, but one possible algorithm might be to create a second list, and iterate through the first list. If an item is not in the second list, add it to the second list.

x = [1, 1, 2, 'a', 'a', 3]
y = []
for each in x:
    if each not in y:
        y.append(each)

Matthew Schinckel 2008-09-18 01:29:11

I find your use of the variable name "each" really confusing to read, probably because in many languages it is a keyword. It's much clearer to use item or just i.

rjmunro 2008-11-12 01:00:49

'i' to me implies an index - we aren't iterating through indices, we are iterating through objects. I'd prefer item, but I don't see 'each' as bad - just because it is a keyword in another language, why prevent its use here? Syntax highlighting (as shown above) picks it up fine...

Matthew Schinckel 2008-11-26 02:30:30

Other than AppleScript, what languages use the word 'each' as a keyword?

Matthew Schinckel 2008-11-26 02:33:55

Answer 3

+3 A:

Taken from http://www.peterbe.com/plog/uniqifiers-benchmark

def f5(seq, idfun=None):  
    # order preserving 
    if idfun is None: 
        def idfun(x): return x 
    seen = {} 
    result = [] 
    for item in seq: 
        marker = idfun(item) 
        # in old Python versions: 
        # if seen.has_key(marker) 
        # but in new ones: 
        if marker in seen: continue 
        seen[marker] = 1 
        result.append(item) 
    return result

ctcherry 2008-09-18 01:30:50
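
The idfun parameter decides what counts as "the same" item. A usage sketch, restating the function compactly with a set and using str.lower as an illustrative key so that duplicates are detected case-insensitively (the sample input is made up):

```python
def f5(seq, idfun=None):
    # Order-preserving; idfun maps each item to the value used for comparison.
    if idfun is None:
        idfun = lambda x: x
    seen = set()
    result = []
    for item in seq:
        marker = idfun(item)
        if marker in seen:
            continue
        seen.add(marker)
        result.append(item)
    return result

print(f5(['a', 'A', 'b', 'B', 'a']))             # ['a', 'A', 'b', 'B']
print(f5(['a', 'A', 'b', 'B', 'a'], str.lower))  # ['a', 'b']
```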

It is slower than the corresponding in-place version (at least for some inputs). See http://stackoverflow.com/questions/89178/#282589

J.F. Sebastian 2008-11-12 00:26:17

Answer 4

A:

>>> def unique(list):
...   y = []
...   for x in list:
...     if x not in y:
...       y.append(x)
...   return y

etchasketch 2008-09-18 01:32:47

This one is also O(n^2)

Terhorst 2008-09-18 01:58:49

To explain why: searching for x in a list structure (y) is O(n), while searching for x in a set (or dictionary) is O(1).

Hamish Downer 2008-09-27 16:17:37
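
A rough way to see the difference for yourself. This is only a sketch: the collection size and repeat count are arbitrary, and the absolute numbers will vary by machine.

```python
import timeit

values = list(range(2000))
as_list = values
as_set = set(values)
missing = -1  # worst case for the list: every element gets compared

# Time 1000 membership tests against each container.
list_time = timeit.timeit(lambda: missing in as_list, number=1000)
set_time = timeit.timeit(lambda: missing in as_set, number=1000)

print("list: %.6f s, set: %.6f s" % (list_time, set_time))
```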

Answer 5

+10 A:

What's going to be fastest depends on what percentage of your list is duplicates. If it's nearly all duplicates, with few unique items, creating a new list will probably be faster. If it's mostly unique items, removing the duplicates from the original list (or a copy) will be faster.

Here's one for modifying the list in place:

def unique(items):
  seen = set()
  for i in xrange(len(items)-1, -1, -1):
    it = items[i]
    if it in seen:
      del items[i]
    else:
      seen.add(it)

Iterating backwards over the indices ensures that removing items doesn't affect the iteration.

Allen 2008-09-18 01:33:44

This gives different results to the other solutions (the OP didn't specify which is correct), as regards which duplicate to keep.
This solution: [1, 2, 1] -> [2, 1]
Other solutions: [1, 2, 1] -> [1, 2]

James Hopkin 2008-09-18 09:24:50
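
To make the difference concrete, here is a sketch of both in-place behaviours side by side, written for Python 3 (range instead of xrange). The function names are invented for illustration.

```python
def unique_keep_last(items):
    # Backward scan, as in the answer above: the last occurrence survives.
    seen = set()
    for i in range(len(items) - 1, -1, -1):
        if items[i] in seen:
            del items[i]
        else:
            seen.add(items[i])

def unique_keep_first(items):
    # Forward scan that compacts survivors to the front, then truncates.
    seen = set()
    write = 0
    for item in items:
        if item not in seen:
            seen.add(item)
            items[write] = item
            write += 1
    del items[write:]

a = [1, 2, 1]
unique_keep_last(a)
print(a)  # [2, 1]

b = [1, 2, 1]
unique_keep_first(b)
print(b)  # [1, 2]
```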

I added a clarification about this in the question text.

Jeff Miller 2008-09-18 11:50:02

Answer 6

A:

O(n) if the dict is a hash table, O(n log n) if it is a tree; simple, and now fixed. Thanks to Matthew for the suggestion. Sorry, I don't know the underlying types.

def unique(x):    
  output = []
  y = {}
  for item in x:
    y[item] = ""

  for item in x:
    if item in y:
      output.append(item)
      del y[item]  # forget the item so later duplicates are skipped

  return output

Wesley Tarle 2008-09-18 01:35:33

FYI, you can also do that with a set so you don't have to set it equal to an empty string.

Jason Baker 2008-09-18 02:42:02

Answer 7

+20 A:

def unique(items):
    found = set([])
    keep = []

    for item in items:
        if item not in found:
            found.add(item)
            keep.append(item)

    return keep

print unique([1, 1, 2, 'a', 'a', 3])

Terhorst 2008-09-18 01:41:18

set() is better than set([]).

Constantin 2008-09-29 15:22:32

In-place algorithms are faster. See James's answer and mine.

J.F. Sebastian 2008-11-12 00:44:17

Answer 8

+3 A:

You can actually do something really cool in Python to solve this. You can create a list comprehension that would reference itself as it is being built. As follows:

   # remove duplicates...
   def unique(my_list):
       return [x for x in my_list if x not in locals()['_[1]'].__self__]

Edit: I removed the "self", and it works on Mac OS X, Python 2.5.1.

The _[1] is Python's "secret" reference to the new list. The above, of course, is a little messy, but you could adapt it to fit your needs as necessary. For example, you can actually write a function that returns a reference to the comprehension; it would look more like:

return [x for x in my_list if x not in this_list()]

Jake 2008-09-18 01:43:03

I have never seen that hack before, kudos!

Jerub 2008-09-18 01:47:22

That's pretty sweet. Thanks.

Terhorst 2008-09-18 01:54:38

The order is O(n^2), though.

Terhorst 2008-09-18 01:57:07

The example as given does not compile for me -- the trailing ".__self__" is not valid [[Linux 2.6 w/ Python 2.5.1]]

Kevin Little 2008-09-18 02:19:52

Holy cow, you're turning Python into Perl with the magic underscore business. Just say no.

Parand 2009-03-13 03:49:58

Answer 9

A:

>>> x=[1,1,2,'a','a',3]
>>> y = [ _x for _x in x if not _x in locals()['_[1]'] ]
>>> y
[1, 2, 'a', 3]

"locals()['_[1]']" is the "secret name" of the list being created.

Kevin Little 2008-09-18 01:54:40

Presence of _[1] local is not guaranteed by language.

Constantin 2008-09-29 15:20:14

"<item> in <list>" is O(n), so this is slow.

Charles Duffy 2008-11-24 10:41:07

Answer 10

+3 A:

One-liner:

new_list = reduce(lambda x,y: x+[y][:1-int(y in x)], my_list, [])

Tyler 2008-09-18 01:59:05
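
How the one-liner works: y in x is a boolean, so the slice is [y][:1] (which keeps y) for a new item and [y][:0] (an empty list) for a duplicate. A sketch for Python 3, where reduce lives in functools; the variable names follow the answer and the input is borrowed from an earlier answer. The list membership test and the repeated list concatenation still make it O(n^2).

```python
from functools import reduce

my_list = [1, 1, 2, 'a', 'a', 3]

# 1 - int(False) == 1  ->  [y][:1] == [y]  (new item, appended)
# 1 - int(True)  == 0  ->  [y][:0] == []   (duplicate, nothing appended)
new_list = reduce(lambda x, y: x + [y][:1 - int(y in x)], my_list, [])

print(new_list)  # [1, 2, 'a', 3]
```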

Answer 11

+2 A:

Do the duplicates necessarily need to be in the list in the first place? There's no overhead as far as looking the elements up, but there is a little bit more overhead in adding elements (though the overhead should be O(1) ).

>>> x  = []
>>> y = set()
>>> def add_to_x(val):
...     if val not in y:
...             x.append(val)
...             y.add(val)
...     print x
...     print y
... 
>>> add_to_x(1)
[1]
set([1])
>>> add_to_x(1)
[1]
set([1])
>>> add_to_x(1)
[1]
set([1])
>>>

Jason Baker 2008-09-18 02:06:40
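
The same idea can be wrapped in a small class so that the list and the set cannot drift apart. A sketch only: the class name UniqueList and its method names are invented for illustration, and the items must be hashable.

```python
class UniqueList:
    """Append-only list that silently ignores items it already holds."""

    def __init__(self):
        self._items = []    # keeps insertion order
        self._seen = set()  # O(1) average-case membership test

    def add(self, value):
        if value not in self._seen:
            self._seen.add(value)
            self._items.append(value)

    def as_list(self):
        # Return a copy so callers cannot bypass the duplicate check.
        return list(self._items)

u = UniqueList()
for value in [1, 1, 2, 'a', 'a', 3]:
    u.add(value)
print(u.as_list())  # [1, 2, 'a', 3]
```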

Answer 12

+1 A:

has_key in python is O(1). Insertion and retrieval from a hash is also O(1). Loops through n items twice, so O(n).

def unique(list):
  s = {}
  output = []
  for x in list:
    count = 1
    if(s.has_key(x)):
      count = s[x] + 1

    s[x] = count
  for x in list:
    count = s[x]
    if(count > 0):
      s[x] = 0
      output.append(x)
  return output

etchasketch 2008-09-18 04:07:21

Answer 13

A:

One pass.

a = [1,1,'a','b','c','c']

new_list = []
prev = None

while 1:
    try:
        i = a.pop(0)
        if i != prev:
            new_list.append(i)
        prev = i
    except IndexError:
        break

Sergei Stolyarov 2008-09-18 05:08:13

Requires sorted input, doesn't it?

Constantin 2008-09-29 15:20:56
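
To illustrate the point: comparing each item only with the previous one removes adjacent duplicates, which is what itertools.groupby does as well. A sketch (the helper name is made up) showing that non-adjacent duplicates survive:

```python
from itertools import groupby

def drop_adjacent_duplicates(items):
    # groupby collects runs of equal neighbours; keep one key per run.
    return [key for key, _group in groupby(items)]

print(drop_adjacent_duplicates([1, 1, 'a', 'b', 'c', 'c']))  # [1, 'a', 'b', 'c']
print(drop_adjacent_duplicates([1, 2, 1]))                   # [1, 2, 1]
```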

Answer 14

+9 A:

Using:

lst = [8, 8, 9, 9, 7, 15, 15, 2, 20, 13, 2, 24, 6, 11, 7, 12, 4, 10, 18, 13, 23, 11, 3, 11, 12, 10, 4, 5, 4, 22, 6, 3, 19, 14, 21, 11, 1, 5, 14, 8, 0, 1, 16, 5, 10, 13, 17, 1, 16, 17, 12, 6, 10, 0, 3, 9, 9, 3, 7, 7, 6, 6, 7, 5, 14, 18, 12, 19, 2, 8, 9, 0, 8, 4, 5]

And using the timeit module:

$ python -m timeit -s 'import uniquetest' 'uniquetest.etchasketch(uniquetest.lst)'

(and so on for the various other functions -- which I named after their posters), I have the following results (on my first generation Intel MacBook Pro):

Allen: 14.6 usec per loop [1]
Terhorst: 26.6 usec per loop
Tarle: 44.7 usec per loop
ctcherry: 44.8 usec per loop
Etchasketch 1 (the short one): 64.6 usec per loop
Schinckel: 65 usec per loop
Etchasketch 2: 71.6 usec per loop
Little: 89.4 usec per loop
Tyler: 179 usec per loop

[1] Note that Allen modifies the list in place – I believe this has skewed the time, in that the timeit module runs the code 100000 times and 99999 of them are with the dupe-less list.

Summary: Straightforward implementation with sets wins over confusing one-liners :-)

John Fouhy 2008-09-18 05:14:19
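
One way to avoid the skew mentioned in note [1] is to hand the in-place function a fresh copy on every run. A sketch that calls timeit from Python code: the two functions are stand-ins for the kinds of answers being compared, the input is a shortened version of the list above, and the copy adds a small constant cost to the in-place timing.

```python
import timeit

def unique_new_list(items):
    # Stand-in for the answers that build and return a new list.
    seen = set()
    keep = []
    for item in items:
        if item not in seen:
            seen.add(item)
            keep.append(item)
    return keep

def unique_in_place(items):
    # Stand-in for the answers that modify the list they are given.
    seen = set()
    for i in range(len(items) - 1, -1, -1):
        if items[i] in seen:
            del items[i]
        else:
            seen.add(items[i])

lst = [8, 8, 9, 9, 7, 15, 15, 2, 20, 13, 2, 24, 6, 11, 7, 12]

runs = 10000
t_new = timeit.timeit(lambda: unique_new_list(lst), number=runs)
# list(lst) makes a fresh copy, so every run starts with the duplicates present.
t_in_place = timeit.timeit(lambda: unique_in_place(list(lst)), number=runs)

print("new list: %.2f usec per loop" % (t_new / runs * 1e6))
print("in place: %.2f usec per loop" % (t_in_place / runs * 1e6))
```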

James suggested a faster version. See http://stackoverflow.com/questions/89178/#91430

J.F. Sebastian 2008-11-12 00:16:43

Answer 15

A:

I don't know if this one is fast or not, but at least it is simple.

Simply convert it to a set and then back to a list:

def unique(container):
  return list(set(container))

Franck Mesirard 2008-09-18 08:51:32

This does not preserve order.

Eli Courtwright 2008-09-18 13:17:29

Answer 16

+6 A:

This is the fastest in-place method I've found (assuming a large proportion of duplicates):

def unique(l):
    s = set()
    n = 0
    for x in l:
        if x not in s:
            s.add(x)
            l[n] = x
            n += 1
    del l[n:]

This is 10% faster than Allen's implementation, on which it is based (timed with timeit.repeat, JIT compiled by psyco). It keeps the first instance of any duplicate.

repton-infinity: I'd be interested if you could confirm my timings.

James Hopkin 2008-09-18 10:17:44

Dictionaries are slightly faster than sets. See my answer http://stackoverflow.com/questions/89178/#282589

J.F. Sebastian 2008-11-12 00:14:05

Answer 17

+1 A:

Benchmark and a clear answer at:

http://www.peterbe.com/plog/uniqifiers-benchmark

e-satis 2008-09-18 11:54:34

Answer 18

+1 A:

There are some great, efficient solutions here. However, for anyone not concerned with the absolute most efficient O(n) solution, I'd go with the simple one-liner O(n^2) solution (sorted calls the key function once per element, and each xs.index call is O(n)):

def unique(xs):
    return sorted(set(xs), key=lambda x: xs.index(x))

or the more efficient two-liner O(n*log(n)) solution:

def unique(xs):
    positions = dict((e,pos) for pos,e in reversed(list(enumerate(xs))))
    return sorted(set(xs), key=lambda x: positions[x])

Eli Courtwright 2008-09-18 13:23:05

That code is difficult to understand, and you say it's less efficient than the other solutions already presented here. So why would you go with it?

Jeff Miller 2008-09-18 17:40:52

I consider this easy to understand; passing a lambda function as the key parameter of sorted is really the canonical way to sort a list in Python. Most of my Python work involves generating reports on lists of statistics, and so to me this seems like the simplest and most Pythonic approach.

Eli Courtwright 2008-09-19 12:51:58

While I agree your solution is succinct, the question asked for the fastest algorithm, not the most Pythonic.

Jeff Miller 2008-10-01 22:30:07

Answer 19

A:

If you take out the empty list from the call to set() in Terhorst's answer, you get a little speed boost.

Change: found = set([])
to: found = set()

However, you don't need the set at all.

def unique(items):
    keep = []

    for item in items:
        if item not in keep:
            keep.append(item)

    return keep

Using timeit I got these results:

with set([]) -- 4.97210427363
with set() -- 4.65712377445
with no set -- 3.44865284975

2008-09-19 09:42:20

Yeah, when you have little data, I bet the set's internal mechanism is slower than iterating over a list. But if you have many elements, I think sets are faster. Otherwise, what would be the point of this data structure? ;-)

e-satis 2008-09-23 21:21:05

Answer 20

+4 A:

Obligatory generator-based variation:

def unique(seq):
  seen = set()
  for x in seq:
    if x not in seen:
      seen.add(x)
      yield x

Constantin 2008-09-27 15:54:03
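
Because it is a generator, it yields results lazily and also works on iterators that never end, as long as the consumer stops asking. A usage sketch with itertools.islice (the endless input is made up for illustration):

```python
import itertools

def unique(seq):
    seen = set()
    for x in seq:
        if x not in seen:
            seen.add(x)
            yield x

# An endless stream that repeats 3, 1, 3, 2 forever.
endless = itertools.cycle([3, 1, 3, 2])

# islice stops after three results, so the generator is never asked for a fourth.
first_three = list(itertools.islice(unique(endless), 3))
print(first_three)  # [3, 1, 2]
```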

Answer 21

A:

a=[1,2,3,4,5,7,7,8,8,9,9,3,45]

def unique(l):
    ids={}
    for item in l:
        if not ids.has_key(item):
            ids[item]=item
    return ids.keys()

print a

print unique(a)

----------------------------

Inserting the elements takes theta(n). Checking whether an element already exists takes constant time, so testing all the items also takes theta(n). The whole solution is therefore theta(n). Bear in mind that a dictionary in Python is implemented as a hash table.

aboSamoor 2008-11-11 00:32:18

The question says "*while preserving order*". A Python dictionary doesn't preserve order.

J.F. Sebastian 2008-11-24 11:28:10
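
That was true when this thread was written. Since Python 3.7 the language guarantees that dict keeps insertion order, so on such versions a dictionary can do the job directly. A sketch, assuming Python 3.7 or later and hashable items:

```python
def unique(items):
    # dict.fromkeys keeps the first occurrence of each key, in insertion order.
    return list(dict.fromkeys(items))

print(unique([1, 2, 3, 4, 5, 7, 7, 8, 8, 9, 9, 3, 45]))
# [1, 2, 3, 4, 5, 7, 8, 9, 45]
```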

Answer 22

+7 A:

Here is the fastest solution so far (for the following input):

def del_dups(seq):
    seen = {}
    pos = 0
    for item in seq:
        if item not in seen:
            seen[item] = True
            seq[pos] = item
            pos += 1
    del seq[pos:]

lst = [8, 8, 9, 9, 7, 15, 15, 2, 20, 13, 2, 24, 6, 11, 7, 12, 4, 10, 18, 
       13, 23, 11, 3, 11, 12, 10, 4, 5, 4, 22, 6, 3, 19, 14, 21, 11, 1, 
       5, 14, 8, 0, 1, 16, 5, 10, 13, 17, 1, 16, 17, 12, 6, 10, 0, 3, 9, 
       9, 3, 7, 7, 6, 6, 7, 5, 14, 18, 12, 19, 2, 8, 9, 0, 8, 4, 5]
del_dups(lst)
print(lst)
# -> [8, 9, 7, 15, 2, 20, 13, 24, 6, 11, 12, 4, 10, 18, 23, 3, 5, 22, 19, 14, 
#     21, 1, 0, 16, 17]

Dictionary lookup is slightly faster than set lookup in Python 3.

J.F. Sebastian 2008-11-12 00:04:22

Answer 23

A:

A one-liner for this (it builds a new list rather than modifying x in place):

>>> x = [1, 1, 2, 'a', 'a', 3]
>>> [ item for pos,item in enumerate(x) if x.index(item)==pos ]
[1, 2, 'a', 3]

Mario Ruggier 2010-04-09 13:11:48
