views: 418
answers: 6

This is actually an extension of this question: http://stackoverflow.com/questions/1534736/how-to-remove-these-duplicates-in-a-list-python. The answers to that question did not keep the "order" of the list after removing duplicates.

biglist = [
    {'title':'U2 Band','link':'u2.com'},
    {'title':'Live Concert by U2','link':'u2.com'},
    {'title':'ABC Station','link':'abc.com'}
]

In this case, the second element should be removed because an element with the link 'u2.com' already appears earlier in the list. However, the order should be kept.
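
In other words, the expected result (as I read the question) would be:

[
    {'title':'U2 Band','link':'u2.com'},
    {'title':'ABC Station','link':'abc.com'}
]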

A: 

A super easy way to do this is:

def uniq(a):
    # keep the first occurrence of each element, preserving the original order
    if len(a) == 0:
        return []
    else:
        return [a[0]] + uniq([x for x in a if x != a[0]])

This is not the most efficient way, because:

  • it searches through the whole list for every element in the list, so it's O(n^2)
  • it's recursive, so it uses stack depth proportional to the number of distinct elements (up to the length of the list)

However, for simple uses (no more than a few hundred items, not performance critical) it is sufficient.
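
For example, a quick check of mine with plain integers (uniq as written compares whole elements; for the dicts in the question you would compare on x['link'] instead):

>>> uniq([1, 2, 1, 3, 2])
[1, 2, 3]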

Greg Hewgill
Can anyone come up with a way that is scalable?
TIMEX
Kalmi's answer points to a number of good solutions.
Greg Hewgill
+1  A: 

This page discusses different methods and their speeds: http://www.peterbe.com/plog/uniqifiers-benchmark

The recommended* method:

def f5(seq, idfun=None):  
    # order preserving 
    if idfun is None: 
        def idfun(x): return x 
    seen = {} 
    result = [] 
    for item in seq: 
        marker = idfun(item) 
        # in old Python versions: 
        # if seen.has_key(marker) 
        # but in new ones: 
        if marker in seen: continue 
        seen[marker] = 1 
        result.append(item) 
    return result

f5(biglist, lambda x: x['link'])

*by that page
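
Applied to the biglist above, that call keeps the first entry per link, i.e. it returns:

[{'title': 'U2 Band', 'link': 'u2.com'}, {'title': 'ABC Station', 'link': 'abc.com'}]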

Kalmi
A: 
dups = {}
newlist = []
for x in biglist:
    if x['link'] not in dups:
        newlist.append(x)
        dups[x['link']] = None

print newlist

produces

[{'link': 'u2.com', 'title': 'U2 Band'}, {'link': 'abc.com', 'title': 'ABC Station'}]

Note that here I used a dictionary. This makes the test x['link'] not in dups much more efficient than it would be with a list.
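
A rough illustration of mine (not part of the original answer): membership tests against a dict or set are hash lookups, while against a list they are linear scans.

import timeit

setup = "items = list(range(100000)); d = dict.fromkeys(items); lst = list(items)"
print(timeit.timeit("99999 in d", setup=setup, number=1000))    # hash lookup: fast
print(timeit.timeit("99999 in lst", setup=setup, number=1000))  # linear scan: slow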

Peter
You're wrong about checking in a dict being faster than in a set (lists are a completely different matter).
Alex Martelli
ok, fixed, thanks. I guess set is probably implemented with a hash.
Peter
A: 

I think using a set should be pretty efficient.

seen_links = set()
index = 0
while index < len(biglist):
    link = biglist[index]['link']
    if link in seen_links:
        del biglist[index]   # drop the duplicate in place
    else:
        seen_links.add(link)
        index += 1

I think the set lookups make this roughly O(n), although each del from a list is itself O(n), so the worst case is closer to O(n^2).

ABentSpoon
+5  A: 

My answer to your other question (which you completely ignored!) shows you're wrong in claiming that

The answers of that question did not keep the "order"

  • my answer did keep order, and it clearly said it did. Here it is again, with added emphasis to see if you can just keep ignoring it...:

Probably the fastest approach, for a really big list, if you want to preserve the exact order of the items that remain, is the following...:

biglist = [ 
    {'title':'U2 Band','link':'u2.com'}, 
    {'title':'ABC Station','link':'abc.com'}, 
    {'title':'Live Concert by U2','link':'u2.com'} 
]

known_links = set()
newlist = []

for d in biglist:
  link = d['link']
  if link in known_links: continue
  newlist.append(d)
  known_links.add(link)

biglist[:] = newlist
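
A side note of mine on the last line: biglist[:] = newlist is a slice assignment, so it replaces the contents of the existing list object rather than rebinding the name; any other references to biglist then see the deduplicated result.

other = biglist          # another name for the same list object
biglist[:] = newlist     # slice assignment: contents replaced in place
print(other is biglist)  # True: still the same object, now deduplicated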
Alex Martelli
Thanks a lot Alex Martelli! I didn't realize this. This is perfect, thanks.
TIMEX
+1  A: 

Generators are great.

def unique(seq, key=None):
    # yield items in their original order, skipping any whose key was already seen
    seen = set()
    for item in seq:
        marker = key(item) if key is not None else item
        if marker not in seen:
            seen.add(marker)
            yield item

biglist[:] = unique(biglist, key=lambda d: d['link'])
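
For what it's worth, the keyed version also works as a general order-preserving uniquifier when the items themselves are hashable (my example):

print(list(unique([3, 1, 3, 2, 1])))   # prints [3, 1, 2]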
THC4k