biglist = [
    {'title':'U2 Band','link':'u2.com'},
    {'title':'ABC Station','link':'abc.com'},
    {'title':'Live Concert by U2','link':'u2.com'}
]

I would like to remove the third element of the list, because its "link" value ("u2.com") is a duplicate. I don't want duplicate "link" values. What is the most efficient code to do this, so that it results in:

biglist = [
    {'title':'U2 Band','link':'u2.com'},
    {'title':'ABC Station','link':'abc.com'}
]

I have tried many approaches, including several nested `for ... in ...` loops, but they are inefficient and verbose.

+2  A: 

Make a new dictionary, with 'u2.com' and 'abc.com' as the keys, and your list elements as the values. The dictionary will enforce uniqueness. Something like this:

uniquelist = dict((element['link'], element) for element in reversed(biglist))

(The `reversed` is there so that, for each link, the first element in the list is the one that remains in the dictionary. If you take it out, you will get the last one instead.)

Then you can get elements back into a list like this:

biglist = uniquelist.values()
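Putting both steps together (note: in Python 3, `dict.values()` returns a view, so wrap it in `list()` if you need an actual list back):

```python
biglist = [
    {'title': 'U2 Band', 'link': 'u2.com'},
    {'title': 'ABC Station', 'link': 'abc.com'},
    {'title': 'Live Concert by U2', 'link': 'u2.com'},
]

# Later keys overwrite earlier ones, so iterating in reverse
# means the *first* occurrence of each link wins.
uniquelist = dict((element['link'], element) for element in reversed(biglist))

biglist = list(uniquelist.values())
```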
Ian Clelland
You have an extra, unneeded pair of parentheses -- the parentheses for dict() are good enough, you don't need another set to introduce the generator expression. Also, you don't *need* parentheses around the tuple either, although it's probably a good idea.
Daniel Pryden
BTW, +1 for using a `dict`. Assuming you have a real `list` (not just any iterable), `reversed()` should be O(1), building the `dict` should be O(n), and extracting the values should be O(n). So this seems like it should be the fastest way, especially for a large list.
Daniel Pryden
Scratch that. `reversed()` of a list will be O(n), due to copying (at least in Python 2.x). It still seems like overall performance should be O(n) though, which should be faster than sorting the list.
Daniel Pryden
How is this a reasonable answer when it outputs a dict, and the OP requested a list? What if original order is important?
Triptych
@Daniel: I didn't realise that the parens were optional around a generator expression, as long as it's the only argument to the function, although it turns out that you *do* need them around the tuple to avoid ambiguity in the parser. Edited.
Ian Clelland
@Triptych: The output of this code is what you make of it -- I mentioned that when you want a list back, you just call values() on the intermediate dict object. "What if original order is important?" Then don't use this code (or any of the other answers here, except for Alex Martelli's iterative solution.)
Ian Clelland
+2  A: 

You can sort the list, using the link field of each dictionary as the sort key, then iterate through the list once and remove duplicates (or rather, create a new list with duplicates removed, as is the Python idiom), like so:

# sort the list using the 'link' item as the sort key
biglist.sort(key=lambda elt: elt['link'])

newbiglist = []
for item in biglist:
    if newbiglist == [] or item['link'] != newbiglist[-1]['link']:
        newbiglist.append(item)

This code will give you the first element (relative ordering in the original biglist) for any group of "duplicates". This is true because the .sort() algorithm used by Python is guaranteed to be a stable sort -- it does not change the order of elements determined to be equal to one another (in this case, elements with the same link).
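The same sort-then-scan idea can also be written with `itertools.groupby` (a variant sketch, not part of the original answer): after sorting, duplicates are adjacent, and `groupby` hands you one group per link.

```python
from itertools import groupby

biglist = [
    {'title': 'U2 Band', 'link': 'u2.com'},
    {'title': 'ABC Station', 'link': 'abc.com'},
    {'title': 'Live Concert by U2', 'link': 'u2.com'},
]

def keyfunc(elt):
    return elt['link']

# groupby only merges adjacent equal keys, so sort first.
# The sort is stable, so next(group) is the first occurrence
# (in original order) within each group of duplicates.
newbiglist = [next(group)
              for _, group in groupby(sorted(biglist, key=keyfunc), key=keyfunc)]
```

Note that the result is ordered by link, not by original position, just like the sort-based code above.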

dcrosta
I like this approach. One thing I would do is use `for item in sorted(biglist, key=lambda elt: elt['link'])` instead of sorting the biglist in place. It will (temporarily) use more memory (because of copying the biglist), but depending on your application you may not want to mutate a big data object like that.
Daniel Pryden
Yup, great point.
dcrosta
You can get a nice speedup if you just get rid of that `newbiglist == []` inside the loop. How can you get rid of it? Simply initialize newbiglist as `newbiglist = [ biglist[0] ]` and then you know there is always something there for the `newbiglist[-1]` to reference.
steveha
+1  A: 
biglist = \
[ 
    {'title':'U2 Band','link':'u2.com'}, 
    {'title':'ABC Station','link':'abc.com'}, 
    {'title':'Live Concert by U2','link':'u2.com'} 
]

def dedupe(lst):
    d = {}
    for x in lst:
        link = x["link"]
        if link in d:
            continue
        d[link] = x
    return d.values()

lst = dedupe(biglist)

dedupe() keeps the first of any duplicates.

steveha
This is pretty much the same as Ian Clelland's answer above, except it doesn't reverse the list first. Saving the reverse is faster, but doing the `link in d` test for every element seems like it might be slower than simply overwriting. In any case, the speed difference is probably lost in the noise, so it really depends on what style you prefer.
Daniel Pryden
The `link in d` test should be fast, because dictionaries use hashed lookups; and this lets you avoid the sort. But most importantly, I think this approach is less tricky, and easier to extend; what happens if you *sometimes* want the first appearance of a dupe and sometimes you want the later appearance? With this one it would be trivial to add logic to do that; Ian Clelland's version, not so much.
steveha
Hmmm, I think I spoke too hastily there. I think Ian Clelland's version could be pretty easily modified to allow for keeping the former or the later version. So it really does just depend on what style you prefer.
steveha
+1  A: 

Probably the fastest approach, for a really big list, if you want to preserve the exact order of the items that remain, is the following...:

biglist = [ 
    {'title':'U2 Band','link':'u2.com'}, 
    {'title':'ABC Station','link':'abc.com'}, 
    {'title':'Live Concert by U2','link':'u2.com'} 
]

known_links = set()
newlist = []

for d in biglist:
  link = d['link']
  if link in known_links: continue
  newlist.append(d)
  known_links.add(link)

biglist[:] = newlist
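The slice assignment on the last line is what makes this an in-place update of the original list object. A small self-contained demonstration (the `alias` variable is added here only to show the aliasing behaviour):

```python
biglist = [
    {'title': 'U2 Band', 'link': 'u2.com'},
    {'title': 'ABC Station', 'link': 'abc.com'},
    {'title': 'Live Concert by U2', 'link': 'u2.com'},
]
alias = biglist  # a second reference to the same list object

known_links = set()
newlist = []
for d in biglist:
    link = d['link']
    if link in known_links:
        continue
    newlist.append(d)
    known_links.add(link)

# In-place: replaces the contents of the existing list object,
# so `alias` sees the deduplicated result too.
# (`biglist = newlist` would merely rebind the name and leave
# `alias` pointing at the old, duplicate-laden list.)
biglist[:] = newlist
```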
Alex Martelli
What is the purpose of the last line, `biglist[:] = newlist`? Seems odd; usually `a = b[:]` is done to make a (shallow) copy of a list
dbr
@dbr, `barename = whatever` re-binds `barename` -- but if there are any other references extant to whatever `barename` used to refer to, they're unaffected. So, in this case, that would **not** "remove the duplicates from the list", as requested: it would make a _new_ list object without duplicates, but leave the old list still full of dupes (if there are any other references keeping that list object alive, of course). `somelist[:] = whatever` is **completely** different: **this** statement changes the contents of _the list object itself_ -- i.e., it **is** "removing duplicates from _the_ list"!
Alex Martelli
I like this because it is stable: the items returned will be in the same order as they appeared in the original list. That is important in some applications.
Vebjorn Ljosa
A: 

You can use a `defaultdict` to group items by link, then remove duplicates if you want to.

from collections import defaultdict

nodupes = defaultdict(list)
for d in biglist:
    nodupes[d['link']].append(d['title'])

This will give you:

defaultdict(<type 'list'>, {'abc.com': ['ABC Station'], 'u2.com': ['U2 Band', 
'Live Concert by U2']})
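If you then want a deduplicated list of dicts back, you could flatten the grouping by keeping the first title per link -- a sketch extending the answer above:

```python
from collections import defaultdict

biglist = [
    {'title': 'U2 Band', 'link': 'u2.com'},
    {'title': 'ABC Station', 'link': 'abc.com'},
    {'title': 'Live Concert by U2', 'link': 'u2.com'},
]

grouped = defaultdict(list)
for d in biglist:
    grouped[d['link']].append(d['title'])

# titles[0] is the first title seen for each link,
# since append() preserves encounter order.
deduped = [{'title': titles[0], 'link': link}
           for link, titles in grouped.items()]
```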
Imran