tags:

views:

362

answers:

8

I have a list of strings that should be unique. I want to be able to check for duplicates quickly. Specifically, I'd like to be able to take the original list and produce a new list containing any repeated items. I don't care how many times the items are repeated so it doesn't have to have a word twice if there are two duplicates.

Unfortunately, I can't think of a way to do this that wouldn't be clunky. Any suggestions?

EDIT: Thanks for the answers and I thought I'd make a clarification. I'm not concerned with having a list of uniques for it's own sake. I'm generating the list based off of text files and I want to know what the duplicates are so I can go in the text files and remove them if any show up.

+17  A: 

This code should work:

duplicates = set()
found = set()
for item in source:
    if item in found:
        duplicates.add(item)
    else:
        found.add(item)
Staale
I'll have to make sure this works later but this fits what I was thinking and looks clean.
Eugene M
Any specific advantage by using set() than lists? 'in' works for lists also?
Lakshman Prasad
@becomingGuru, set() will prevent duplicate additions but is not ordered. Advantages/disadvantages depend on your application
jcoon
@coonj: i think he meant set for found, it really is not necessary.
SilentGhost
@becomingGuru: I think 'in' performs better with sets than with lists. I haven't looked too much into the implementation of this, but my own benchmarks verify it.
David Berger
You will then have duplicates in the duplicates list if you don't use a set.
Staale
Sets are hash tables I believe, which means inclusion testing is O(1) on average, whereas inclusion with lists is O(n).
John Fouhy
+5  A: 

This will create the list in one line:

L = [1, 2, 3, 3, 4, 4, 4]
L_dup = set([i for i in L if L.count(i) > 1])
jcoon
Although short, this code has O(N²) performance. For each item in L, a full count of that item in L will be needed. Also only in 2.6
Staale
this works in 2.5
jcoon
@Staale, Brian: do you know the typical size of a input Eugene M is working with so that he needs to care about performance?
SilentGhost
Actually N is around 20-50 short strings if that information helps.
Eugene M
+1 if it's for 50 strings, perf are not an issue.
e-satis
another superfluous use of a list comprehension when a generator expression would do the job.
hop
+2  A: 

Definitely not the fastest way to do that, but it seem to work solve the problem:

>>> lst = [23, 32, 23, None]
>>> set(i for i in lst if lst.count(i) > 1)
{23}
SilentGhost
I like this approach! O(n) and O(n*n) for small lists should be alright.
Lakshman Prasad
+6  A: 

groupby from itertools will probably be useful here:


from itertools import groupby
duplicated=[k for (k,g) in groupby(sorted(l)) if len(list(g)) > 1]

Basically you use it to find elements that appear more than once...

NB. the call to sorted is needed, as groupby only works properly if the input is sorted.

John Montgomery
+2  A: 

If you don't care about the order of the duplicates:

a = [1, 2, 3, 4, 5, 4, 6, 4, 7, 8, 8]
b = sorted(a)
duplicates = set([x for x, y in zip(b[:-1], b[1:]) if x == y])
unbeknown
A: 

EDIT : Ok, doesn't work since you want duplicates only.

Whith python > 2.4 :

You have set, just do :

my_filtered_list = list(set(mylist))

Set is a data structure that doesn't have duplicate by nature.

With older Python versions :

my_filtered_list = list(dict.fromkeys(mylist).keys())

Dictionary map a unique key to a value. We use the "unique" caracteristc to get rid of the duplicate.

e-satis
A: 

Personally, I think this is the simplest way to do it with performance O(n). Similar to vartec's solution but no import required and no Python version dependencies to worry about:

def getDuplicates(iterable):
    d = {}
    for i in iterable:
        d[i] = d.get(i, 0) + 1
    return [i for i in d if d[i] > 1]
mhawke
A: 

the solutions based on 'set' have a small drawback, namely they only work for hashable objects.

the solution based on itertools.groupby on the other hand works for all comparable objects (e.g.: dictionaries and lists).

mariotomo