ansaurus

Question

In Python, how do I take a list and reduce it to a list of duplicates?

Answer 1

+17 A:

This code should work:

duplicates = set()
found = set()
for item in source:
    if item in found:
        duplicates.add(item)
    else:
        found.add(item)

Staale 2009-03-26 13:06:04

I'll have to make sure this works later but this fits what I was thinking and looks clean.

Eugene M 2009-03-26 13:14:12

Is there any specific advantage to using set() rather than lists? 'in' works for lists too.

Lakshman Prasad 2009-03-26 14:00:27

@becomingGuru, set() will prevent duplicate additions but is not ordered. Advantages/disadvantages depend on your application

jcoon 2009-03-26 14:08:46

@coonj: I think he meant the set for found; it really is not necessary.

SilentGhost 2009-03-26 14:58:48

@becomingGuru: I think 'in' performs better with sets than with lists. I haven't looked too much into the implementation of this, but my own benchmarks verify it.

David Berger 2009-03-26 15:52:15

You will then have duplicates in the duplicates list if you don't use a set.

Staale 2009-03-26 18:07:00

Sets are hash tables I believe, which means inclusion testing is O(1) on average, whereas inclusion with lists is O(n).

John Fouhy 2009-03-26 21:34:13
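As noted in the comments above, a set does not keep the items in order. If the order of the duplicates matters, one possible variation is sketched below. The function name find_duplicates_ordered and the extra reported set are illustrative assumptions, not part of the original answer; sets are still used for the membership tests so each lookup stays O(1) on average.

```python
def find_duplicates_ordered(source):
    """Return each duplicated item once, in the order its first repeat is seen."""
    found = set()      # items seen at least once
    reported = set()   # items already placed in the result
    duplicates = []    # ordered result list
    for item in source:
        if item in found:
            if item not in reported:
                reported.add(item)
                duplicates.append(item)
        else:
            found.add(item)
    return duplicates


print(find_duplicates_ordered(["b", "a", "b", "c", "a", "b"]))  # ['b', 'a']
```

Like the original, this only works for hashable items.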

Answer 2

+5 A:

This will create the list in one line:

L = [1, 2, 3, 3, 4, 4, 4]
L_dup = set([i for i in L if L.count(i) > 1])

jcoon 2009-03-26 13:08:43

Although short, this code has O(N²) performance: for each item in L, a full scan of L is needed to count it. Also only in 2.6.

Staale 2009-03-26 13:10:16

This works in 2.5.

jcoon 2009-03-26 13:11:29

@Staale, Brian: do you know the typical size of an input Eugene M is working with, so that he needs to care about performance?

SilentGhost 2009-03-26 13:14:46

Actually N is around 20-50 short strings if that information helps.

Eugene M 2009-03-26 13:21:50

+1 if it's for 50 strings, performance is not an issue.

e-satis 2009-03-26 13:34:46

Another superfluous use of a list comprehension when a generator expression would do the job.

hop 2009-03-26 23:18:27
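If the quadratic cost of calling count() for every item ever becomes a concern, the counting can be done in a single pass. The sketch below swaps in collections.Counter, which is not what the answer uses and requires Python 2.7 or 3.1 and later, so it is newer than the versions discussed in this thread.

```python
from collections import Counter

L = [1, 2, 3, 3, 4, 4, 4]

# One pass over L builds a mapping of item -> number of occurrences.
counts = Counter(L)

# Keep only the items that occurred more than once.
L_dup = set(item for item, n in counts.items() if n > 1)

print(sorted(L_dup))  # [3, 4]
```

For the 20-50 short strings mentioned above, either version should be fast enough.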

Answer 3

+2 A:

Definitely not the fastest way to do that, but it seems to solve the problem:

>>> lst = [23, 32, 23, None]
>>> set(i for i in lst if lst.count(i) > 1)
{23}

SilentGhost 2009-03-26 13:09:41

I like this approach! For small lists, the difference between O(n) and O(n*n) should not matter.

Lakshman Prasad 2009-03-26 13:57:51

Answer 4

+6 A:

groupby from itertools will probably be useful here:


from itertools import groupby
duplicated = [k for k, g in groupby(sorted(l)) if len(list(g)) > 1]

Basically you use it to find elements that appear more than once...

NB: the call to sorted is needed because groupby only groups consecutive equal elements, so equal items must be adjacent in the input.

John Montgomery 2009-03-26 13:12:03
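A small worked example may make the grouping easier to follow. This is only a sketch: the function name duplicated_sorted is a hypothetical wrapper, and the group size is counted with sum() instead of len(list(g)) simply to avoid building a temporary list.

```python
from itertools import groupby


def duplicated_sorted(items):
    """Return the values that occur more than once, in sorted order."""
    result = []
    # After sorting, equal values sit next to each other, so groupby
    # yields exactly one (key, group) pair per distinct value.
    for key, group in groupby(sorted(items)):
        if sum(1 for _ in group) > 1:
            result.append(key)
    return result


print(duplicated_sorted([3, 1, 3, 2, 1, 3]))  # [1, 3]
```

Because of the sort, this runs in O(n log n) time and requires the items to be orderable.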

Answer 5

+2 A:

If you don't care about the order of the duplicates:

a = [1, 2, 3, 4, 5, 4, 6, 4, 7, 8, 8]
b = sorted(a)
duplicates = set([x for x, y in zip(b[:-1], b[1:]) if x == y])

unbeknown 2009-03-26 13:13:36

Answer 6

A:

EDIT: OK, this doesn't work, since you want only the duplicates.

With Python 2.4 or later:

You have set; just do:

my_filtered_list = list(set(mylist))

A set is a data structure that, by nature, does not contain duplicates.

With older Python versions:

my_filtered_list = list(dict.fromkeys(mylist).keys())

A dictionary maps each unique key to a value. We use this uniqueness of keys to get rid of the duplicates.

e-satis 2009-03-26 13:29:55

Answer 7

A:

Personally, I think this is the simplest way to do it with performance O(n). Similar to vartec's solution but no import required and no Python version dependencies to worry about:

def getDuplicates(iterable):
    d = {}
    for i in iterable:
        d[i] = d.get(i, 0) + 1
    return [i for i in d if d[i] > 1]

mhawke 2009-03-26 23:10:18
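One consequence of counting in a single pass is that the input does not have to be a list; a one-shot iterator such as a generator should work as well, which the count()-based answers cannot handle. The following is a usage sketch only: the function is repeated under the hypothetical name get_duplicates so the example is self-contained, and the sample words are made up.

```python
def get_duplicates(iterable):
    """Count every item in one pass, then keep those seen more than once."""
    counts = {}
    for item in iterable:
        counts[item] = counts.get(item, 0) + 1
    return [item for item in counts if counts[item] > 1]


# A generator can be consumed only once, which is all this function needs.
words = (w.lower() for w in ["Spam", "eggs", "SPAM", "ham", "Eggs"])

# sorted() is used because dict ordering is not guaranteed on old Pythons.
print(sorted(get_duplicates(words)))  # ['eggs', 'spam']
```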

Answer 8

A:

The solutions based on 'set' have a small drawback, namely that they only work for hashable objects.

The solution based on itertools.groupby, on the other hand, works for all objects that can be sorted (e.g. lists, and in Python 2 also dictionaries).

mariotomo 2009-06-09 15:06:01
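To illustrate the difference, here is a minimal sketch using a list of lists, which are unhashable but can be sorted. The sample data in rows is made up for the example.

```python
from itertools import groupby

rows = [[1, 2], [3], [1, 2], [3], [4]]

# Lists are unhashable, so they cannot be placed in a set.
try:
    set(rows)
except TypeError as exc:
    print("set() failed:", exc)

# Sorting only needs the items to be orderable, so groupby still works.
duplicated = [k for k, g in groupby(sorted(rows)) if len(list(g)) > 1]
print(duplicated)  # [[1, 2], [3]]
```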
