views: 479

answers: 4
I have two very large lists, and looping through one of them once takes at least a second. I need to do this 200,000 times. What's the fastest way to remove duplicates across two lists to form one?

+9  A: 

This is the fastest way I can think of:

import itertools
output_list = list(set(itertools.chain(first_list, second_list)))

Slight update: As jcd points out, depending on your application, you probably don't need to convert the result back to a list. Since a set is iterable by itself, you might be able to just use it directly:

output_set = set(itertools.chain(first_list, second_list))
for item in output_set:
    # do something

Beware, though, that any solution involving `set()` will probably reorder the elements, so there's no guarantee the result will be in any particular order. That said, since you're combining two lists, it's hard to come up with a good reason why you would need a particular ordering over them anyway, so this is probably not something you need to worry about.

Daniel Pryden
Oh, your solution is better than mine :)
shylent
Thanks for everyone's answers, they've all helped a lot! :)
Cookies
+1. If order *is* important, then perhaps an ordered set will do: http://stackoverflow.com/questions/1653970/does-python-have-an-ordered-set
Stephan202
@Stephan202: Good point. However, I would think switching to an ordered set (with any implementation, but especially a pure Python one) would be significantly slower than an unordered (hashtable-based) set. Depending on the size of the list and how well the hash function performs on the items, it might be faster to just use a normal `set` and then sort the items afterward.
Daniel Pryden
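To illustrate the sort-afterward idea from the comment above, here is a minimal sketch (the sample lists are made up for demonstration):

```python
import itertools

first_list = [3, 1, 2, 3]
second_list = [2, 4, 1]

# Deduplicate with a plain (unordered) set, then impose a
# deterministic order by sorting afterward.
merged = sorted(set(itertools.chain(first_list, second_list)))
print(merged)  # [1, 2, 3, 4]
```

This gives sorted order, not the original list order; if the original order matters, a different approach is needed.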
+2  A: 
result = list(set(list1).union(set(list2)))

That's how I'd do it. I'm not sure about its performance, though, but it's certainly better than doing it by hand.

shylent
`set.union(self, other)` is fine with any iterable as `other`
kaizer.se
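To illustrate kaizer.se's point: since `set.union` accepts any iterable as its argument, the second `set()` call in the answer above is unnecessary. A small sketch (sample lists invented for demonstration):

```python
list1 = [1, 2, 3]
list2 = [3, 4, 5]

# set.union accepts any iterable, so wrapping list2 in set() first
# would only build a throwaway set.
result = list(set(list1).union(list2))
print(sorted(result))  # [1, 2, 3, 4, 5]
```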
+5  A: 

As Daniel states, a set cannot contain duplicate entries. So, first concatenate the lists:

list1 + list2

Then convert the new list to a set:

set(list1 + list2)

Then back to a list:

list(set(list1 + list2))
Rob Golding
Thanks for explaining what my code was doing. Beat me to it! :-) I'd just mention that the reason I edited my answer to use `itertools.chain()` instead of just concatenating the lists is because it avoids having to allocate a third large list in memory. The `set()` constructor doesn't actually need a list, it just needs an iterable that can iterate over all the elements, and `itertools.chain()` does that more efficiently (by avoiding copying).
Daniel Pryden
+7  A: 

I'd recommend something like this:

def combine_lists(list1, list2):
    s = set(list1)
    s.update(list2)
    return list(s)

This avoids creating a monster intermediate list from concatenating the first two.

Depending on what you're doing with the output, you may not need to convert back to a list at all. If ordering is important, you might need some sort of decorate/sort/undecorate shenanigans around this.

jcdyer
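If the original ordering does matter, one order-preserving alternative (not suggested in the answers above, offered here only as a sketch) relies on `dict` keys being unique and, in Python 3.7+, preserving insertion order:

```python
import itertools

def combine_preserving_order(list1, list2):
    # dict keys are unique and preserve insertion order (Python 3.7+),
    # so this deduplicates while keeping first-seen order.
    return list(dict.fromkeys(itertools.chain(list1, list2)))

print(combine_preserving_order([3, 1, 2], [2, 4, 1]))  # [3, 1, 2, 4]
```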
Agreed, there's no need to concatenate the two lists -- that just wastes memory. I'd be interested to see the performance difference between calling `s.update(list2)` versus the iterator approach I used above. Your approach may be slightly faster. However, as you point out, you get a much bigger performance savings by simply not converting back to a list at the end.
Daniel Pryden
I ran a few timeits, and which one is faster seems to vary, but never by more than 5% or 10% either way. I'd call it a draw.
jcdyer
Given that itertools is only chaining two objects, I'd say its impact is pretty minimal, so the question is whether there's a significant difference between set()ing one big list, or set()ing half that list and .update()ing the rest of it. Looks like there isn't.
jcdyer
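The timings discussed in these comments can be reproduced with something like the following sketch (list sizes and iteration counts are arbitrary; results will vary by machine and data):

```python
import timeit

setup = """
import itertools
list1 = list(range(100000))
list2 = list(range(50000, 150000))
"""

# Build one set from a chained iterator vs. set() plus update().
t_chain = timeit.timeit(
    "set(itertools.chain(list1, list2))", setup=setup, number=20)
t_update = timeit.timeit(
    "s = set(list1); s.update(list2)", setup=setup, number=20)
print(t_chain, t_update)
```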