views:

140

answers:

4

I want to check if any of the items in one list are present in another list. I can do it simply with the code below, but I suspect there might be a library function to do this. If not, is there a more pythonic method of achieving the same result.

In [78]: a = [1, 2, 3, 4, 5]

In [79]: b = [8, 7, 6]

In [80]: c = [8, 7, 6, 5]

In [81]: def lists_overlap(a, b):
   ....:     for i in a:
   ....:         if i in b:
   ....:             return True
   ....:     return False
   ....: 

In [82]: lists_overlap(a, b)
Out[82]: False

In [83]: lists_overlap(a, c)
Out[83]: True

In [84]: def lists_overlap2(a, b):
   ....:     return len(set(a).intersection(set(b))) > 0
   ....: 
+12  A: 
def lists_overlap3(a, b):
    return bool(set(a) & set(b))

Note: the above assumes that you want a boolean as the answer. If all you need is an expression to use in an if statement, just use if set(a) & set(b):

John Machin
This is worst-case O(n + m). However, the down side is that it creates a new set, and doesn't bail out when a common element is found early.
Matthew Flaschen
I curious as to why this is `O(n + m)`. My guess would be that sets are implemented using hash-tables, and thus the `in` operator can work in `O(1)` time (except in degenerate cases). Is this correct? If so, given that hash tables have worst case lookup performance of `O(n)`, does this mean that in the unlike worse case it will have `O(n * m)` performance?
fmark
@fmark: Theoretically, you are right. Practically, nobody cares; read the comments in Objects/dictobject.c in the CPython source (sets are just dicts with only keys, no values) and see if you can generate a list of keys that will cause O(n) lookup performance.
John Machin
Okay, thanks for clarifying, I was wondering if there was some magic going on :) . While I agree that practically I don't need to care, it is trivial to generate a list of keys that will cause `O(n)` lookup performance ;), see http://pastebin.com/Kn3kAW7u Just for lafs.
fmark
@fmark: No magic, just engineering. Yeah it's trivial to define a hash function that returns a constant. Ha ha chuckle chuckle. Now try again, without sodding about with a non-standard hash function.
John Machin
Yeah, I know. Plus I just read the source you pointed me to, which documents even more magic in the case of non-randomish hash functions (such as the built-in one). I assumed it required randomishness, like the Java one, which results in monstrousities like this http://stackoverflow.com/questions/2634690/good-hash-function-for-a-2d-index . I need to keep reminding myself that **Python Is Not Java** (thank deity!).
fmark
A: 

You could also use any with list comprehension:

any([item in a for item in b])
Ioachim
You could, but the time is O(n * m) whereas the time for the set intersection approach is O(n + m). You could also do it WITHOUT list comprehension (lose the `[]`) and it would run faster and use less memory, but the time would still be O(n * m).
John Machin
While your big O analysis is true, I suspect that for small values of n and m the time it takes to build the underlying hashtables will come into play. Big O ignores the time it takes to compute the hashes.
Anthony Conyers
Building a "hashtable" is amortised O(n).
John Machin
I get that but the constant you're throwing away is pretty big. It doesn't matter for large values of n, but it does for small ones.
Anthony Conyers
A: 

You can use the any built in function /w a generator expression:

def list_overlap(a,b): 
     return any(i for i in a if i in b)

As John and Lie have pointed out this gives incorrect results when for every i shared by the two lists bool(i) == False. It should be:

return any(i in b for i in a)
Anthony Conyers
Time is O(n * m).
John Machin
give the wrong result when a = {0, 1, 2} and b = {0, 3, 4}
Lie Ryan
Amplifying Lie Ryan's comment: will give wrong result for any item x that's in the intersection where `bool(x)` is False. In Lie Ryan's example, x is 0. Only fix is `any(True for i in a if i in b)` which is better written as the already seen `any(i in b for i in a)`.
John Machin
Correction: will give wrong result when *all* items `x` in the intersection are such that `bool(x)` is `False`.
John Machin
+6  A: 
def lists_overlap(a, b):
  sb = set(b)
  return any(el in sb for el in a)

This is asymptotically optimal (worst case O(n + m)), and might be better than the intersection approach due to any's short-circuiting.

E.g.:

lists_overlap([3,4,5], [1,2,3])

will return True as soon as it gets to 3 in sb

EDIT: Another variation (with thanks to Dave Kirby):

def lists_overlap(a, b):
  sb = set(b)
  return any(itertools.imap(sb.__contains__, a))

This relies on imap's iterator, which is implemented in C, rather than a generator comprehension. It also uses sb.__contains__ as the mapping function. I don't know how much performance difference this makes. It will still short-circuit.

Matthew Flaschen
Loops in intersection approach are all in C code; there's one loop in your approach that includes Python code. The big unknown is whether an empty intersection is likely or unlikely.
John Machin
You could also use `any(itertools.imap(sb.__contains__, a))` which should be faster still since it avoids using a lambda function.
Dave Kirby
Thanks, @Dave. :) I agree removing the lambda is a win.
Matthew Flaschen