views:

261

answers:

9

How can I test if a list contains another list. Say there was a function called contains:

contains([1,2], [-1, 0, 1, 2]) # Returns [2, 3] (conatins returns [start, end])
contains([1,3], [-1, 0, 1, 2]) # Returns False
contains([1, 2], [[1, 2], 3) # Returns False
contains([[1, 2]], [[1, 2], 3]) # Returns [0, 0]

Edit:

contains([2, 1], [-1, 0, 1, 2]) # Returns False
contains([-1, 1, 2], [-1, 0, 1, 2]) # Returns False
contains([0, 1, 2], [-1, 0, 1, 2]) # Returns [1, 3]
A: 

[Note this answer doesn't work. My bad.]

If all items are unique, you can use sets.

>>> items = set([-1, 0, 1, 2])
>>> items.issubset(set([1, 2]))
True
>>> items.issubset(set([1, 3]))
False
Thomas O
that's not what op is looking for
SilentGhost
It would work for unique lists. In fact it would work for non-unique items, but it wouldn't be able to determine between the individual items. (so you couldn't compare between [1, 1, 2] and [1, 2].)
Thomas O
Okay, that's why I used the qualifier "if all items are unique." I didn't realise, as your example didn't make it clear that you needed distinction between identical items.
Thomas O
It still won't work, even if all items are unique, unless they are also in the same order, with no intermingled items. e.g. finding `[1, 3]` or `[2, 1]` in `[1, 2, 3]` will give a false positive. Assuming that we're looking for the sequence itself rather than just the values contained in the sequence.
intuited
Oh. Thanks for pointing that out. eumiro's answer looks better then.
Thomas O
@Thomas O When you post an answer that is a) wrong, b) not what OP is looking for or C) all of the above, there is a 'delete' link at the bottom left of your post. feel free to click it and 1) declutter the thread, 2) avoid exposing yourself to further -1's.
aaronasterling
I think it's better to show how an idea can be wrong (so as to avoid it in future), rather than erasing it entirely.
Thomas O
+3  A: 

After OP's edit:

def contains(small, big):
    for i in xrange(1 + len(big) - len(small)):
        if small == big[i:i+len(small)]:
            return i, i + len(small) - 1
    return False
eumiro
But it fails with contains([1,2], [-1, 0, 1, 1, 2]) which returns [2,4] instead of what I assume is the expected [3,4]
Andrew Dalke
Now it works with all OP's tests.
eumiro
This is going to be horribly inefficient for big lists, since it is constantly creating and destroying temporary lists every time it does `big[i:i+len(small)]`
Dave Kirby
@Dave: did you timeit this against your solution?
SilentGhost
According to my tests, this has slightly better performance than Dave Kirby's solution, even on large lists (1 million elements, with the matching subset at the end): 4.1s for 10 repetitions versus 5.6s for Dave's. I would love to post my test code, but there isn't an easy way to do that.
Tim Yates
@Tim Yates: Could you post your test code in a [gist](gist.github.com) or other pastebin-type deal? Maybe [codepad](codepad.org) would be a better option, though I don't know if running it on that site would be a) possible; b) reliable. It seems to be borken at the moment anyway.
intuited
**UPDATE**: I spoke too soon--my small lists were too small. This algorithm exploded once I increased their size to 1000, while the others stayed constant. It looks like Dave Kirby's wins for large lists after all. http://pastebin.com/NZwU6PUx
Tim Yates
This should return a tuple, not a list. A list is completely inappropriate for this usage. Corrected in the original post.
chpwn
+2  A: 

Here is my version:

def contains(small, big):
    for i in xrange(len(big)-len(small)+1):
        for j in xrange(len(small)):
            if big[i+j] != small[j]:
                break
        else:
            return i, i+len(small)
    return False

It returns a tuple of (start, end+1) since I think that is more pythonic, as Andrew Jaffe points out in his comment. It does not slice any sublists so should be reasonably efficient.

One point of interest for newbies is that it uses the else clause on the for statement - this is not something I use very often but can be invaluable in situations like this.

This is identical to finding substrings in a string, so for large lists it may be more efficient to implement something like the Boyer-Moore algorithm.

Dave Kirby
+1 for the note about efficient string searching algorithms. One disadvantage of yours is the addition of an interpreted inner loop (the slice comparison is, I imagine, faster, although the copy might offset that). I'm going to try a performance comparison.
Tim Yates
After further tests, yours *is* the best so far for large subsequences. I would pick this, even despite the small disadvantage is has on smaller data sets.
Tim Yates
A: 

This works and is fairly fast since it does the linear searching using the builtin list.index() method and == operator:

def contains(sub, pri):
    M, N = len(pri), len(sub)
    i, LAST = 0, M-N+1
    while True:
        try:
            found = pri.index(sub[0], i, LAST) # find first elem in sub
        except ValueError:
            return False
        if pri[found:found+N] == sub:
            return [found, found+N-1]
        else:
            i = found+1
martineau
A: 

I tried to make this as efficient as possible.

It uses a generator; those unfamiliar with these beasts are advised to check out their documentation and that of yield expressions.

Basically it creates a generator of values from the subsequence that can be reset by sending it a true value. If the generator is reset, it starts yielding again from the beginning of sub.

Then it just compares successive values of sequence with the generator yields, resetting the generator if they don't match.

When the generator runs out of values, i.e. reaches the end of sub without being reset, that means that we've found our match.

Since it works for any sequence, you can even use it on strings, in which case it behaves similarly to str.find, except that it returns False instead of -1.

As a further note: I think that the second value of the returned tuple should, in keeping with Python standards, normally be one higher. i.e. "string"[0:2] == "st". But the spec says otherwise, so that's how this works.

It depends on if this is meant to be a general-purpose routine or if it's implementing some specific goal; in the latter case it might be better to implement a general-purpose routine and then wrap it in a function which twiddles the return value to suit the spec.

def reiterator(sub):
    """Yield elements of a sequence, resetting if sent ``True``."""
    it = iter(sub)
    while True:
        if (yield it.next()):
            it = iter(sub)

def find_in_sequence(sub, sequence):
    """Find a subsequence in a sequence.

    >>> find_in_sequence([2, 1], [-1, 0, 1, 2])
    False
    >>> find_in_sequence([-1, 1, 2], [-1, 0, 1, 2])
    False
    >>> find_in_sequence([0, 1, 2], [-1, 0, 1, 2])
    (1, 3)
    >>> find_in_sequence("subsequence",
    ...                  "This sequence contains a subsequence.")
    (25, 35)
    >>> find_in_sequence("subsequence", "This one doesn't.")
    False

    """
    start = None
    sub_items = reiterator(sub)
    sub_item = sub_items.next()
    for index, item in enumerate(sequence):
        if item == sub_item:
            if start is None: start = index
        else:
            start = None
        try:
            sub_item = sub_items.send(start is None)
        except StopIteration:
            # If the subsequence is depleted, we win!
            return (start, index)
    return False
intuited
A valiant effort, but this has worse performance than either eumiro or Dave Kirby's solutions. 8.2s on the benchmark I described in the other comments.
Tim Yates
Interesting to see the speed difference for native code. It seems like this algorithm would be relatively faster for longer subsequences.. how long was/were the subsequence(s) you used in the test?
intuited
You're right. This performed much better than eumiro's solution with larger subsequences, but it still didn't perform as well as Dave's.
Tim Yates
A: 

Here's a straightforward algorithm that uses list methods:

#!/usr/bin/env python

def list_find(what, where):
    """Find `what` list in the `where` list.

    Return index in `where` where `what` starts
    or -1 if no such index.

    >>> f = list_find
    >>> f([2, 1], [-1, 0, 1, 2])
    -1
    >>> f([-1, 1, 2], [-1, 0, 1, 2])
    -1
    >>> f([0, 1, 2], [-1, 0, 1, 2])
    1
    >>> f([1,2], [-1, 0, 1, 2])
    2
    >>> f([1,3], [-1, 0, 1, 2])
    -1
    >>> f([1, 2], [[1, 2], 3])
    -1
    >>> f([[1, 2]], [[1, 2], 3])
    0
    """
    if not what: # empty list is always found
        return 0
    try:
        index = 0
        while True:
            index = where.index(what[0], index)
            if where[index:index+len(what)] == what:
                return index # found
            index += 1 # try next position
    except ValueError:
        return -1 # not found

def contains(what, where):
    """Return [start, end+1] if found else empty list."""
    i = list_find(what, where)
    return [i, i + len(what)] if i >= 0 else [] #NOTE: bool([]) == False

if __name__=="__main__":
    import doctest; doctest.testmod()
J.F. Sebastian
A: 
inspectorG4dget
A: 

I think this one is fast...

def issublist(subList, myList, start=0):
    if not subList: return 0
    lenList, lensubList = len(myList), len(subList)
    try:
        while lenList - start >= lensubList:
            start = myList.index(subList[0], start)
            for i in xrange(lensubList):
                if myList[start+i] != subList[i]:
                    break
            else:
                return start, start + lensubList
            start += 1
        return False
    except:
        return False
A: 

May I humbly suggest the Rabin-Karp algorithm if the big list is really big. The link even contains almost-usable code in almost-Python.

9000