views:

469

answers:

6

This is a generalization of the "string contains substring" problem to (more) arbitrary types.

Given an sequence (such as a list or tuple), what's the best way of determining whether another sequence is inside it? As a bonus, it should return the index of the element where the subsequence starts:

Example usage (Sequence in Sequence):

>>> seq_in_seq([5,6],  [4,'a',3,5,6])
3
>>> seq_in_seq([5,7],  [4,'a',3,5,6])
-1 # or None, or whatever

So far, I just rely on brute force and it seems slow, ugly, and clumsy.

+1  A: 

Brute force may be fine for small patterns.

For larger ones, look at the Aho-Corasick algorithm.

Doug Currie
Aho-Corasick would be great. I'm specifically looking for python, or pythonish solutions... so if there were an implementation, that would be great. I'll poke around.
Gregg Lind
+5  A: 

Same thing as string matching sir...Knuth-Morris-Pratt string matching

nlucaroni
+1  A: 
>>> def seq_in_seq(subseq, seq):
...     while subseq[0] in seq:
...         index = seq.index(subseq[0])
...         if subseq == seq[index:index + len(subseq)]:
...             return index
...         else:
...             seq = seq[index + 1:]
...     else:
...         return -1
... 
>>> seq_in_seq([5,6], [4,'a',3,5,6])
3
>>> seq_in_seq([5,7], [4,'a',3,5,6])
-1

Sorry I'm not an algorithm expert, it's just the fastest thing my mind can think about at the moment, at least I think it looks nice (to me) and I had fun coding it. ;-)

Most probably it's the same thing your brute force approach is doing.

mcella
It is nice an clean, but brute-forcy --> O(mn)
Gregg Lind
+8  A: 

I second the Knuth-Morris-Pratt algorithm. By the way, your problem (and the KMP solution) is exactly recipe 5.13 in Python Cookbook 2nd edition. You can find the related code at http://code.activestate.com/recipes/117214/

It finds all the correct subsequences in a given sequence, and should be used as an iterator:

>>> for s in KnuthMorrisPratt([4,'a',3,5,6], [5,6]): print s
3
>>> for s in KnuthMorrisPratt([4,'a',3,5,6], [5,7]): print s
(nothing)
Federico Ramponi
Note that the KMP implementation given on code.activestate was demostrably slower by 30-500 times for some (perhaps unrepresentative input). Benchmarking to see if dumb built-in methods outperform seems to be a good idea!
Alabaster Codify
KMP is known to be about twice as slow as the naive algorithm in practice. Hence, for most purposes it’s *completely* inappropriate, despite its good asymptotic worst-case runtime.
Konrad Rudolph
A: 

Here is another KMP implementation:

def seq_in_seq(seq1,seq2):
    '''
    Return the index where seq1 appears in seq2, or -1 if 
    seq1 is not in seq2, using the Knuth-Morris-Pratt algorithm

    based heavily on code by Neale Pickett <[email protected]>
    found at:  woozle.org/~neale/src/python/kmp.py

    >>> seq_in_seq(range(3),range(5))
    0
    >>> seq_in_seq(range(3)[-1:],range(5))
    2
    >>>seq_in_seq(range(6),range(5))
    -1
    '''
    def compute_prefix_function(p):
        m = len(p)
        pi = [0] * m
        k = 0
        for q in xrange(1, m):
            while k > 0 and p[k] != p[q]:
                k = pi[k - 1]
            if p[k] == p[q]:
                k = k + 1
            pi[q] = k
        return pi

    t,p = list(tee(seq2)[0]), list(tee(seq1)[0])
    m,n = len(p),len(t)
    pi = compute_prefix_function(p)
    q = 0
    for i in range(n):
        while q > 0 and p[q] != t[i]:
            q = pi[q - 1]
        if p[q] == t[i]:
            q = q + 1
        if q == m:
            return i - m + 1
    return -1
Gregg Lind
+2  A: 

Here's a brute-force approach O(n*m) (similar to @mcella's answer). It might be faster then the Knuth-Morris-Pratt algorithm implementation in pure Python O(n+m) (see @Gregg Lind answer) for small input sequences.

#!/usr/bin/env python
def index(subseq, seq):
    """Return an index of `subseq`uence in the `seq`uence.

    Or `-1` if `subseq` is not a subsequence of the `seq`.

    The time complexity of the algorithm is O(n*m), where

        n, m = len(seq), len(subseq)

    >>> index([1,2], range(5))
    1
    >>> index(range(1, 6), range(5))
    -1
    >>> index(range(5), range(5))
    0
    >>> index([1,2], [0, 1, 0, 1, 2])
    3
    """
    i, n, m = -1, len(seq), len(subseq)
    try:
        while True:
            i = seq.index(subseq[0], i + 1, n - m + 1)
            if subseq == seq[i:i + m]:
               return i
    except ValueError:
        return -1

if __name__ == '__main__':
    import doctest; doctest.testmod()

I wonder how large is the small in this case?

J.F. Sebastian