tags:

views:

78

answers:

3

What is the most efficient way to find a sequence within a IEnumerable using Linq

I want to be able to create an extension method which allows the following call:

int startIndex = largeSequence.FindSequence(subSequence)

The match must be adjacent and in order.

A: 

UPDATE: Given the clarification of the question my response below isn't as applicable. Leaving it for historical purposes.

You probably want to use mySequence.Where(). Then the key is to optimize the predicate to work well in your environment. This can vary quite a bit depending on your requirements and typical usage patterns.

It is quite possible that what works well for small collections doesn't scale well for much larger collections depending on what type T is.

Of course, if the 90% use is for small collections then optimizing for the outlier large collection seems a bit YAGNI.

Foovanadil
I would like to see how you would use where, which take only a single element to check for a sequence match. Can you provide an example?
trampster
You're right, after re-reading your question and seeing the update my response isn't applicable.
Foovanadil
+2  A: 

The code you say you want to be able to use isn't LINQ, so I don't see why it need be implemented with LINQ.

This is essentially the same problem as substring searching (indeed, an enumeration where order is significant is a generalisation of "string").

Since computer science has considered this problem frequently for a long time, so you get to stand on the shoulders of giants.

Some reasonable starting points are:

http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm

http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm

http://en.wikipedia.org/wiki/Rabin-karp

Even just the pseudocode in the wikipedia articles is enough to port to C# quite easily. Look at the descriptions of performance in different cases and decide which cases are most likely to be encountered by your code.

Jon Hanna
+1  A: 

Here's an implementation of an algorithm that finds a subsequence in a sequence. I called the method IndexOfSequence, because it makes the intent more explicit and is similar to the existing IndexOf method:

public static class ExtensionMethods
{
    public static int IndexOfSequence<T>(this IEnumerable<T> source, IEnumerable<T> sequence)
    {
        return source.IndexOfSequence(sequence, EqualityComparer<T>.Default);
    }

    public static int IndexOfSequence<T>(this IEnumerable<T> source, IEnumerable<T> sequence, IEqualityComparer<T> comparer)
    {
        var seq = sequence.ToArray();

        int p = 0; // current position in source sequence
        int i = 0; // current position in searched sequence
        var prospects = new List<int>(); // list of prospective matches
        foreach (var item in source)
        {
            // Remove bad prospective matches
            prospects.RemoveAll(k => !comparer.Equals(item, seq[p - k]));

            // Is it the start of a prospective match ?
            if (comparer.Equals(item, seq[0]))
            {
                prospects.Add(p);
            }

            // Does current character continues partial match ?
            if (comparer.Equals(item, seq[i]))
            {
                i++;
                // Do we have a complete match ?
                if (i == seq.Length)
                {
                    // Bingo !
                    return p - seq.Length + 1;
                }
            }
            else // Mismatch
            {
                // Do we have prospective matches to fall back to ?
                if (prospects.Count > 0)
                {
                    // Yes, use the first one
                    int k = prospects[0];
                    i = p - k + 1;
                }
                else
                {
                    // No, start from beginning of searched sequence
                    i = 0;
                }
            }
            p++;
        }
        // No match
        return -1;
    }
}

I didn't fully test it, so it might still contain bugs. I just did a few tests on well-known corner cases to make sure I wasn't falling into obvious traps. Seems to work fine so far...

I think the complexity is close to O(n), but I'm not an expert of Big O notation so I could be wrong... at least it only enumerates the source sequence once, whithout ever going back, so it should be reasonably efficient.

Thomas Levesque