I need to parse the bytes from a file so that I only take the data after a certain sequence of bytes has been identified. For example, if the sequence is simply 0xFF (one byte), then I can use LINQ on the collection:

byte[] allBytes = new byte[] {0x00, 0xFF, 0x01};
var importantBytes = allBytes.SkipWhile(b => b != 0xFF);
// importantBytes = {0xFF, 0x01}

But is there an elegant way to detect a multi-byte sequence (e.g. 0xFF, 0xFF), ideally one that backtracks correctly when a partial match turns out to be a false positive?

+1  A: 

If you convert your bytes into a string, you can take advantage of the many search functions built into the string class, even if the bytes you're working with aren't actually characters in the traditional sense.

MikeP
Wouldn't you have to worry about what .NET might assume about the encoding, which could give wrong results?
thelsdj
I believe that as long as you're searching for an exact byte sequence, the encoding isn't really going to matter (as long as both the source and the search sequence are in the same encoding). You can use the ASCIIEncoding class to help convert back and forth.
MikeP
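To make the encoding concern concrete, here is a minimal sketch of the string approach, assuming the whole file fits in memory. One caveat with ASCII: in .NET, ASCII decoding replaces every byte above 0x7F with '?', so 0xFE and 0xFF would become indistinguishable. ISO-8859-1 (Latin-1) maps each byte value to exactly one character, so the sketch uses that instead. The class and method names (ByteSearchViaString, IndexOfSequence) are made up for illustration.

```csharp
using System;
using System.Text;

public static class ByteSearchViaString
{
    // Hypothetical helper: returns the index of the first occurrence of
    // pattern in data, or -1 if it does not occur.
    public static int IndexOfSequence(byte[] data, byte[] pattern)
    {
        // Latin-1 maps bytes 0x00-0xFF one-to-one onto chars U+0000-U+00FF,
        // so character positions in the string equal byte positions.
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
        string haystack = latin1.GetString(data);
        string needle = latin1.GetString(pattern);

        // Ordinal comparison compares raw char values; a culture-sensitive
        // search may ignore some characters and report wrong positions.
        return haystack.IndexOf(needle, StringComparison.Ordinal);
    }
}
```

With the index in hand, something like allBytes.Skip(index + pattern.Length) would give the bytes after the marker, or allBytes.Skip(index) if the marker itself should be kept.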
+1  A: 

I'm not aware of any built-in way; as per usual, you can always write your own extension method. Here's one off the top of my head (there may be more efficient ways to implement it):

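// Extension methods must be declared inside a static class.
// Requires System.Collections.Generic and System.Linq.
public static IEnumerable<T> AfterSequence<T>(this IEnumerable<T> source,
    T[] sequence)
{
    bool sequenceFound = false;
    // Sliding window holding the most recent sequence.Length items.
    Queue<T> currentSequence = new Queue<T>(sequence.Length);
    foreach (T item in source)
    {
        if (sequenceFound)
        {
            // Everything after the matched sequence is passed through.
            yield return item;
        }
        else
        {
            currentSequence.Enqueue(item);

            // Not enough items yet to compare against the sequence.
            if (currentSequence.Count < sequence.Length)
                continue;

            // Drop the oldest item so the window stays at sequence.Length.
            if (currentSequence.Count > sequence.Length)
                currentSequence.Dequeue();

            // Compare the whole window to the target on every step.
            if (currentSequence.SequenceEqual(sequence))
                sequenceFound = true;
        }
    }
}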

I'll have to check to make sure that this is correct, but it should give you the basic idea: iterate through the elements, keep a window of the most recent values, set a flag when that window matches the sequence, and once the flag is set, start returning each subsequent element.

Edit - I did run a test, and it does work correctly. Here's some test code:

static void Main(string[] args)
{
    byte[] data = new byte[]
    {
        0x01, 0x02, 0x03, 0x04, 0x05,
        0xFF, 0xFE, 0xFD, 0xFC, 0xFB, 0xFA
    };
    byte[] sequence = new byte[] { 0x02, 0x03, 0x04, 0x05 };
    // Prints 255, 254, 253, 252, 251, 250 (one value per line).
    foreach (byte b in data.AfterSequence(sequence))
    {
        Console.WriteLine(b);
    }
    Console.ReadLine();
}
Aaronaught
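Since the answer above mentions that more efficient implementations may exist, here is one possible variant, sketched with the Knuth-Morris-Pratt (KMP) technique rather than a sliding window. It is a different algorithm from the one in the answer: instead of re-comparing the whole window for every element, it precomputes a failure table that says how far to fall back after a partial match fails, which is the "backtracking on a false positive" the question asks about. The names AfterSequenceKmp and SequenceSearchExtensions are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SequenceSearchExtensions
{
    // Hypothetical KMP-based variant: yields the elements that follow the
    // first occurrence of pattern in source (nothing if it never occurs).
    public static IEnumerable<T> AfterSequenceKmp<T>(
        this IEnumerable<T> source, T[] pattern)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));
        if (pattern == null) throw new ArgumentNullException(nameof(pattern));
        return Iterate(source, pattern);
    }

    private static IEnumerable<T> Iterate<T>(IEnumerable<T> source, T[] pattern)
    {
        EqualityComparer<T> comparer = EqualityComparer<T>.Default;

        // failure[i] = length of the longest proper prefix of pattern[0..i]
        // that is also a suffix of pattern[0..i].
        int[] failure = new int[pattern.Length];
        for (int i = 1, k = 0; i < pattern.Length; i++)
        {
            while (k > 0 && !comparer.Equals(pattern[i], pattern[k]))
                k = failure[k - 1];
            if (comparer.Equals(pattern[i], pattern[k]))
                k++;
            failure[i] = k;
        }

        int matched = 0;                  // pattern elements matched so far
        bool found = pattern.Length == 0; // an empty pattern matches at once

        foreach (T item in source)
        {
            if (found)
            {
                yield return item;
                continue;
            }

            // On a mismatch, fall back to the longest shorter partial match
            // instead of restarting from scratch.
            while (matched > 0 && !comparer.Equals(item, pattern[matched]))
                matched = failure[matched - 1];
            if (comparer.Equals(item, pattern[matched]))
                matched++;
            if (matched == pattern.Length)
                found = true;
        }
    }
}
```

For example, searching { 1, 1, 1, 2, 9 } for { 1, 1, 2 } only succeeds if the search falls back by one element after the third 1, which the failure table handles. Each input element is read once, so this also works on streams that cannot be rewound.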
A: 

Just as a bit of theory: this is a regular language problem, so you may be able to use a regular expression engine to detect it. The first Google hit for "regular expression on stream" was:

http://codeguru.earthweb.com/columns/experts/article.php/c14689

Steve Cooper
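As a rough illustration of the regular expression idea, the sketch below runs the .NET Regex class over an in-memory array rather than a stream, so it is not the technique from the linked article. It assumes the bytes are decoded as ISO-8859-1 so that each byte becomes exactly one character, and it builds the pattern from \xNN escapes. The names ByteRegexSearch and IndexAfterMatch are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

public static class ByteRegexSearch
{
    // Hypothetical helper: returns the index of the first byte after the
    // first occurrence of pattern in data, or -1 if there is no match.
    public static int IndexAfterMatch(byte[] data, byte[] pattern)
    {
        // One char per byte, so match positions are byte positions.
        string haystack = Encoding.GetEncoding("ISO-8859-1").GetString(data);

        // Express each pattern byte as a two-digit \xNN escape, e.g. \xFF\xFF.
        string regex = string.Concat(pattern.Select(b => @"\x" + b.ToString("X2")));

        Match match = Regex.Match(haystack, regex);
        return match.Success ? match.Index + match.Length : -1;
    }
}
```

A plain literal like this gains little over a direct search; the regex route would mainly pay off if the marker had structure, such as a range of allowed bytes in one position.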