views:

179

answers:

4

Hi,

I have two byte[] and I want to find the first occurrence of the second byte[] in the first byte[] (or a range in it).

I don't want to use strings for efficiency (translating the first byte[] to a string will be inefficient).

Basically I believe that's what strstr() does in C.

What is the best way to do that (so it be efficient and easy to use)?

This is how it should look like:

int GetOffsetOfArrayInArray(byte[] bigArray, int bigArrayOffset, int bigArrayCount, byte[] smallArray);

Thanks!

UPDATE:

I want a solution that would be more efficient than a simple search. This means that using the fact that comparing buffers can be more efficient should be used - memcmp() is more efficient than iterating over the bytes.

Also, I know there are algorithms that optimize scenarios like this one:

  • big array: "12312351231234"
  • small array: "1231234"
  • Naive algorithm: 7 compares to find that "1231235" is different than "1231234", 2 compares to find the next "1", 4 compares to find that "1235" is different than "1231", 3 compares to find the next "1", 7 compares to find match. A total of 7+2+4+3+7=23 compares.
  • Smart algorithm: 7 compares to find that "1231235" is different than "1231234", directly jumps to the next "1" (without comparing), 4 compares to find that "1235" is different than "1231", directly jumps beyond the "5", 7 compares to find the match. A total of 7+4+7=18 compares.
+1  A: 
int GetOffsetOfArrayInArray(byte[] bigArray, int bigArrayOffset, 
                               int bigArrayCount, byte[] smallArray)
{
     byte first = smallArray[0];
     bool cont= true;
     while (cont && 
            bigArrayOffset=Array.IndexOf(bigArray, first, bigArrayOffset) != -1)
     {
         if (bigArrayOffset + smallArray.Length > bigArray.Length)
         {
              bigArrayOffset = -1;
              break;
         }
         cont= false;
         for(int i=1; i< smallArray.Length; ++i)
         {
              if (bigArray[bigArrayOffset+i] != smallArray[i])
              { 
                 ++bigArrayOffset;
                 cont = true;
                 break;
              }
         }
     }
     return bigArrayOffset;
}

UPDATED; (hopefully) Fixed problem Henk alerted me to.

UPDATE 2: Addressing update to original question:

int GetOffsetOfArrayInArray(byte[] bigArray, int bigArrayOffset, 
                               int bigArrayCount, byte[] smallArray)
{
     int bigArrayEnd = Math.Min(bigArrayCount + bigArrayOffset, bigArray.Length)
     byte first = smallArray[0];
     bool cont= true;
     while (cont && 
            bigArrayOffset=Array.IndexOf(bigArray, first, bigArrayOffset) != -1)
     {
         int bookmark = bigArrauOffset + 1;
         bool bookmarkset = false;
         if (bigArrayOffset + smallArray.Length > bigArrayEnd )
         {
              bigArrayOffset = -1;
              break;
         }
         cont= false;
         for(int i=1; i< smallArray.Length; ++i)
         {
              if (!bookmarkset && bigArray[bigArrayOffset+i] == first)
              {
                   bookmark = bigArrayOffset+i;
                   bookmarkset = true;
              }
              if (bigArray[bigArrayOffset+i] != smallArray[i])
              { 
                 bigArrayOffset = bookmark;
                 cont = true;
                 break;
              }
         }
     }
     return bigArrayOffset;
}
James Curran
I voted +1 but I think it won't work (the `continue` applies to the `for` loop, you want it for the `while`).
Henk Holterman
I was afraid of that.... Let's see if I can hack a fix....
James Curran
Just a detail: bigArrayCount is not used.
Henk Holterman
Thanks, please look at my update.
brickner
+3  A: 

I don't have any code for you but the name of the fastest solution you will find is the Boyer-Moore algorithm. It can do better than O(n).

Here is an implementation for strings on CodeProject. Looks like a conversion to byte[] should not be too difficult.

Henk Holterman
An implementation of this in .NET would be great. But also the usage of memcmp to do fast compares would help...
brickner
"Better than O(n)" requires some caveats. *If* the lengths of the two strings are related, so that the target string length is proportional to the length of the string being searched, *and* your input is constrained to best cases, *then* it's constant-time. Worst case is O(n): better than the O(n*m) worst case of the naive search.
Steve Jessop
A: 

Here's my take on a solution. It's split in two. The first part mainly searches for a potential start. If it finds one it the compares the list from both ends (to lower the loop count which is basically a micro optimization with out a profiler but usually it's faster)

int GetOffsetOfArrayInArray(byte[] bigArray,
                        int bigArrayOffset,
                        int bigArrayCount,
                        byte[] smallArray)
    {
        var length = smallArray.Length;
        var lastPossibleStart = bigArray.Length - length;
        var startByte = smallArray[0];

        for (var first = bigArrayOffset; first < lastPossibleStart; first++)
        {
           if (bigArray[first] == startByte &&
               check(bigArray, smallArray, first, length))
           {
              return first;
           }
        }
        return -1;
    }

    bool check(byte[] bigArray, byte[] smallArray, int first, int length)
    {
        var smallIndex = 0;
        var smallLast = length - 1;
        var last = first + length - 1;
        for (var i = first; smallIndex <= smallLast; i++)
        {
            if (bigArray[i] != smallArray[smallIndex] ||
                 bigArray[last] != smallArray[smallLast])
            {
                return false;
            }
            smallIndex = i - first + 1;
            last--;
            smallLast--;
        }
        return true;
    }
}
Rune FS
This has multiple compile errors. When they're fixed in what seems like the most likely intended way, it suffers from an array out-of-bounds bug.
Daniel Earwicker
yeah a few compile errors there, nut I must be "code blind" I can't see the a-o-b
Rune FS
How about `smallArray[first - bigArrayOffset]`? For simplicity suppose we're searching the whole big array so `bigArrayOffset` is zero. Hence it's just `smallArray[first]`, where `first` is clearly going to get too big if the short pattern doesn't occur soon enough in the big array.
Daniel Earwicker
A: 

In algorithm theory, it is well known that optimizing for speed costs memory and vice versa. My algorithm uses a bit more memory (not much) but in return only scans the big array once.

public static int GetOffsetOfArrayInArray(byte[] bigArray, int bigArrayOffset, int bigArrayCount, byte[] smallArray)
{
    // TODO: Check whether none of the variables are null or out of range.
    if (smallArray.Length == 0)
        return 0;

    List<int> starts = new List<int>();    // Limited number of elements.

    int offset = bigArrayOffset;
    // A single pass through the big array.
    while (offset < bigArrayOffset + bigArrayCount)
    {
        for (int i = 0; i < starts.Count; i++)
        {
            if (bigArray[offset] != smallArray[offset - starts[i]])
            {
                // Remove starts[i] from the list.
                starts.RemoveAt(i);
                i--;
            }
            else if (offset - starts[i] == smallArray.Length - 1)
            {
                // Found a match.
                return starts[i];
            }
        }
        if (bigArray[offset] == smallArray[0] &&
            offset <= (bigArrayOffset + bigArrayCount - smallArray.Length))
        {
            if (smallArray.Length > 1)
                // Add the start to the list.
                starts.Add(offset);
            else
                // Found a match.
                return offset;
        }
        offset++;
    }
    return -1;
}

The list starts is used to track potential start offsets of smallArray in bigArray. It will never contain more elements than the number of occurrences of smallArray[0] in smallArray (which may be calculated in advance to optimize the list and reduce the number of memory reallocations). When there are not enough bytes left in bigArray to contain smallArray, it is not tried, and when smallArray has been found, the algorithm stops. It also stops when the end of bigArray has been reached. Therefor, the worst case running time would be O(1), and the memory usage would be O(1).

Further possible optimizations include using pointers in unsafe code, and replacing the list by a fixed array whose size can be calculated in advance (as noted before). However, because in the list wrong offsets are removed (smaller inner loop) and in an array wrong offsets have to be skipped (fixed size inner loop but possibly faster element access), you'd have to benchmark which one is faster.

It also matters whether you expect smallArray to be large or not. When you do, you could add a check to the while loop which checks whether starts.Length != 0 || offset <= (bigArrayOffset + bigArrayCount - smallArray.Length). Otherwise the loop may stop and no occurrences have been found.

Virtlink