views:

3170

answers:

15

Hi, Anyone know a good and effective way to search/match for a byte pattern in an byte[] array and then return the positions.

e.g.

byte[] pattern = new byte[] {12,3,5,76,8,0,6,125};

byte[] toBeSearched = new byte[] {23,36,43,76,125,56,34,234,12,3,5,76,8,0,6,125,234,56,211,122,22,4,7,89,76,64,12,3,5,76,8,0,6,125}

A: 

You can put the byte array into String and run match by IndexOf. Or you can at least reuse existing algorithms on string matching.

 [STAThread]
 static void Main(string[] args)
 {
  byte[] pattern = new byte[] {12,3,5,76,8,0,6,125};
  byte[] toBeSearched = new byte[] {23,36,43,76,125,56,34,234,12,3,5,76,8,0,6,125,234,56,211,122,22,4,7,89,76,64,12,3,5,76,8,0,6,125};
  string needle, haystack;

  unsafe 
  {
   fixed(byte * p = pattern) {
    needle = new string((SByte *) p, 0, pattern.Length);
   } // fixed

   fixed (byte * p2 = toBeSearched) 
   {
    haystack = new string((SByte *) p2, 0, toBeSearched.Length);
   } // fixed

   int i = haystack.IndexOf(needle, 0);
   System.Console.Out.WriteLine(i);
  }
 }
eed3si9n
Have you tried it? Let's see some code!
Mitch Wheat
hmmm, not sure about the need for unsafe! seems a bit drastic...
Mitch Wheat
your code only dinds the first occurrence, but the question implies all matches...
Mitch Wheat
I'm just glad it works. If ASCII covers the whole 8 bit, your code is cleaner.
eed3si9n
No, ASCII does not cover the whole 8-bit, it is 7-bit.
Constantin
@Constantin: thx, I've updated to use UTF8Encoding
Mitch Wheat
A: 

toBeSearched.Except(pattern) will return you differences toBeSearched.Intersect(pattern) will produce set of intersections Generally, you should look into extended methods within Linq extensions

Tamir
A: 

My solution:

class Program
{
    public static void Main()
    {
        byte[] pattern = new byte[] {12,3,5,76,8,0,6,125};

        byte[] toBeSearched = new byte[] { 23, 36, 43, 76, 125, 56, 34, 234, 12, 3, 5, 76, 8, 0, 6, 125, 234, 56, 211, 122, 22, 4, 7, 89, 76, 64, 12, 3, 5, 76, 8, 0, 6, 125};

        List<int> positions = SearchBytePattern(pattern, toBeSearched);

        foreach (var item in positions)
        {
            Console.WriteLine("Pattern matched at pos {0}", item);
        }

    }

    static public List<int> SearchBytePattern(byte[] pattern, byte[] bytes)
    {
        List<int> positions = new List<int>();
        int patternLength = pattern.Length;
        int totalLength = bytes.Length;
        byte firstMatchByte = pattern[0];
        for (int i = 0; i < totalLength; i++)
        {
            if (firstMatchByte == bytes[i] && totalLength - i >= patternLength)
            {
                byte[] match = new byte[patternLength];
                Array.Copy(bytes, i, match, 0, patternLength);
                if (match.SequenceEqual<byte>(pattern))
                {
                    positions.Add(i);
                    i += patternLength - 1;
                }
            }
        }
        return positions;
    }
}
bruno conde
why the array.copy? just gets slower this way.. I'm guessing it's just because you want to use the SequenceEqual, but this might be a little bit to much work just because you want to use a extension method. The "i += patternLength - 1;" part is nice!
Davy Landman
You should not give -1 to everybody just because the solution isn't perfect ... In this situation you should vote just on the solution that you think is the best.
bruno conde
+4  A: 

May I suggest something that doesn't involve creating strings, copying arrays or unsafe code:

using System;
using System.Collections.Generic;

static class ByteArrayRocks {

    static readonly int [] Empty = new int [0];

    public static int [] Locate (this byte [] self, byte [] candidate)
    {
     if (IsEmptyLocate (self, candidate))
      return Empty;

     var list = new List<int> ();

     for (int i = 0; i < self.Length; i++) {
      if (!IsMatch (self, i, candidate))
       continue;

      list.Add (i);
     }

     return list.Count == 0 ? Empty : list.ToArray ();
    }

    static bool IsMatch (byte [] array, int position, byte [] candidate)
    {
     if (candidate.Length > (array.Length - position))
      return false;

     for (int i = 0; i < candidate.Length; i++)
      if (array [position + i] != candidate [i])
       return false;

     return true;
    }

    static bool IsEmptyLocate (byte [] array, byte [] candidate)
    {
     return array == null
      || candidate == null
      || array.Length == 0
      || candidate.Length == 0
      || candidate.Length > array.Length;
    }

    static void Main ()
    {
     var data = new byte [] { 23, 36, 43, 76, 125, 56, 34, 234, 12, 3, 5, 76, 8, 0, 6, 125, 234, 56, 211, 122, 22, 4, 7, 89, 76, 64, 12, 3, 5, 76, 8, 0, 6, 125 };
     var pattern = new byte [] { 12, 3, 5, 76, 8, 0, 6, 125 };

     foreach (var position in data.Locate (pattern))
      Console.WriteLine (position);
    }
}
Jb Evain
A: 

Here's my (not the most performant) solution. It relies on the fact that bytes/latin-1 conversion is lossless, which is not true for bytes/ASCII or bytes/UTF8 conversions.

It's advantages are that It Works (tm) for any byte values (some other solutions work incorrectly with bytes 0x80-0xff) and can be extended to perform more advanced regex matching.

using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

class C {

  public static void Main() {
    byte[] data = {0, 100, 0, 255, 100, 0, 100, 0, 255};
    byte[] pattern = {0, 255};
    foreach (int i in FindAll(data, pattern)) {
      Console.WriteLine(i);
    }
  }

  public static IEnumerable<int> FindAll(
    byte[] haystack,
    byte[] needle
  ) {
    // bytes <-> latin-1 conversion is lossless
    Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
    string sHaystack = latin1.GetString(haystack);
    string sNeedle = latin1.GetString(needle);
    for (Match m = Regex.Match(sHaystack, Regex.Escape(sNeedle));
         m.Success; m = m.NextMatch()) {
      yield return m.Index;
    }
  }
}
Constantin
you shouldn't use strings and regular expressions for stuff like this, it's just abusing them.
Davy Landman
Davy, your remark is highly subjective. Regex is *the* tool for pattern matching and it's not my fault that .NET implementation doesn't accept byte arrays directly. By the way, some regex libraries don't have this limitation.
Constantin
+6  A: 

Out of curiosity, I've created a small benchmark with the different answers.

Here are the results for a million iterations:

solution [Locate]:            00:00:00.7714027
solution [FindAll]:           00:00:03.5404399
solution [SearchBytePattern]: 00:00:01.1105190
solution [MatchBytePattern]:  00:00:03.0658212
Jb Evain
A: 

I would use a solution which does matching by converting to a string...

You should write a simple function implementing the Knuth-Morris-Pratt searching algorithm. This will be the fastest simple algorithm you can use to find the correct indexes.(You could use Boyer-Moore but it will require more setup.

After you have optimized the algorithm, you could try to look for other kind of optimizations. But you should start with the basics.

For example, the current "fastest" is the Locate solution by Jb Evian.

if you look at the core

    for (int i = 0; i < self.Length; i++) {
            if (!IsMatch (self, i, candidate))
                    continue;

            list.Add (i);
    }

After a match of the sub algorithm, it will start to find a match at i + 1, but you already know that the first possible match would be i + candidate.Length. So if you add,

i += candidate.Length -2; //  -2 instead of -1 because the i++ will add the last index

it will be a lot faster when you expect a lot of occurrences of the subset in the superset. (Bruno Conde already does this in his solution)

But this is just a half of the KNP algorithm, you should also add an extra parameter to the IsMatch method called numberOfValidMatches which would be an out parameter.

this would resolve to the following:

int validMatches = 0;
if (!IsMatch (self, i, candidate, out validMatches))
{
    i += validMatches - 1; // -1 because the i++ will do the last one
    continue;
}

and

static bool IsMatch (byte [] array, int position, byte [] candidate, out int numberOfValidMatches)
{
    numberOfValidMatches = 0;
    if (candidate.Length > (array.Length - position))
            return false;

    for (i = 0; i < candidate.Length; i++)
    {
            if (array [position + i] != candidate [i])
                    return false;
            numberOfValidMatches++; 
    }

    return true;
}

A little bit of refactoring and you could use the numberOfValidMatches as the loop variable, and rewrite the Locate loop using a while to avoid the -2 and -1. But I just wanted to make clear how you could add the KMP algorithm.

Davy Landman
"but you already know that the first possible match would be i + candidate.Length" - that's not true - the candidate pattern could have repeats or loops that could cause overlappping matches.
Alnitak
that's the question, In my opinion you'd only want complete non overlapping matches.. This situation is only possible if one or more bytes at the end of the candidate array match the first bytes of the candidate array.
Davy Landman
+6  A: 

Use the efficient Boyer-Moore algorithm.

It's designed to find strings withing strings but you need little imagination to project this to byte arrays.

In general the best answer is: use any string searching algorithm that you like :).

VVS
+1  A: 

Jb Evain's answer has:

 for (int i = 0; i < self.Length; i++) {
      if (!IsMatch (self, i, candidate))
           continue;
      list.Add (i);
 }

and then the IsMatch function first checks whether candidate goes beyond the length of the array being searched.

This would be more efficient if the for loop were coded:

 for (int i = 0; i < self.Length - candidate.Length; i++) {
      if (!IsMatch (self, i, candidate))
           continue;
      list.Add (i);
 }

at this point one could also eliminate the test from the start of IsMatch, so long as you contract via pre-conditions never to call it with "illegal' parameters.

Alnitak
A: 

I've created a new function using the tips from my answer and the answer by Alnitak.

public static List<Int32> LocateSubset(Byte[] superSet, Byte[] subSet)
{
    if ((superSet == null) || (subSet == null))
    {
       throw new ArgumentNullException();
    }
    if ((superSet.Length < subSet.Length) || (superSet.Length == 0) || (subSet.Length == 0))
    {
        return new List<Int32>();
    }
    var result = new List<Int32>();
    Int32 currentIndex = 0;
    Int32 maxIndex =  superSet.Length - subSet.Length;
    while (currentIndex < maxIndex)
    {
         Int32 matchCount = CountMatches(superSet, currentIndex, subSet);
         if (matchCount ==  subSet.Length)
         {
            result.Add(currentIndex);
         }
         currentIndex++;
         if (matchCount > 0)
         {
            currentIndex += matchCount - 1;
         }
    }
    return result;
}

private static Int32 CountMatches(Byte[] superSet, int startIndex, Byte[] subSet)
{
    Int32 currentOffset = 0;
    while (currentOffset < subSet.Length)
    {
        if (superSet[startIndex + currentOffset] != subSet[currentOffset])
        {
            break;
        }
        currentOffset++;
    }
    return currentOffset;
}

the only part I'm not so happy about is the

         currentIndex++;
         if (matchCount > 0)
         {
            currentIndex += matchCount - 1;
         }

part... I would have like to use a if else to avoid the -1, but this results in a better branch prediction (although I'm not that sure it will matter that much)..

Davy Landman
A: 

Why make the simple difficult? This can be done in any language using for loops. Here is one in C#:

using System;
using System.Collections.Generic;

namespace BinarySearch
{
    class Program
    {
        static void Main(string[] args)
        {
            byte[] pattern = new byte[] {12,3,5,76,8,0,6,125};
            byte[] toBeSearched = new byte[] {23,36,43,76,125,56,34,234,12,3,5,76,8,0,6,125,234,56,211,
122,22,4,7,89,76,64,12,3,5,76,8,0,6,125}; List<int> occurences = findOccurences(toBeSearched, pattern); foreach(int occurence in occurences) { Console.WriteLine("Found match starting at 0-based index: " + occurence); } } static List<int> findOccurences(byte[] haystack, byte[] needle) { List<int> occurences = new List<int>(); for (int i = 0; i < haystack.Length; i++) { if (needle[0] == haystack[i]) { bool found = true; int j, k; for (j = 0, k = i; j < needle.Length; j++, k++) { if (k >= haystack.Length || needle[j] != haystack[k]) { found = false; break; } } if (found) { occurences.Add(i - 1); i = k; } } } return occurences; } } }
A: 

thanks for taking the time...

This is the code that i was using/testing with before I asked my question... The reason I ask this question was that I´m certain of that I´m not using the optimal code to do this... so thanks again for taking the time!

   private static int CountPatternMatches(byte[] pattern, byte[] bytes)
   {
        int counter = 0;

        for (int i = 0; i < bytes.Length; i++)
        {
            if (bytes[i] == pattern[0] && (i + pattern.Length) < bytes.Length)
            {
                for (int x = 1; x < pattern.Length; x++)
                {
                    if (pattern[x] != bytes[x+i])
                    {
                        break;
                    }

                    if (x == pattern.Length -1)
                    {
                        counter++;
                        i = i + pattern.Length;
                    }
                }
            }
        }

        return counter;
    }

Anyone that see any errors in my code? Is this considered as an hackish approach? I have tried almost every sample you guys have posted and I seems to get some variations in the match results. I have been running my tests with ~10Mb byte array as my toBeSearched array.

Anders R
A: 

Originally I posted some old code I used but was curious about Jb Evain's benchmarks. I found that my solution was stupid slow. It appears that bruno conde's SearchBytePattern is the fastest. I could not figure why especially since he uses an Array.Copy and an Extension method. But there is proof in Jb's tests, so kudos to bruno.

I simplified the bits even further, so hopefully this will be the clearest and simplest solution. (All the hard work done by bruno conde) The enhancements are:

  • Buffer.BlockCopy
  • Array.IndexOf<byte>
  • while loop instead of a for loop
  • start index parameter
  • converted to extension method

    public static List<int> IndexOfSequence(this byte[] buffer, byte[] pattern, int startIndex)    
    {
       List<int> positions = new List<int>();
       int i = Array.IndexOf<byte>(buffer, pattern[0], startIndex);  
       while (i >= 0 && i <= buffer.Length - pattern.Length)  
       {
          byte[] segment = new byte[pattern.Length];
          Buffer.BlockCopy(buffer, i, segment, 0, pattern.Length);    
          if (segment.SequenceEqual<byte>(pattern))
               positions.Add(i);
          i = Array.IndexOf<byte>(buffer, pattern[0], i + pattern.Length);
       }
       return positions;    
    }
    
GoClimbColorado
A: 

Speed isn't everything. Did you check them for consistency?

I didn't test all the code listed here. I tested my own code (which wasn't totally consistent, I admit) and IndexOfSequence. I found that for many tests IndexOfSequence was quite a bit faster than my code but with repeated testing I found that it was less consistent. In particular it seems to have the most trouble with finding patterns at the end of the array but it will miss them in the middle of the array too sometimes.

My test code isn't designed for efficiency, I just wanted to have a bunch of random data with some known strings inside. That test pattern is roughly like a boundary marker in an http form upload stream. That's what I was looking for when I ran across this code so I figured I'd test it with the kind of data I'll be searching for. It appears that the longer the pattern is the more often IndexOfSequence will miss a value.

private static void TestMethod()
{
    Random rnd = new Random(DateTime.Now.Millisecond);
    string Pattern = "-------------------------------65498495198498";
    byte[] pattern = Encoding.ASCII.GetBytes(Pattern);

    byte[] testBytes;
    int count = 3;
    for (int i = 0; i < 100; i++)
    {
        StringBuilder TestString = new StringBuilder(2500);
        TestString.Append(Pattern);
        byte[] buf = new byte[1000];
        rnd.NextBytes(buf);
        TestString.Append(Encoding.ASCII.GetString(buf));
        TestString.Append(Pattern);
        rnd.NextBytes(buf);
        TestString.Append(Encoding.ASCII.GetString(buf));
        TestString.Append(Pattern);
        testBytes = Encoding.ASCII.GetBytes(TestString.ToString());

        List<int> idx = IndexOfSequence(ref testBytes, pattern, 0);
        if (idx.Count != count)
        {
            Console.Write("change from {0} to {1} on iteration {2}: ", count, idx.Count, i);
            foreach (int ix in idx)
            {
                Console.Write("{0}, ", ix);
            }
            Console.WriteLine();
            count = idx.Count;
        }
    }

    Console.WriteLine("Press ENTER to exit");
    Console.ReadLine();
}

(obviously I converted IndexOfSequence from an extension back into a normal method for this testing)

Here's a sample run of my output:

change from 3 to 2 on iteration 1: 0, 2090,
change from 2 to 3 on iteration 2: 0, 1045, 2090,
change from 3 to 2 on iteration 3: 0, 1045,
change from 2 to 3 on iteration 4: 0, 1045, 2090,
change from 3 to 2 on iteration 6: 0, 2090,
change from 2 to 3 on iteration 7: 0, 1045, 2090,
change from 3 to 2 on iteration 11: 0, 2090,
change from 2 to 3 on iteration 12: 0, 1045, 2090,
change from 3 to 2 on iteration 14: 0, 2090,
change from 2 to 3 on iteration 16: 0, 1045, 2090,
change from 3 to 2 on iteration 17: 0, 1045,
change from 2 to 3 on iteration 18: 0, 1045, 2090,
change from 3 to 1 on iteration 20: 0,
change from 1 to 3 on iteration 21: 0, 1045, 2090,
change from 3 to 2 on iteration 22: 0, 2090,
change from 2 to 3 on iteration 23: 0, 1045, 2090,
change from 3 to 2 on iteration 24: 0, 2090,
change from 2 to 3 on iteration 25: 0, 1045, 2090,
change from 3 to 2 on iteration 26: 0, 2090,
change from 2 to 3 on iteration 27: 0, 1045, 2090,
change from 3 to 2 on iteration 43: 0, 1045,
change from 2 to 3 on iteration 44: 0, 1045, 2090,
change from 3 to 2 on iteration 48: 0, 1045,
change from 2 to 3 on iteration 49: 0, 1045, 2090,
change from 3 to 2 on iteration 50: 0, 2090,
change from 2 to 3 on iteration 52: 0, 1045, 2090,
change from 3 to 2 on iteration 54: 0, 1045,
change from 2 to 3 on iteration 57: 0, 1045, 2090,
change from 3 to 2 on iteration 62: 0, 1045,
change from 2 to 3 on iteration 63: 0, 1045, 2090,
change from 3 to 2 on iteration 72: 0, 2090,
change from 2 to 3 on iteration 73: 0, 1045, 2090,
change from 3 to 2 on iteration 75: 0, 2090,
change from 2 to 3 on iteration 76: 0, 1045, 2090,
change from 3 to 2 on iteration 78: 0, 1045,
change from 2 to 3 on iteration 79: 0, 1045, 2090,
change from 3 to 2 on iteration 81: 0, 2090,
change from 2 to 3 on iteration 82: 0, 1045, 2090,
change from 3 to 2 on iteration 85: 0, 2090,
change from 2 to 3 on iteration 86: 0, 1045, 2090,
change from 3 to 2 on iteration 89: 0, 2090,
change from 2 to 3 on iteration 90: 0, 1045, 2090,
change from 3 to 2 on iteration 91: 0, 2090,
change from 2 to 1 on iteration 92: 0,
change from 1 to 3 on iteration 93: 0, 1045, 2090,
change from 3 to 1 on iteration 99: 0,

I don't mean to pick on IndexOfSequence, it just happened to be the one I started working with today. I noticed at the end of the day that it seemed to be missing patterns in the data so I wrote my own pattern matcher tonight. It's not as fast though. I'm going to tweak it a bit more to see if I can get it 100% consistent before I post it.

I just wanted to remind everyone that they should test things like this to make sure they give good, repeatable results before you trust them in production code.

Steve Hiner
A: 

I tried various solutions and ended up modifying the SearchBytePattern one. I tested on a 30k sequence and it is, fast :)

    static public int SearchBytePattern(byte[] pattern, byte[] bytes)
    {
        int matches = 0;
        for (int i = 0; i < bytes.Length; i++)
        {
            if (pattern[0] == bytes[i] && bytes.Length - i >= pattern.Length)
            {
                bool ismatch = true;
                for (int j = 1; j < pattern.Length && ismatch == true; j++)
                {
                    if (bytes[i + j] != pattern[j])
                        ismatch = false;
                }
                if (ismatch)
                {
                    matches++;
                    i += pattern.Length - 1;
                }
            }
        }
        return matches;
    }

Let me know your thoughts.

Sorin S.