views:

576

answers:

3

How do you think what is the best way to find position in the System.Stream where given byte sequence starts (first occurence):

public static long FindPosition(Stream stream, byte[] byteSequence)
{
    long position = -1;

    /// ???
    return position;
}

P.S. The simpliest yet fastest solution is preffered. :)

+2  A: 

You'll basically need to keep a buffer the same size as byteSequence so that once you've found that the "next byte" in the stream matches, you can check the rest but then still go back to the "next but one" byte if it's not an actual match.

It's likely to be a bit fiddly whatever you do, to be honest :(

Jon Skeet
+1  A: 

If you treat the stream like another sequence of bytes, you can just search it like you were doing a string search. Wikipedia has a great article on that. Boyer-Moore is a good and simple algorithm for this.

Here's a quick hack I put together in Java. It works and it's pretty close if not Boyer-Moore. Hope it helps ;)

public static final int BUFFER_SIZE = 32;

public static int [] buildShiftArray(byte [] byteSequence){
 int [] shifts = new int[byteSequence.length];
 int [] ret;
 int shiftCount = 0;
 byte end = byteSequence[byteSequence.length-1];
 int index = byteSequence.length-1;
 int shift = 1;

 while(--index >= 0){
  if(byteSequence[index] == end){
   shifts[shiftCount++] = shift;
   shift = 1;
  } else {
   shift++;
  }
 }
 ret = new int[shiftCount];
 for(int i = 0;i < shiftCount;i++){
  ret[i] = shifts[i];
 }
 return ret;
}

public static byte [] flushBuffer(byte [] buffer, int keepSize){
 byte [] newBuffer = new byte[buffer.length];
 for(int i = 0;i < keepSize;i++){
  newBuffer[i] = buffer[buffer.length - keepSize + i];
 }
 return newBuffer;
}

public static int findBytes(byte [] haystack, int haystackSize, byte [] needle, int [] shiftArray){
 int index = needle.length;
 int searchIndex, needleIndex, currentShiftIndex = 0, shift;
 boolean shiftFlag = false;

 index = needle.length;
 while(true){
  needleIndex = needle.length-1;
  while(true){
   if(index >= haystackSize)
    return -1;
   if(haystack[index] == needle[needleIndex])
    break;
   index++;
  }
  searchIndex = index;
  needleIndex = needle.length-1;
  while(needleIndex >= 0 && haystack[searchIndex] == needle[needleIndex]){
   searchIndex--;
   needleIndex--;
  }
  if(needleIndex < 0)
   return index-needle.length+1;
  if(shiftFlag){
   shiftFlag = false;
   index += shiftArray[0];
   currentShiftIndex = 1;
  } else if(currentShiftIndex >= shiftArray.length){
   shiftFlag = true;
   index++;
  } else{
   index += shiftArray[currentShiftIndex++];
  }   
 }
}

public static int findBytes(InputStream stream, byte [] needle){
 byte [] buffer = new byte[BUFFER_SIZE];
 int [] shiftArray = buildShiftArray(needle);
 int bufferSize, initBufferSize;
 int offset = 0, init = needle.length;
 int val;

 try{
  while(true){
   bufferSize = stream.read(buffer, needle.length-init, buffer.length-needle.length+init);
   if(bufferSize == -1)
    return -1;
   if((val = findBytes(buffer, bufferSize+needle.length-init, needle, shiftArray)) != -1)
    return val+offset;
   buffer = flushBuffer(buffer, needle.length);
   offset += bufferSize-init;
   init = 0;
  }
 } catch (IOException e){
  e.printStackTrace();
 }
 return -1;
}
dharga
it might not be simplest, but it's pretty fast. it think given the constraints of reading from a stream doesn't allow for simple if you want speed. but I hope my code can alleviate some of your troubles, or be of help to someone in the future.
dharga
+1  A: 

I've reached this solution.

I did some benchmarks with an ASCII file that was 3.050 KB and 38803 lines. With a search byte array of 22 bytes in the last line of the file I've got the result in about 2.28 seconds (in a slow/old machine).

public static long FindPosition(Stream stream, byte[] byteSequence)
{
    if (byteSequence.Length > stream.Length)
        return -1;

    byte[] buffer = new byte[byteSequence.Length];

    using (BufferedStream bufStream = new BufferedStream(stream, byteSequence.Length))
    {
        int i;
        while ((i = bufStream.Read(buffer, 0, byteSequence.Length)) == byteSequence.Length)
        {
            if (byteSequence.SequenceEqual(buffer))
                return bufStream.Position - byteSequence.Length;
            else
                bufStream.Position -= byteSequence.Length - PadLeftSequence(buffer, byteSequence);
        }
    }

    return -1;
}

private static int PadLeftSequence(byte[] bytes, byte[] seqBytes)
{
    int i = 1;
    while (i < bytes.Length)
    {
        int n = bytes.Length - i;
        byte[] aux1 = new byte[n];
        byte[] aux2 = new byte[n];
        Array.Copy(bytes, i, aux1, 0, n);
        Array.Copy(seqBytes, aux2, n);
        if (aux1.SequenceEqual(aux2))
            return i;
        i++;
    }
    return i;
}
bruno conde