I have a UTF-8 byte array of data. I would like to search for a specific string in the array of bytes in C#.

byte[] dataArray = (some UTF-8 byte array of data);

string searchString = "Hello";

How do I find the first occurrence of the word "Hello" in the array dataArray and return an index location where the string begins (where the 'H' from 'Hello' would be located in dataArray)?

Before, I was erroneously using something like:

int helloIndex = Encoding.UTF8.GetString(dataArray).IndexOf("Hello");

That code is not guaranteed to work, because it returns a character index into the decoded string rather than a byte index into the UTF-8 array; the two differ as soon as a multi-byte character precedes the match. Are there any built-in C# methods or proven, efficient code I can reuse?

Thanks,

Matt

A: 

One of the nice features of UTF-8 is that if a sequence of bytes represents a character, then wherever that sequence appears in valid UTF-8 encoded data it always represents that character.
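The reason this holds is that, in well-formed UTF-8, single-byte (ASCII) characters, lead bytes, and continuation bytes occupy disjoint ranges, so the bytes of "Hello" cannot turn up in the middle of some other character. As a rough illustration only, here is a small sketch; the class name `Utf8Ranges` and the method `Classify` are made up for this example and are not part of any library.

```csharp
using System;
using System.Text;

public static class Utf8Ranges
{
    // Classifies a single byte taken from well-formed UTF-8 data.
    // (Hypothetical helper for illustration; it does not validate input.)
    public static string Classify(byte b)
    {
        if (b <= 0x7F) return "ascii";         // 0xxxxxxx: a complete one-byte character
        if (b <= 0xBF) return "continuation";  // 10xxxxxx: second or later byte of a character
        return "lead";                         // 11xxxxxx: first byte of a multi-byte character
    }

    // Returns one classification per byte of the UTF-8 encoding of text.
    public static string[] ClassifyAll(string text)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(text);
        string[] kinds = new string[bytes.Length];
        for (int i = 0; i < bytes.Length; i++)
        {
            kinds[i] = Classify(bytes[i]);
        }
        return kinds;
    }
}
```

For example, `Utf8Ranges.ClassifyAll("ʤH")` classifies the three bytes 0xCA, 0xA4, 0x48 as lead, continuation, and ascii, so the byte for 'H' can only ever match a real 'H'.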

Knowing this, you can convert the string you are searching for to a byte array and then use the Boyer-Moore string searching algorithm (or any other string searching algorithm you like) adapted slightly to work on byte arrays instead of strings.
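A minimal sketch of that approach follows. It uses Boyer-Moore-Horspool, a simplified variant of Boyer-Moore that keeps only the bad-character shift table, rather than the full algorithm. The class name `ByteSearch` and its method names are invented for this example, and the code assumes `utf8Data` really is valid UTF-8.

```csharp
using System;
using System.Text;

public static class ByteSearch
{
    // Returns the byte index of the first occurrence of needle in haystack,
    // or -1 if it does not occur. Boyer-Moore-Horspool over raw bytes.
    public static int IndexOf(byte[] haystack, byte[] needle)
    {
        if (haystack == null) throw new ArgumentNullException(nameof(haystack));
        if (needle == null) throw new ArgumentNullException(nameof(needle));
        if (needle.Length == 0) return 0;
        if (needle.Length > haystack.Length) return -1;

        int last = needle.Length - 1;

        // Shift table: how far the window can move, keyed by the haystack
        // byte currently aligned with the last position of the needle.
        int[] shift = new int[256];
        for (int i = 0; i < shift.Length; i++) shift[i] = needle.Length;
        for (int i = 0; i < last; i++) shift[needle[i]] = last - i;

        int pos = 0;
        while (pos <= haystack.Length - needle.Length)
        {
            // Compare the window against the needle from right to left.
            int j = last;
            while (j >= 0 && haystack[pos + j] == needle[j]) j--;
            if (j < 0) return pos;

            pos += shift[haystack[pos + last]];
        }
        return -1;
    }

    // Encodes text as UTF-8 and searches for those bytes in utf8Data.
    public static int IndexOfUtf8(byte[] utf8Data, string text)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        return IndexOf(utf8Data, Encoding.UTF8.GetBytes(text));
    }
}
```

With this, `ByteSearch.IndexOfUtf8(dataArray, "Hello")` gives the byte offset of the 'H' directly, without decoding the array. The match is byte-for-byte, so it is case-sensitive and does not account for Unicode normalization. On newer .NET versions, the span-based `IndexOf` extension methods in `System.MemoryExtensions` may be a built-in alternative worth checking.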

There are a number of answers here that can help you:

Mark Byers
A: 

Try the following snippet:

using System;
using System.Text;

// Set up a small test: "ʤ" (U+02A4) takes two bytes in UTF-8.
string sourceText = "ʤhello";
byte[] dataArray = Encoding.UTF8.GetBytes(sourceText);

// Decode the bytes into a string we can search in.
// This assumes dataArray is valid UTF-8; invalid bytes decode to U+FFFD,
// which would skew the byte count below.
string decodedText = Encoding.UTF8.GetString(dataArray);
int position = decodedText.IndexOf("hello", StringComparison.Ordinal);

if (position < 0)
{
    Console.WriteLine("Not found");
}
else
{
    // The number of UTF-8 bytes needed to encode the text before the match
    // is the byte offset of the match, as opposed to the character position.
    int bytesBefore = Encoding.UTF8.GetByteCount(decodedText.Substring(0, position));

    // This outputs "Position is 1 and before is 2".
    Console.WriteLine("Position is {0} and before is {1}", position, bytesBefore);
}
Pieter