views:

1063

answers:

8

I find that my program is searching through lots of lengthy strings (20,000+) trying to find a particular unique phrase.

What is the most efficent method for doing this in C#?

Below is the current code which works like this:

  1. The search begins at startPos because the target area is somewhat removed from the start
  2. It loops through the string, at each step it checks if the substring from that point starts with the startMatchString, which is an indicator that the start of the target string has been found. (The length of the target string varys).
  3. From here it creates a new substring (chopping off the 11 characters that mark the start of the target string) and searches for the endMatchString

I already know that this is a horribly complex and possibly very inefficent algorithm. What is a better way to accomplish the same result?

string result = string.Empty;    
for (int i = startPos; i <= response.Length - 1; i++)
{
   if (response.Substring(i).StartsWith(startMatchString))
   {
       string result = response.Substring(i).Substring(11);
       for (int j = 0; j <= result.Length - 1; j++)
       {
           if (result.Substring(j).StartsWith(endMatchString))
           {
               return result.Remove(j)
           }
       }
   }
}
return result;
+3  A: 

I would try to use a Regular Expression instead of rolling my own string search algorithm. You can precompile the regular expression to make it run faster.

driis
A: 

You could use a regex; it’s optimized for this kind of searching and manipulation.

You could also try IndexOf ...

string result = string.Empty;

if (startPos >= response.Length) 
    return result;

int startingIndex = response.IndexOf(startMatchString, startPos);
int rightOfStartIndex = startingIndex + startMatchString.Length;

if (startingIndex > -1 && rightOfStartIndex < response.Length)
{
    int endingIndex = response.IndexOf(endMatchString, rightOfStartIndex);
    if (endingIndex > -1)
        result = response.Substring(rightOfStartIndex, endingIndex - rightOfStartIndex);
}

return result;
Nescio
really? because I often hear that regex is considered inefficient
Harry
Try it first and optimize later if needed.
Ed Swangren
+6  A: 

It depends on what you're trying to find in the string. If you're looking for a specific sequence IndexOf/Contains are fast, but if you're looking for wild card patterns Regex is optimized for this kind of search.

Brian Rasmussen
The start and end matchstrings don't change. it is the string that lies between these two that is the point of interest.
Harry
@Harry: Is it fixed at runtime, but unknown at compile time or do you want to match wild cards? For the latter Regex would be my choice. Try that and profile it to see if it will be fast enough.
Brian Rasmussen
@Brian: the start and end matchstrings are fixed and known at compile time
Harry
A: 

For very long strings you cannot beat the boyer-moore search algorithm. It is more complex than I might try to explain here, but The CodeProject site has a pretty good article on it.

Khadaji
A: 

As said before regex is your friend. You might want to look at RegularExpressions.Group. This way you can name part of the matched resultset.

Here is an example

Ruben
+7  A: 

There are a whole bunch of algorithms,

  • boyer and moore
  • Sunday
  • Knuth-Morris-Pratt
  • Rabin-Karp

I would recommend to use the simplified Boyer-Moore, called Boyer–Moore–Horspool.

The C-code appears at the wikipedia. For the java code look at

http://www.fmi.uni-sofia.bg/fmi/logic/vboutchkova/sources/BoyerMoore_java.html

A nice article about these is available under http://www.ibm.com/developerworks/java/library/j-text-searching.html

If you want to use built-in stuff go for regular expressions.

Luixv
+2  A: 

You can use String.IndexOf, but make sure you use StringComparison.Ordinal or it may be one order of magnitude slower.

private string Search2(int startPos, string startMatchString, string endMatchString, string response) {
    int startMarch = response.IndexOf(startMatchString, startPos, StringComparison.Ordinal);
    if (startMarch != -1) {
        startMarch += startMatchString.Length;
        int endMatch = response.IndexOf(endMatchString, startMarch, StringComparison.Ordinal);
        if (endMatch != -1) { return response.Substring(startMarch, endMatch - startMarch); }
    }
    return string.Empty;
}

Searching 1000 times a string at about the 40% of a 183 KB file took about 270 milliseconds. Without StringComparison.Ordinal it took about 2000 milliseconds.
Searching 1 time with your method took over 60 seconds as it creates a new string (O(n)) each iteration, making your method O(n^2).

ggf31416
A: 

Here's an example using IndexOf (beware: written from the top of my head, didn't test it):

int skip = 11;
int start = response.IndexOf(startMatchString, startPos);
if (start >= 0)
{
   int end = response.IndexOf(startMatchString, start + skip);
   if (end >= 0)
       return response.Substring(start + skip, end - start - skip);
   else
       return response.Substring(start + skip);
}
return string.Empty;
oefe