views: 263

answers: 7

Possible Duplicate:
What algorithm .Net use for searching a pattern in a string?

I have a loop in my program that gets a line from a file. Then there is a check as to whether the line contains a string:

if(line.Contains("String"))
{
    //Do other stuff
}

There are over 2 million rows in the file, so if I can shave even 1/10th of a millisecond off each check then this would save me over 3 minutes on each run.

So... say a line is 1000 chars long: is it quicker to look for a short or a long string, or does it not make a difference?

line.Contains("ABCDEFGHIJKLMNOPQRSTUVWXYZ");

or

line.Contains("ABCDEFG")

Thank you in advance.

+4  A: 

"Generally the algorithm gets faster as the key being searched for becomes longer"

http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm
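
For illustration, here is a rough standalone sketch of the Boyer-Moore-Horspool variant in C# (hypothetical code, not what .NET actually runs internally). It shows why longer patterns help: the bad-character table lets the search skip ahead by up to the full pattern length on a mismatch.

    // Hypothetical Boyer-Moore-Horspool sketch - NOT the .NET implementation.
    // Requires: using System.Collections.Generic;
    static int HorspoolIndexOf(string text, string pattern)
    {
        if (pattern.Length == 0) return 0;

        // Bad-character table: how far the window may shift based on the character
        // aligned with its last position. Default shift is the full pattern length.
        var shift = new Dictionary<char, int>();
        for (int i = 0; i < pattern.Length - 1; i++)
            shift[pattern[i]] = pattern.Length - 1 - i;

        int pos = 0;
        while (pos <= text.Length - pattern.Length)
        {
            int j = pattern.Length - 1;
            while (j >= 0 && text[pos + j] == pattern[j])
                j--;
            if (j < 0) return pos;                      // full match found

            char last = text[pos + pattern.Length - 1]; // character under the window's end
            int s;
            pos += shift.TryGetValue(last, out s) ? s : pattern.Length;
        }
        return -1;                                      // no match
    }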

soniiic
Answered the question perfectly. Thank you.
Ash Burlaczenko
A: 

The String.Contains method returns true if and only if this string contains the specified sequence of char values.

It is basically used to check whether a substring occurs within a string.

http://csharp.net-informations.com/string/csharp-string-contains.htm
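
For example (a trivial sketch; note that this overload of Contains performs an ordinal, case-sensitive comparison and simply returns a bool):

    string line = "Order=42;Status=Shipped";

    bool hasStatus = line.Contains("Status");   // true
    bool wrongCase = line.Contains("status");   // false - comparison is case-sensitive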

Adi_aks
+1  A: 

If .NET's String.Contains uses the Boyer-Moore algorithm, it is faster to search for a longer string.

May I suggest that if you are reading a delimited file (for example, each line is a series of text fields separated by commas), you could save some searching time by not searching from position 0 in the line if you know the value can't appear before, say, character 40.

It is quite common to split a line on a delimiter character using the String.Split function, which returns an array of strings. Then you would search only within the field where the value could appear. This will be quicker too.
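
A rough sketch of both ideas, assuming a comma-delimited line where the value can only appear in the third field (the field index and the 40-character offset are made up for illustration):

    // Option 1: split on the delimiter and search only the relevant field.
    string[] fields = line.Split(',');
    if (fields.Length > 2 && fields[2].Contains("String"))
    {
        // Do other stuff
    }

    // Option 2: skip the known prefix without allocating field strings.
    if (line.Length > 40 && line.IndexOf("String", 40, StringComparison.Ordinal) >= 0)
    {
        // Do other stuff
    }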

Tony Lambert
A: 

Contains, like all the string searching methods (I think), eventually leads to the external method InternalFindNLSStringEx - whatever that does, it's fairly ungoogleable.

annakata
+9  A: 

String.Contains() follows a torturous route through System.Globalization.CompareInfo into the CLR and the NLS support sub-system where I thoroughly got lost. This contains highly optimized code with impressive perf. The only way to do it faster is through pinvoking the standard CRT function wcsstr, available in msvcrt.dll

    [DllImport("msvcrt.dll", CharSet = CharSet.Unicode)]
    private static extern IntPtr wcsstr(string toSearch, string toFind);
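
A rough sketch of how that declaration might be wrapped, treating the return value purely as a found/not-found flag:

    // Hypothetical wrapper around the P/Invoke declaration above.
    static bool ContainsViaWcsstr(string line, string value)
    {
        // wcsstr returns a pointer to the first match, or null (IntPtr.Zero) when not found.
        return wcsstr(line, value) != IntPtr.Zero;
    }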

It returns IntPtr.Zero if the string wasn't found. I made some measurements, using String.IndexOf() instead of Contains() so I could test the various string comparison options. All times are in nanoseconds, searching for a string of 7 characters in a string of 60 characters. The string is not present, so the worst case is measured. The lowest time out of a sample of 20 was used:

StringComparison.Ordinal (same as Contains) : 245 nanoseconds
StringComparison.OrdinalIgnoreCase : 327
StringComparison.InvariantCulture : 251
StringComparison.InvariantCultureIgnoreCase : 327
StringComparison.CurrentCulture : 275
StringComparison.CurrentCultureIgnoreCase : 340
wcsstr : 213

Very impressive numbers and on par with what you'd expect these functions to require. The wcsstr() function does the same kind of ordinal comparison as String.Compare(). It is only 13% faster, a statistically insignificant improvement given that real perf is not likely to approach these measurements due to the effects of CPU cache locality. I can only conclude that you're going about as fast as you can expect. Whether the slight improvement from wcsstr is worth it is up to you.

Hans Passant
+1 for going that far down the rabbit hole.
Tj Kellie
How did you measure the nanoseconds? Did you run a thousand (or more) iterations and divide the total?
Bobby
Surely he just used a good ole' stop watch? :P
Ian
@Bobby: yes, run it a million times and time it with Stopwatch. It isn't a true test since it can only measure best-case times. Actual times can easily be double what I reported, especially since it takes so little time. This code is *fast*, they did an excellent job of optimizing it.
Hans Passant
Thanks for going into so much detail. Changing to wcsstr would at most save me 80 million nanoseconds. That is less than 1/10th of a second, so I think I'll pass. Cheers anyway.
Ash Burlaczenko
+1  A: 

When it comes to increasing performance, you want to make sure that what you intend to improve is actually the cause of the bottleneck. I would start by profiling the performance of the various pieces in my code and finding which piece introduces the greatest bottleneck.

At first blush, I see three main components of this process that are worth profiling:

  • Loop through the entire contents of the file (this may or may not include File IO considerations, which you may wish to test separately).
  • Check String.Contains() for each line in the file.
  • "Do Stuff" when there is a match

While there are great commercial profilers out there, you can start by using a simple timer (like System.Diagnostics.Stopwatch) in your code or, if the process is really long, simply use a watch to measure the time (a rough sketch follows the numbered list below).

Measure the time for each of the following:

  1. Measure the time it takes simply to look through the entire file without doing anything else. This isolates your looping and IO code so you can see how well it performs.
  2. Next, add only the string comparison (String.Contains) and measure the total time it takes with this added.
  3. Lastly, add in the "do stuff" code and measure the total time with this added.
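
A minimal sketch of that staged timing with Stopwatch (the file name "data.txt" and the search string are placeholders; comment stages in or out to isolate them):

    var sw = System.Diagnostics.Stopwatch.StartNew();

    foreach (string line in System.IO.File.ReadLines("data.txt"))  // stage 1: loop + IO only
    {
        if (line.Contains("String"))                               // stage 2: add the comparison
        {
            // stage 3: add the "do stuff" work here
        }
    }

    sw.Stop();
    Console.WriteLine("Total elapsed: " + sw.Elapsed);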

You can now compare the cost of adding in each component to figure out how expensive that component is and whether or not you need to improve it. For example, let's say you recorded the following times:

Test           Total Time   Cost (Difference)
=============================================
Do Nothing     0s           0s
Loop Only      100s         100s
Add Comparison 105s         5s
Add Do Stuff   130s         25s

Looking at those (fake) numbers, the most expensive part of the process is the loop and IO code, so that's where I would begin to try to increase performance. Since the "Do Stuff" part added 25 seconds to the overall execution time, I would take a look at that code next to see if there was anything I could improve. Lastly, I would look at string comparison.

Of course, the time measurements I provided are fictitious, but you can apply the same steps to your performance profiling in your application. The key is to identify the greatest costs and try to reduce those first.

Ben McCormack
Thank you for your answer, I will do that now. +1
Ash Burlaczenko
I would also benchmark the file reading. It may be that you can speed this up enormously by reading the whole file into memory using unbuffered I/O and then searching through it. The disk access will be the slowest part by far.
Tony Lambert
Rather than adding Stopwatch calls you might want to check out a profiler; it will tell you which actual functions and statements are taking the time. Reading the clock on the computer can skew the results.
Tony Lambert
@Tony great comments. The test for File IO is part of the "loop only" test in my example; I think if you change your file IO, you'd have to do two *sets* of tests, but I would agree that it's probably worth doing since IO can be very expensive. I referenced commercial profilers in my answer, and while they would certainly be the ideal solution, if you can already measure your execution time in *minutes*, a regular stopwatch would probably be OK for the simple modifications you're trying to make.
Ben McCormack
A: 

To my surprise, a compiled Regex is actually much, much faster than Contains. Of course, the cost of compiling it is only worth paying if you search for the same string multiple times. But if so, it's more than twice as fast. Try it yourself:

// Requires: System, System.Diagnostics, System.Linq, System.Text.RegularExpressions
static void Test()
{
  var random = new Random(10);

  // Build a 10,000,000-character string of random lowercase letters,
  // then take a 4,096-character slice from the middle as the search string.
  var alphabet = "abcdefghijklmnopqrstuvwxyz";
  var content = new String((from x in Enumerable.Range(0, 10000000)
                            select alphabet[random.Next(0, alphabet.Length)]).ToArray());

  var searchString = content.Substring(5000000, 4096);

  // RegexOptions.Compiled pays the compilation cost up front, as discussed above.
  var regex = new Regex(searchString, RegexOptions.Compiled);

  var sw = Stopwatch.StartNew();
  for (int i = 0; i < 1000; i++)
  {
    if (!regex.IsMatch(content))
    {
      throw new Exception();
    }
  }

  sw.Stop();
  Console.WriteLine("Regex: " + sw.Elapsed);
  sw.Restart();

  for (int i = 0; i < 1000; i++)
  {
    if (!content.Contains(searchString))
    {
      throw new Exception();
    }
  }
  sw.Stop();
  Console.WriteLine("String.Contains: " + sw.Elapsed);

}

I still haven't figured out how it can be this fast; looking at the compiled assembly, it's an obfuscated mess of switch statements. But it's fast, really fast.

JulianR
It's because Boyer-Moore requires a preprocessing step; the compiled regex can do that at compile time, and even just reusing the regex reuses the jump tables instead of having to recompute them.
Ben Voigt