tags:

views:

946

answers:

2

I have a regex to find the nth occurrence of a character in a string, here's the code:

 public static int NthIndexOf(this string target, string value, int n)
    {
        Match m = Regex.Match(target, "((" + value + ").*?){" + n + "}");

        if (m.Success)
        {
            return m.Groups[2].Captures[n - 1].Index;
        }
        else
        {
             return -1;
        }
    }

Now, I have 1594 entries in this string, with 1593 semicolons. If I write:

tempstring.NthIndexOf(";", 1593) 

The answer comes back immediately and correctly. If I give it anything over 1594 it hangs. Does anyone know how to fix this?

Test Case

 string holder = "test;test2;test3";
        string test = "";
        for (int i = 0; i < 600; i++)
        {
            test += holder;
        }
        int index = test.NthIndexOf(";", 2000);

This takes a very long time. Change 600 to 6 and it is very fast. Make 2000 to 1700 and it is very fast as well.

Why is my regular expression so slow?

+9  A: 

If you're really only looking for character repetitions, and not string repetitions, then you should be able to replace you method with something simple like

public static int NthIndexOf(this string target, char testChar, int n)
{
   int count = 0;

   for(int i=0; i<target.Length; i++)
   {
      if(target[i] == testChar)
      {
         count++;
         if(count == n) return i;  
      }
   }

   return -1;
}

and use that. It should have far fewer limitations.

As for why your original regex is going slow, here's what I suspect:

  • For your fast case, it's working because it can find a match on it's first pass through (with each group matching exactly one character)
  • For the slow case is because it can't find a match (and won't ever find one, because there aren't enough semicolons to satisfy the regex), but it recursively tries every possible way to break up the string (which is a really big operation)
Daniel LeCheminant
+1 - don't ever use regex when you're doing simple string operations. Use builtin functions. This one might be even faster if it used something like (I guess .net has some internal .indexof):while (n-- > 0) { index = target.IndexOf(textChar, index+1)); if (index==-1) return -1; }return index;
viraptor
@viraptor: +1 that comment. .NET of course has IndexOf().
Tomalak
I expect internal optimizations here... Sure enough, Reflector reveals: public extern int IndexOf(char value, int startIndex, int count);
Tomalak
well, initially this was meant to be more general purpose. I do agree with you however.
Steve
+2  A: 

Try to use a more distinct and efficient regular expression:

"^(?:[^" + value + "]*" + value + "){" + (n - 1) + "}([^" + value + "]*)

This will build the following regular expression for tempstring.NthIndexOf(";", 1593):

^(?:[^;]*;){1592}([^;]*)

But this will only work for single characters as separator.

Another approach would be to step through each character and count the occurences of the character you were looking for.

Gumbo
Note that that only works if "value" is a single character...
David Zaslavsky
@David: Thanks for your remark.
Gumbo
+1 for providing a faster regex. Does .NET regex support atomic grouping (?>...)? This could help improve performance. I'm not sure if possessive quantifiers ({1592}+) are supported, but they would speed up the process as well.
Tomalak