This is just a question to satisfy my curiosity. But to me it is interesting.

I wrote this simple little benchmark. It runs three variants of Regex execution in random order a few thousand times.

Basically, I use the same pattern in three different ways:

  1. The ordinary way, without any RegexOptions. Starting with .NET 2.0 these instances do not get cached automatically, but this one is effectively "cached" because it is held in a static field and never reset.

  2. With RegexOptions.Compiled

  3. With a call to the static Regex.Match(input, pattern), which does get cached in .NET 2.0

Here is the code:

    using System;
    using System.Collections.Generic;
    using System.Diagnostics;
    using System.Text.RegularExpressions;

    class Program
    {
      static List<string> Strings = new List<string>();
      static string pattern = ".*_([0-9]+)\\.([^\\.])$";

      // Variant 1: plain instance, constructed once.
      static Regex Rex = new Regex(pattern);
      // Variant 2: compiled instance, constructed once.
      static Regex RexCompiled = new Regex(pattern, RegexOptions.Compiled);

      // Fixed seed, so every run uses the same inputs and the same call order.
      static Random Rand = new Random(123);

      // One stopwatch per variant; each accumulates time across passes.
      static Stopwatch S1 = new Stopwatch();
      static Stopwatch S2 = new Stopwatch();
      static Stopwatch S3 = new Stopwatch();

      static void Main()
      {
        int k = 0;   // matches found in the current pass
        int c = 0;   // variant chosen for the current pass
        int c1 = 0;  // number of passes run for each variant
        int c2 = 0;
        int c3 = 0;

        // Build 50 test inputs.
        for (int i = 0; i < 50; i++)
        {
          Strings.Add("file_" + Rand.Next().ToString() + ".ext");
        }
        int m = 10000;
        for (int j = 0; j < m; j++)
        {
          // Pick variant 1, 2, or 3 at random (the upper bound is exclusive).
          c = Rand.Next(1, 4);

          if (c == 1)
          {
            c1++;
            k = 0;
            S1.Start();
            foreach (var item in Strings)
            {
              var m1 = Rex.Match(item);
              if (m1.Success) { k++; }
            }
            S1.Stop();
          }
          else if (c == 2)
          {
            c2++;
            k = 0;
            S2.Start();
            foreach (var item in Strings)
            {
              var m2 = RexCompiled.Match(item);
              if (m2.Success) { k++; }
            }
            S2.Stop();
          }
          else if (c == 3)
          {
            c3++;
            k = 0;
            S3.Start();
            foreach (var item in Strings)
            {
              // Variant 3: static call, which goes through the framework's cache.
              var m3 = Regex.Match(item, pattern);
              if (m3.Success) { k++; }
            }
            S3.Stop();
          }
        }

        // Note: the "adjusted" values below scale by cN/c1, as in the original runs;
        // normalizing to c1 passes would use c1/cN instead.
        Console.WriteLine("c: {0}", c1);
        Console.WriteLine("Total milliseconds: " + (S1.Elapsed.TotalMilliseconds).ToString());
        Console.WriteLine("Adjusted milliseconds: " + (S1.Elapsed.TotalMilliseconds).ToString());

        Console.WriteLine("c: {0}", c2);
        Console.WriteLine("Total milliseconds: " + (S2.Elapsed.TotalMilliseconds).ToString());
        Console.WriteLine("Adjusted milliseconds: " + (S2.Elapsed.TotalMilliseconds*((float)c2/(float)c1)).ToString());

        Console.WriteLine("c: {0}", c3);
        Console.WriteLine("Total milliseconds: " + (S3.Elapsed.TotalMilliseconds).ToString());
        Console.WriteLine("Adjusted milliseconds: " + (S3.Elapsed.TotalMilliseconds*((float)c3/(float)c1)).ToString());
      }
    }

Every time I run it, the result is along the lines of:

    Not compiled and not automatically cached:
    Total milliseconds: 6185,2704
    Adjusted milliseconds: 6185,2704

    Compiled and not automatically cached:
    Total milliseconds: 2562,2519
    Adjusted milliseconds: 2551,56949184038

    Not compiled and automatically cached:
    Total milliseconds: 2378,823
    Adjusted milliseconds: 2336,3187176891

So there you have it. Not much, but the compiled variant is about 7-8% slower than the static, automatically cached call.

That is not the only mystery. I cannot explain why the first variant is so much slower, since its Regex is never re-created but held in a static variable.

By the way, this is on .NET 3.5 and Mono 2.2, both on Windows, and they behave exactly the same.

So, any ideas why the compiled variant would fall behind at all?

EDIT1:

After fixing the code (the arguments to the static Regex.Match call were swapped), the results now look like this:

    Not compiled and not automatically cached:
    Total milliseconds: 6456,5711
    Adjusted milliseconds: 6456,5711

    Compiled and not automatically cached:
    Total milliseconds: 2668,9028
    Adjusted milliseconds: 2657,77574842168

    Not compiled and automatically cached:
    Total milliseconds: 6637,5472
    Adjusted milliseconds: 6518,94897724836

This pretty much makes all of the other questions moot as well.

Thanks for the answers.

+3  A: 

I noticed similar behavior. I also wondered why the compiled version would be slower, but noticed that above a certain number of calls, the compiled version is faster. So I dug in with Reflector a little, and I noticed that for a compiled Regex, there's still a little setup that is performed on the first call (specifically, creating an instance of the appropriate RegexRunner object).

In my test, I found that if I moved both the constructor and an initial throw-away call to the regex outside the timer start, the compiled regex won no matter how many iterations I ran.
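That setup can be kept out of the measurement with something like the following. This is only a minimal sketch, not my actual test code: the `WarmupDemo` class, the `TimeWarm` helper, the pattern, and the input string are all made up for illustration.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

static class WarmupDemo
{
    // Hypothetical pattern and input, chosen only for this illustration.
    const string Pattern = @"_([0-9]+)\.[^.]+$";
    const string Input = "file_12345.ext";

    // Times `iterations` matches after one untimed throw-away call,
    // so any first-call setup is excluded from the measurement.
    public static TimeSpan TimeWarm(Regex regex, string input, int iterations)
    {
        regex.Match(input); // throw-away call, not timed

        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            regex.Match(input);
        }
        sw.Stop();
        return sw.Elapsed;
    }

    static void Main()
    {
        // Both constructors run before any timer starts.
        Regex interpreted = new Regex(Pattern);
        Regex compiled = new Regex(Pattern, RegexOptions.Compiled);

        Console.WriteLine("Interpreted: {0} ms",
            TimeWarm(interpreted, Input, 100000).TotalMilliseconds);
        Console.WriteLine("Compiled:    {0} ms",
            TimeWarm(compiled, Input, 100000).TotalMilliseconds);
    }
}
```

The absolute numbers will vary by machine and runtime; the point is only that construction and the first call happen before the stopwatch starts.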


Incidentally, the caching that the framework is doing when using static Regex methods is an optimization that's only needed when using static Regex methods. This is because every call to a static Regex method creates a new Regex object. In the Regex class's constructor it must parse the pattern. The caching allows subsequent calls of static Regex methods to reuse the RegexTree parsed from the first call, thereby avoiding the parsing step.

When you use instance methods on a single Regex object, then this is not an issue. The parsing is still only performed one time (when you create the object). In addition, you get to avoid running all the other code in the constructor, as well as the heap allocation (and subsequent garbage collection).
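To make the two call styles concrete, here is a small sketch. The `StaticVsInstance` class, its method names, and the pattern are assumptions made for illustration; only `Regex.IsMatch` and `Regex.CacheSize` are framework members.

```csharp
using System;
using System.Text.RegularExpressions;

static class StaticVsInstance
{
    // Hypothetical pattern, chosen only for this illustration.
    const string Pattern = @"_([0-9]+)\.[^.]+$";

    // Parsed once, when the class is initialized, and reused afterwards.
    static readonly Regex Shared = new Regex(Pattern);

    // Static style: goes through the framework's cache of recently used
    // patterns (its size is exposed as Regex.CacheSize).
    public static bool MatchStatic(string input)
    {
        return Regex.IsMatch(input, Pattern);
    }

    // Instance style: no cache lookup, the Regex object already exists.
    public static bool MatchInstance(string input)
    {
        return Shared.IsMatch(input);
    }

    static void Main()
    {
        string input = "file_123.ext";
        Console.WriteLine(MatchStatic(input) == MatchInstance(input)); // prints "True"
    }
}
```

Both styles give the same answer for the same pattern and input; they differ only in how much work happens on each call.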

Martin Brown noticed that you reversed the arguments to your static Regex call (good catch, Martin). I think you'll find that if you fix that, the instance (not-compiled) regex will beat the static calls every time. You should also find that, given my findings above, the compiled instance will beat the not-compiled one, too.

BUT: You should really read Jeff Atwood's post on compiled regexes before you go blindly applying that option to every regex you create.

P Daddy
Thank you for your explanations. The initial step does not seem to incur too much cost in my case (see the new results). I read Jeff Atwood's post before posting this, so I am aware of the drawbacks. In my case the Compiled option would help, although not that much in the standard use case.
A: 

If you constantly match the same string using the same pattern, that may explain why a cached version is slightly faster than a compiled version.

Vincent
+4  A: 

In the Regex.Match version you are looking for the input in the pattern. Try swapping the parameters around.

var m3 = Regex.Match(pattern, item); // Wrong
var m3 = Regex.Match(item, pattern); // Correct
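A quick way to see the difference is a sketch like the one below. The `ArgumentOrderDemo` class, its two helpers, and the simplified pattern are made up for illustration. With the arguments swapped, the file name is treated as the pattern and the pattern text as the input, so the call measures a different (and failing) match.

```csharp
using System;
using System.Text.RegularExpressions;

static class ArgumentOrderDemo
{
    // Correct order: input first, pattern second.
    public static bool Correct(string input, string pattern)
    {
        return Regex.IsMatch(input, pattern);
    }

    // Swapped order: `input` is compiled as the pattern and
    // `pattern` is the text being searched.
    public static bool Swapped(string input, string pattern)
    {
        return Regex.IsMatch(pattern, input);
    }

    static void Main()
    {
        string input = "file_123.ext";
        string pattern = @"_([0-9]+)\."; // hypothetical, simplified pattern

        Console.WriteLine(Correct(input, pattern)); // prints "True"
        Console.WriteLine(Swapped(input, pattern)); // prints "False"
    }
}
```

The swapped call still compiles and runs, because a file name like "file_123.ext" is itself a valid regular expression, which makes the mistake easy to miss.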
Martin Brown
Ah. :-) Thanks. I will update my post with the new results in a minute.