I'm generating regular expressions dynamically by walking an XML structure and building up the pattern from its node types. I use this regular expression as part of a Layout type that I defined. I then parse a text file that has an ID at the beginning of each line; this ID points me to a specific layout, and I try to match the data in that row against its regex.

Sounds fine and dandy, right? The only problem is that it matches strings extremely slowly. I have the expressions set as compiled to try to speed things up a bit, but to no avail. What is baffling is that these expressions aren't that complex. I am by no means a regex guru, but I know enough about them to get things going.

Here is the code that generates the expressions...

StringBuilder sb = new StringBuilder();
//get layout id and memberkey in there...
sb.Append(@"^([0-9]+)[ \t]{1,2}([0-9]+)"); 
foreach (ColumnDef c in columns)
{
    sb.Append(@"[ \t]{1,2}");
    switch (c.Variable.PrimType)
    {
        case PrimitiveType.BIT:
            sb.Append("(0|1)");
            break;
        case PrimitiveType.DATE:
            sb.Append(@"([0-9]{2}/[0-9]{2}/[0-9]{4})");
            break;
        case PrimitiveType.FLOAT:
            sb.Append(@"([-+]?[0-9]*\.?[0-9]+)");
            break;
        case PrimitiveType.INTEGER:
            sb.Append(@"([0-9]+)");
            break;
        case PrimitiveType.STRING:
            sb.Append(@"([a-zA-Z0-9]*)");
            break;
    }
}
sb.Append("$");
_pattern = new Regex(sb.ToString(), RegexOptions.Compiled);

The actual slow part...

public System.Text.RegularExpressions.Match Match(string input)
{
    if (input == null)
       throw new ArgumentNullException("input");

    return _pattern.Match(input);
}

A typical "_pattern" may have about 40-50 columns, so I'll spare you the entire pattern. I group each case so that I can enumerate over the captures in the Match object later on.
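Roughly, the captures get consumed later on like this (a simplified sketch; the mapping back to ColumnDef objects is omitted and the variable names here are illustrative):

Match m = layout.Match(line);
if (m.Success)
{
    string layoutId  = m.Groups[1].Value;  // first group: layout id
    string memberKey = m.Groups[2].Value;  // second group: member key
    for (int i = 0; i < columns.Count; i++)
    {
        // column captures start at group 3, one per ColumnDef, in order
        string rawValue = m.Groups[i + 3].Value;
        // ... convert rawValue according to columns[i].Variable.PrimType ...
    }
}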

Any tips or modifications that could drastically help? Or is this kind of slowness to be expected?

EDIT FOR CLARITY: Sorry, I don't think I was clear enough the first time around.

I use an XML file to generate a regex for each specific layout. I then run through a file for a data import. I need to make sure that each line in the file matches the pattern of the layout it says it's supposed to be. So each pattern could be checked against multiple times, possibly thousands.

+1  A: 

Having a potential of 50 match groups in a single expression is, by default, going to be a bit slow. I would do a few things to see if you can pin down the performance problem.

  1. Start by comparing a hard-coded pattern against the dynamically generated one and see if there is any performance difference (see the timing sketch below).
  2. Look at your requirements and see if there is any way you can reduce the number of groupings that you need to evaluate.
  3. Use a profiler tool if needed, such as ANTS Profiler, to see the location of the slowdown.
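A minimal timing harness along these lines could separate the two cases (the patterns, column count, and sample input below are illustrative, not the asker's actual layout):

using System;
using System.Diagnostics;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

class RegexTimingSketch
{
    // Average per-match cost of a pattern over many iterations.
    static double MsPerMatch(Regex pattern, string input, int iterations)
    {
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            pattern.Match(input);
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds / iterations;
    }

    static void Main()
    {
        // A short hand-written pattern...
        var hardCoded = new Regex(@"^([0-9]+)\t([0-9]+)\t([0-9]+)$", RegexOptions.Compiled);

        // ...versus a generated-style pattern with ~50 capturing groups,
        // mimicking the builder from the question.
        var sb = new StringBuilder(@"^([0-9]+)[ \t]{1,2}([0-9]+)");
        for (int i = 0; i < 50; i++)
            sb.Append(@"[ \t]{1,2}([0-9]+)");
        sb.Append("$");
        var generated = new Regex(sb.ToString(), RegexOptions.Compiled);

        string shortLine = "42\t1001\t7";
        string longLine = string.Join("\t",
            Enumerable.Range(1, 52).Select(n => n.ToString()).ToArray());

        Console.WriteLine("hard-coded: {0:F5} ms/match", MsPerMatch(hardCoded, shortLine, 10000));
        Console.WriteLine("generated:  {0:F5} ms/match", MsPerMatch(generated, longLine, 10000));
    }
}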
Mitchel Sellers
+2  A: 

Well, building the pattern using a StringBuilder will save a few cycles compared to concatenating strings.

A drastic optimization (one that can be visibly measured) is most likely going to come from doing this through some other method entirely.

Regular expressions are slow ... powerful, but slow. Parsing through a text file and then comparing each line against a regular expression just to retrieve the right bits of data is not going to be very quick (depending on the host computer and the size of the text file).

Perhaps storing the data in some format other than a large text file would make it more efficient to parse (use XML for that as well?), or perhaps a comma-separated list.

Ali Lown
It's for the data import, actually, which is why I'm using regexes: to make sure the imported data fits the format specified by the user.
Nicholas Mancuso
When you say it is slow, how slow do you mean? How large are these files? How many patterns are you comparing against? Presumably there are some fairly common patterns that could be handled with ifs and substrings; it would probably be more efficient (although less neat) to hard-code those into the app yourself rather than using regexes. By the way, if this functionality is being used regularly, there must be a more efficient way to store this data wherever it is going directly (a database?) and perform the validation there, on the input.
Ali Lown
+2  A: 

Regular expressions are expensive to create and even more expensive if you compile them. So the problem is that you are creating many regular expressions but using each one only once.

You should cache them for reuse, and really don't compile them unless you are going to use them very often. I have never measured it, but I could imagine you would have to use a simple regular expression well over 100 times to outweigh the cost of the compilation.
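A cache keyed by layout id is one way to do this (a sketch; the BuildPatternFor helper is a stand-in for the asker's XML-driven generator, not code from the question):

using System.Collections.Generic;
using System.Text.RegularExpressions;

class LayoutRegexCache
{
    private readonly Dictionary<int, Regex> _cache = new Dictionary<int, Regex>();

    // Builds each layout's pattern at most once and reuses it for every
    // subsequent line that references the same layout id.
    public Regex GetPattern(int layoutId)
    {
        Regex pattern;
        if (!_cache.TryGetValue(layoutId, out pattern))
        {
            pattern = new Regex(BuildPatternFor(layoutId), RegexOptions.Compiled);
            _cache[layoutId] = pattern;
        }
        return pattern;
    }

    private string BuildPatternFor(int layoutId)
    {
        // ... walk the XML layout definition and build the pattern string ...
        throw new System.NotImplementedException();
    }
}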

Performance test

  • Regex: "^(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?\.)+(?:[a-z]{2}|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum)$"

  • Input: "www.stackoverflow.com"

  • Results in milliseconds per iteration

    • one regex, compiled, 10,000 iterations: 0.0018 ms
    • one regex, not compiled, 10,000 iterations: 0.0021 ms
    • one regex per iteration, not compiled, 10,000 iterations: 0.0287 ms
    • one regex per iteration, compiled, 10,000 iterations: 4.8144 ms

Note that even after 10,000 iterations the compiled and uncompiled regexes are still very close together in performance. With an increasing number of iterations the compiled regex pulls ahead.

  • one regex, compiled, 1,000,000 iterations: 0.00137 ms
  • one regex, not compiled, 1,000,000 iterations: 0.00225 ms
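A harness in this spirit would look something like the following (the pattern is the one from the test above; the exact numbers will vary by machine):

using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

class CompileCostSketch
{
    static void Main()
    {
        string pattern = @"^(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?\.)+(?:[a-z]{2}|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum)$";
        string input = "www.stackoverflow.com";
        const int iterations = 10000;

        // Case 1: one compiled regex, reused for every match (cheap per match).
        var reused = new Regex(pattern, RegexOptions.Compiled);
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            reused.Match(input);
        sw.Stop();
        Console.WriteLine("one regex, compiled:     {0:F4} ms/iter",
            sw.Elapsed.TotalMilliseconds / iterations);

        // Case 2: a new compiled regex per iteration, paying the
        // construction and compilation cost every single time.
        sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            new Regex(pattern, RegexOptions.Compiled).Match(input);
        sw.Stop();
        Console.WriteLine("per-iteration, compiled: {0:F4} ms/iter",
            sw.Elapsed.TotalMilliseconds / iterations);
    }
}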
Daniel Brückner
Perhaps I needed to explain a bit better. I'm not using them only once: any time a line in the parsed file points to a specific layout, I check that the line matches the pattern for that layout.
Nicholas Mancuso
Precompiling, even for single usage, in my testing yields consistently better regex performance. It's not much for single use, but there isn't a performance hit.
patjbs
So you create one regular expression per layout and use that one whenever you find a corresponding line, right?
Daniel Brückner
That is correct.
Nicholas Mancuso
So after I tested ... there is definitely a performance hit compiling a regex for a single use - it's more than 100 times slower in my test.
Daniel Brückner
+4  A: 

Some performance thoughts:

  • use [01] instead of (0|1)
  • use non-capturing groups (?:expr) instead of capturing groups where you only need grouping, not the captured value (see the sketch below)
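Folded into the question's pattern builder, those tweaks might look like this (a sketch; it assumes the PrimitiveType enum from the question, and the FLOAT case uses a non-capturing inner group purely as an illustration):

using System;

static class PatternFragments
{
    // Per-type pattern fragments with the tweaks applied: a character class
    // for BIT instead of the (0|1) alternation, and a non-capturing inner
    // group for FLOAT since that group's value is never read back.
    public static string For(PrimitiveType type)
    {
        switch (type)
        {
            case PrimitiveType.BIT:     return "([01])";
            case PrimitiveType.DATE:    return @"([0-9]{2}/[0-9]{2}/[0-9]{4})";
            case PrimitiveType.FLOAT:   return @"([-+]?(?:[0-9]*\.)?[0-9]+)";
            case PrimitiveType.INTEGER: return "([0-9]+)";
            case PrimitiveType.STRING:  return "([a-zA-Z0-9]*)";
            default: throw new ArgumentOutOfRangeException("type");
        }
    }
}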


Edit   As it seems that your values are separated by whitespace, why don’t you split it up there?

Gumbo
Yeah, it may actually be more beneficial to keep a list of smaller regexes per layout, split the string on '\t', and then go down the list matching each one.
Nicholas Mancuso
+7  A: 

You are parsing a 50-column CSV file (that uses tabs) with a regex?

You should just remove duplicate tabs, then split the text on \t. Now you have all of your columns in an array. You can use your ColumnDef object collection to tell you what each column is.

Edit: Once you have things split up, you could optionally use a regex to verify each value; this should be much faster than using the giant single regex.

Edit2: You also get the additional benefit of knowing exactly which column(s) are badly formatted, so you can produce an error like "Syntax error in column 30 on line 12: expected date format."
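A sketch of that split-then-validate approach (it assumes the PrimitiveType enum and the two leading id fields from the question; the error format is just an example):

using System;
using System.Text.RegularExpressions;

static class LineValidator
{
    // One small, reusable pattern per primitive type instead of one giant regex.
    static readonly Regex Bit     = new Regex("^[01]$", RegexOptions.Compiled);
    static readonly Regex Date    = new Regex(@"^[0-9]{2}/[0-9]{2}/[0-9]{4}$", RegexOptions.Compiled);
    static readonly Regex Float   = new Regex(@"^[-+]?[0-9]*\.?[0-9]+$", RegexOptions.Compiled);
    static readonly Regex Integer = new Regex("^[0-9]+$", RegexOptions.Compiled);
    static readonly Regex Str     = new Regex("^[a-zA-Z0-9]*$", RegexOptions.Compiled);

    // Splits a line on whitespace and validates each field against its
    // column's type, reporting exactly which column failed.
    public static void Validate(string line, int lineNumber, PrimitiveType[] schema)
    {
        string[] fields = line.Split(new[] { '\t', ' ' }, StringSplitOptions.RemoveEmptyEntries);

        // The first two fields are the layout id and member key.
        if (fields.Length != schema.Length + 2)
            throw new FormatException(
                string.Format("Wrong field count on line {0}.", lineNumber));

        for (int i = 0; i < schema.Length; i++)
        {
            Regex check;
            switch (schema[i])
            {
                case PrimitiveType.BIT:     check = Bit; break;
                case PrimitiveType.DATE:    check = Date; break;
                case PrimitiveType.FLOAT:   check = Float; break;
                case PrimitiveType.INTEGER: check = Integer; break;
                default:                    check = Str; break;
            }

            if (!check.IsMatch(fields[i + 2]))
                throw new FormatException(string.Format(
                    "Syntax error in column {0} on line {1}: '{2}'.",
                    i + 1, lineNumber, fields[i + 2]));
        }
    }
}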

JasonMArcher
I'm beginning to think this will be much faster and simpler as well.
Nicholas Mancuso
This is probably the best solution presented so far. I use dozens of complex regexes on a daily basis (processing publishing text and XML). IME, once your regexes reach a certain "critical mass" of complexity, performance goes down the tubes. Splitting this problem up into smaller chunks will be your way around this bottleneck.
patjbs
+1  A: 

I would just build a lexer by hand.

In this case it looks like you have a bunch of fields separated by tabs, with each record terminated by a newline. The XML file appears to describe the sequence of columns and their types.

Writing code to recognize each case by hand is probably 5-10 lines of code in the worst case.

You would then simply generate an array of PrimitiveType[] values from the XML file and call the "GetValues" function below.

This should allow you to make a single pass through the input stream, which should give a big boost over using regexes.

You'll need to supply the "ScanXYZ" methods yourself. They should be easy to write, and it's best to implement them without using regexes.

public IEnumerable<object[]> GetValues(TextReader reader, PrimitiveType[] schema)
{
    // Peek() returns -1 once the input is exhausted.
    while (reader.Peek() >= 0)
    {
        var values = new object[schema.Length];

        for (int i = 0; i < schema.Length; ++i)
        {
            switch (schema[i])
            {
                case PrimitiveType.BIT:
                    values[i] = ScanBit(reader);
                    break;
                case PrimitiveType.DATE:
                    values[i] = ScanDate(reader);
                    break;
                case PrimitiveType.FLOAT:
                    values[i] = ScanFloat(reader);
                    break;
                case PrimitiveType.INTEGER:
                    values[i] = ScanInt(reader);
                    break;
                case PrimitiveType.STRING:
                    values[i] = ScanString(reader);
                    break;
            }

            EatTabs(reader);

            // Stop scanning fields early if the line has already ended.
            if (reader.Peek() == '\n')
            {
                break;
            }
        }

        if (reader.Peek() == '\n')
        {
            reader.Read(); // consume the record-terminating newline
        }
        else if (reader.Peek() >= 0)
        {
            throw new Exception("Extra junk detected!");
        }

        yield return values;
    }
}
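For illustration, two of those helpers might look like this (a sketch, not the answerer's code; they would sit in the same class as GetValues and need using System, System.IO, and System.Text):

// Skips over any run of tab characters.
static void EatTabs(TextReader reader)
{
    while (reader.Peek() == '\t')
        reader.Read();
}

// Reads a run of consecutive digits and returns it as an int.
static int ScanInt(TextReader reader)
{
    var sb = new StringBuilder();
    while (reader.Peek() >= '0' && reader.Peek() <= '9')
        sb.Append((char)reader.Read());
    if (sb.Length == 0)
        throw new FormatException("Expected an integer.");
    return int.Parse(sb.ToString());
}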
Scott Wisniewski