I'm generating regular expressions dynamically by walking an XML structure and building up the pattern from its node types. I use this regular expression as part of a Layout type that I defined. I then parse a text file that has an ID at the beginning of each line; this ID points me to a specific layout, and I try to match the data in that row against its regex.

Sounds fine and dandy, right? The only problem is that it matches strings extremely slowly. I have the expressions set as compiled to try to speed things up a bit, but to no avail. What is baffling is that these expressions aren't that complex. I am by no means a regex guru, but I know enough about them to get things going.

Here is the code that generates the expressions...

StringBuilder sb = new StringBuilder();
//get layout id and memberkey in there...
sb.Append(@"^([0-9]+)[ \t]{1,2}([0-9]+)"); 
foreach (ColumnDef c in columns)
{
    sb.Append(@"[ \t]{1,2}");
    switch (c.Variable.PrimType)
    {
        case PrimitiveType.BIT:
            sb.Append("(0|1)");
            break;
        case PrimitiveType.DATE:
            sb.Append(@"([0-9]{2}/[0-9]{2}/[0-9]{4})");
            break;
        case PrimitiveType.FLOAT:
            sb.Append(@"([-+]?[0-9]*\.?[0-9]+)");
            break;
        case PrimitiveType.INTEGER:
            sb.Append(@"([0-9]+)");
            break;
        case PrimitiveType.STRING:
            sb.Append(@"([a-zA-Z0-9]*)");
            break;
    }
}
sb.Append("$");
_pattern = new Regex(sb.ToString(), RegexOptions.Compiled);

The actual slow part...

public System.Text.RegularExpressions.Match Match(string input)
{
    if (input == null)
       throw new ArgumentNullException("input");

    return _pattern.Match(input);
}

A typical "_pattern" may have about 40-50 columns, so I'll spare you the entire pattern. I group each case so that I can enumerate over the captures in the Match object later on.
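Roughly, the captures get consumed later on like this (a simplified sketch; the mapping back to ColumnDef objects is omitted and the variable names here are illustrative):

Match m = layout.Match(line);
if (m.Success)
{
    string layoutId  = m.Groups[1].Value;  // first group: layout id
    string memberKey = m.Groups[2].Value;  // second group: member key
    for (int i = 0; i < columns.Count; i++)
    {
        // column captures start at group 3, one per ColumnDef, in order
        string rawValue = m.Groups[i + 3].Value;
        // ... convert rawValue according to columns[i].Variable.PrimType ...
    }
}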

Any tips or modifications that could drastically help? Or is this kind of slowness to be expected?

EDIT FOR CLARITY: Sorry, I don't think I was clear enough the first time around.

I use an XML file to generate a regex for each specific layout. I then run through a file for a data import. I need to make sure that each line in the file matches the pattern of the layout it says it's supposed to be. So each pattern could be checked against multiple times, possibly thousands.

+1  A: 

Having a potential of 50 match groups in a single expression is, by default, going to be a bit slow. I would do a few things to see if you can pin down the performance problem.

  1. Start by comparing a hard-coded pattern against the dynamically generated one and see if there is any performance difference (see the timing sketch below).
  2. Look at your requirements and see if there is any way you can reduce the number of groupings that you need to evaluate.
  3. Use a profiler tool if needed, such as ANTS Profiler, to see the location of the slowdown.
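A minimal timing harness along these lines could separate the two cases (the patterns, column count, and sample input below are illustrative, not the asker's actual layout):

using System;
using System.Diagnostics;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

class RegexTimingSketch
{
    // Average per-match cost of a pattern over many iterations.
    static double MsPerMatch(Regex pattern, string input, int iterations)
    {
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            pattern.Match(input);
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds / iterations;
    }

    static void Main()
    {
        // A short hand-written pattern...
        var hardCoded = new Regex(@"^([0-9]+)\t([0-9]+)\t([0-9]+)$", RegexOptions.Compiled);

        // ...versus a generated-style pattern with ~50 capturing groups,
        // mimicking the builder from the question.
        var sb = new StringBuilder(@"^([0-9]+)[ \t]{1,2}([0-9]+)");
        for (int i = 0; i < 50; i++)
            sb.Append(@"[ \t]{1,2}([0-9]+)");
        sb.Append("$");
        var generated = new Regex(sb.ToString(), RegexOptions.Compiled);

        string shortLine = "42\t1001\t7";
        string longLine = string.Join("\t",
            Enumerable.Range(1, 52).Select(n => n.ToString()).ToArray());

        Console.WriteLine("hard-coded: {0:F5} ms/match", MsPerMatch(hardCoded, shortLine, 10000));
        Console.WriteLine("generated:  {0:F5} ms/match", MsPerMatch(generated, longLine, 10000));
    }
}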
Mitchel Sellers
+2  A: 

Well, building the pattern using a StringBuilder will save a few cycles compared to concatenating strings.

A drastic optimization (one that can be visibly measured) is most likely going to come from doing this through some other method entirely.

Regular expressions are slow ... powerful, but slow. Parsing through a text file and then comparing each line against a regular expression just to retrieve the right bits of data is not going to be very quick (depending on the host computer and the size of the text file).

Perhaps storing the data in some format other than a large text file would make it more efficient to parse (use XML for that as well?), or perhaps a comma-separated list.

Ali Lown
It's for the data import, actually, which is why I'm using regexes: to make sure the imported data fits the format specified by the user.
Nicholas Mancuso
When you say it is slow, how slow do you mean? How large are these files? How many patterns are you comparing against? Presumably there are some fairly common patterns that could be handled with ifs and substrings; it would probably be more efficient (although less neat) to hard-code those into the app yourself rather than using regexes. By the way, if this functionality is being used regularly, there must be a more efficient way to store this data wherever it is going directly (a database?) and perform the validation there, on the input.
Ali Lown
+2  A: 

Regular expressions are expensive to create and even more expensive if you compile them. So the problem is that you are creating many regular expressions but using each one only once.

You should cache them for reuse, and really don't compile them unless you are going to use them very often. I have never measured it, but I could imagine you would have to use a simple regular expression well over 100 times to outweigh the cost of the compilation.
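A cache keyed by layout id is one way to do this (a sketch; the BuildPatternFor helper is a stand-in for the asker's XML-driven generator, not code from the question):

using System.Collections.Generic;
using System.Text.RegularExpressions;

class LayoutRegexCache
{
    private readonly Dictionary<int, Regex> _cache = new Dictionary<int, Regex>();

    // Builds each layout's pattern at most once and reuses it for every
    // subsequent line that references the same layout id.
    public Regex GetPattern(int layoutId)
    {
        Regex pattern;
        if (!_cache.TryGetValue(layoutId, out pattern))
        {
            pattern = new Regex(BuildPatternFor(layoutId), RegexOptions.Compiled);
            _cache[layoutId] = pattern;
        }
        return pattern;
    }

    private string BuildPatternFor(int layoutId)
    {
        // ... walk the XML layout definition and build the pattern string ...
        throw new System.NotImplementedException();
    }
}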

Performance test

  • Regex: "^(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?\.)+(?:[a-z]{2}|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum)$"

  • Input: "www.stackoverflow.com"

  • Results in milliseconds per iteration

    • one regex, compiled, 10,000 iterations: 0.0018 ms
    • one regex, not compiled, 10,000 iterations: 0.0021 ms
    • one regex per iteration, not compiled, 10,000 iterations: 0.0287 ms
    • one regex per iteration, compiled, 10,000 iterations: 4.8144 ms

Note that even after 10,000 iterations the compiled and uncompiled regexes are still very close together in performance. With an increasing number of iterations the compiled regex pulls ahead.

  • one regex, compiled, 1,000,000 iterations: 0.00137 ms
  • one regex, not compiled, 1,000,000 iterations: 0.00225 ms
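A harness in this spirit would look something like the following (the pattern is the one from the test above; the exact numbers will vary by machine):

using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

class CompileCostSketch
{
    static void Main()
    {
        string pattern = @"^(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?\.)+(?:[a-z]{2}|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum)$";
        string input = "www.stackoverflow.com";
        const int iterations = 10000;

        // Case 1: one compiled regex, reused for every match (cheap per match).
        var reused = new Regex(pattern, RegexOptions.Compiled);
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            reused.Match(input);
        sw.Stop();
        Console.WriteLine("one regex, compiled:     {0:F4} ms/iter",
            sw.Elapsed.TotalMilliseconds / iterations);

        // Case 2: a new compiled regex per iteration, paying the
        // construction and compilation cost every single time.
        sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            new Regex(pattern, RegexOptions.Compiled).Match(input);
        sw.Stop();
        Console.WriteLine("per-iteration, compiled: {0:F4} ms/iter",
            sw.Elapsed.TotalMilliseconds / iterations);
    }
}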
Daniel Brückner
Perhaps I needed to explain a bit better. I'm not using them only once: any time a line in the parsed file points to a specific layout, I check that the line matches the pattern for that layout.
Nicholas Mancuso
Precompiling, even for single usage, in my testing yields consistently better regex performance. It's not much for single use, but there isn't a performance hit.
patjbs
So you create one regular expression per layout and use that one whenever you find a corresponding line, right?
Daniel Brückner
That is correct.
Nicholas Mancuso
So after I tested ... there is definitely a performance hit compiling a regex for a single use - it's more than 100 times slower in my test.
Daniel Brückner
+4  A: 

Some performance thoughts:

  • use [01] instead of (0|1)
  • use non-capturing groups (?:expr) instead of capturing groups where you only need grouping, not the captured value (see the sketch below)
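Folded into the question's pattern builder, those tweaks might look like this (a sketch; it assumes the PrimitiveType enum from the question, and the FLOAT case uses a non-capturing inner group purely as an illustration):

using System;

static class PatternFragments
{
    // Per-type pattern fragments with the tweaks applied: a character class
    // for BIT instead of the (0|1) alternation, and a non-capturing inner
    // group for FLOAT since that group's value is never read back.
    public static string For(PrimitiveType type)
    {
        switch (type)
        {
            case PrimitiveType.BIT:     return "([01])";
            case PrimitiveType.DATE:    return @"([0-9]{2}/[0-9]{2}/[0-9]{4})";
            case PrimitiveType.FLOAT:   return @"([-+]?(?:[0-9]*\.)?[0-9]+)";
            case PrimitiveType.INTEGER: return "([0-9]+)";
            case PrimitiveType.STRING:  return "([a-zA-Z0-9]*)";
            default: throw new ArgumentOutOfRangeException("type");
        }
    }
}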


Edit   As it seems that your values are separated by whitespace, why don’t you split it up there?

Gumbo
Yeah, it may actually be more beneficial to keep a list of smaller regexes per layout, split the string on '\t', and then go down the list matching each one.
Nicholas Mancuso
+7  A: 

You are parsing a 50-column CSV file (that uses tabs) with a regex?

You should just remove duplicate tabs, then split the text on \t. Now you have all of your columns in an array. You can use your ColumnDef object collection to tell you what each column is.

Edit: Once you have things split up, you could optionally use a regex to verify each value; this should be much faster than using the giant single regex.

Edit2: You also get the additional benefit of knowing exactly which column(s) are badly formatted, so you can produce an error like "Syntax error in column 30 on line 12: expected date format."
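A sketch of that split-then-validate approach (it assumes the PrimitiveType enum and the two leading id fields from the question; the error format is just an example):

using System;
using System.Text.RegularExpressions;

static class LineValidator
{
    // One small, reusable pattern per primitive type instead of one giant regex.
    static readonly Regex Bit     = new Regex("^[01]$", RegexOptions.Compiled);
    static readonly Regex Date    = new Regex(@"^[0-9]{2}/[0-9]{2}/[0-9]{4}$", RegexOptions.Compiled);
    static readonly Regex Float   = new Regex(@"^[-+]?[0-9]*\.?[0-9]+$", RegexOptions.Compiled);
    static readonly Regex Integer = new Regex("^[0-9]+$", RegexOptions.Compiled);
    static readonly Regex Str     = new Regex("^[a-zA-Z0-9]*$", RegexOptions.Compiled);

    // Splits a line on whitespace and validates each field against its
    // column's type, reporting exactly which column failed.
    public static void Validate(string line, int lineNumber, PrimitiveType[] schema)
    {
        string[] fields = line.Split(new[] { '\t', ' ' }, StringSplitOptions.RemoveEmptyEntries);

        // The first two fields are the layout id and member key.
        if (fields.Length != schema.Length + 2)
            throw new FormatException(
                string.Format("Wrong field count on line {0}.", lineNumber));

        for (int i = 0; i < schema.Length; i++)
        {
            Regex check;
            switch (schema[i])
            {
                case PrimitiveType.BIT:     check = Bit; break;
                case PrimitiveType.DATE:    check = Date; break;
                case PrimitiveType.FLOAT:   check = Float; break;
                case PrimitiveType.INTEGER: check = Integer; break;
                default:                    check = Str; break;
            }

            if (!check.IsMatch(fields[i + 2]))
                throw new FormatException(string.Format(
                    "Syntax error in column {0} on line {1}: '{2}'.",
                    i + 1, lineNumber, fields[i + 2]));
        }
    }
}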

JasonMArcher
I'm beginning to think this will be much faster and simpler as well.
Nicholas Mancuso
This is probably the best solution presented so far. I use dozens of complex regexes on a daily basis (processing publishing text and XML). IME, once your regexes reach a certain "critical mass" of complexity, performance goes down the tubes. Splitting this problem up into smaller chunks will be your way around this bottleneck.
patjbs
+1  A: 

I would just build a lexer by hand.

In this case it looks like you have a bunch of fields separated by tabs, with each record terminated by a newline. The XML file appears to describe the sequence of columns and their types.

Writing code to recognize each case by hand is probably 5-10 lines of code in the worst case.

You would then simply generate an array of PrimitiveType[] values from the XML file and call the "GetValues" function below.

This should allow you to make a single pass through the input stream, which should give a big boost over using regexes.

You'll need to supply the "ScanXYZ" methods yourself. They should be easy to write, and it's best to implement them without using regexes.

public IEnumerable<object[]> GetValues(TextReader reader, PrimitiveType[] schema)
{
    // Peek() returns -1 once the input is exhausted.
    while (reader.Peek() >= 0)
    {
        var values = new object[schema.Length];

        for (int i = 0; i < schema.Length; ++i)
        {
            switch (schema[i])
            {
                case PrimitiveType.BIT:
                    values[i] = ScanBit(reader);
                    break;
                case PrimitiveType.DATE:
                    values[i] = ScanDate(reader);
                    break;
                case PrimitiveType.FLOAT:
                    values[i] = ScanFloat(reader);
                    break;
                case PrimitiveType.INTEGER:
                    values[i] = ScanInt(reader);
                    break;
                case PrimitiveType.STRING:
                    values[i] = ScanString(reader);
                    break;
            }

            EatTabs(reader);

            // Stop scanning fields early if the line has already ended.
            if (reader.Peek() == '\n')
            {
                break;
            }
        }

        if (reader.Peek() == '\n')
        {
            reader.Read(); // consume the record-terminating newline
        }
        else if (reader.Peek() >= 0)
        {
            throw new Exception("Extra junk detected!");
        }

        yield return values;
    }
}
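For illustration, two of those helpers might look like this (a sketch, not the answerer's code; they would sit in the same class as GetValues and need using System, System.IO, and System.Text):

// Skips over any run of tab characters.
static void EatTabs(TextReader reader)
{
    while (reader.Peek() == '\t')
        reader.Read();
}

// Reads a run of consecutive digits and returns it as an int.
static int ScanInt(TextReader reader)
{
    var sb = new StringBuilder();
    while (reader.Peek() >= '0' && reader.Peek() <= '9')
        sb.Append((char)reader.Read());
    if (sb.Length == 0)
        throw new FormatException("Expected an integer.");
    return int.Parse(sb.ToString());
}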
Scott Wisniewski