ansaurus

Question

.NET RegExp engine search performance optimization

Answer 1

+2 A:

Use compiled regular expressions for better performance

http://en.csharp-online.net/CSharp_Regular_Expression_Recipes—Compiling_Regular_Expressions

copy paste the full url, looks like there is rendering problem with this link.

Binoj Antony 2009-04-29 11:04:19

Wow, it solved part of my problem. Now same search takes 2.5 seconds

Daniil Harik 2009-04-29 11:12:01

Answer 2

A:

I got this crazy idea that you could also store the strings in reverse order and search those strings with /ba.*/ if the user enter /.*ab/.

Jonas Elfström 2009-04-29 11:04:56

But how long would it take to reverse the 35,000 target strings? ;)

Alan Moore 2009-05-01 04:07:43

Answer 3

A:

Your second expression will match 'ab' and all characters before it (except the new line). You can try searching only /ab/, get the index of the match and as a result concat the part of the string preceeding the match with match.

Marqus 2009-04-29 11:16:52

Answer 4

+3 A:

It’s how regular expressions are processed that makes them perform so different. To explain that based on your examples:

/.*ab/ This expression consists on two sub-expressions, the .* and the literal ab. This would be processed as follows: In the normal greedy mode, where every quantor and thus the match is expanded to its maximum, the .* will first match the whole string. Then it will try to match the following ab. But as it is not possible (we’re already at the end of the string) backtracking will be used to find the last point where both sub-expressions match. So the match of .* is reduced by one character and again the ab is tested. If the whole expression cannot be matched, the match of .* again will be reduced by one character until the whole expression is matched. In the worst case there is no ab in the string and the algorithm will do n+1 backtracks and additional tests for ab until it can determine that a match is impossible.
/ab.*/ This expression consists of two sub-expressions too. But here the order is changed, first the literal ab and than the .*. This is processed as follows: The algorithm first tries to find the literal ab by comparing one character by another. If there is a match, it tries to find a match for .* that is obvious easy.

The main difference between those two regular expressions is, that the second has the static part at the beginning and the variable part at the end. This makes no backtracking necessary.

Try some regular expression analysis tool such as RegexBuddy to see the difference visually.

Gumbo 2009-04-29 11:23:25

Great explanation, Gumbo!

Cerebrus 2009-04-29 11:38:54

ansaurus

tags:

views:

answers:

.NET RegExp engine search performance optimization

related questions