views:

297

answers:

4

Hello,

I have List collection with around 35,000 strings

Typical string looks like this:

"<i>füüs</i>ampri tähis;lüh ld-st<i>anno</i>, aastal;<i>maj</i> lüh pr-st<i>argent</i>, raha (kursisedelitel)"

Basically this string contains bunch of words in Estonian :)

I need to allow user to perform RegExp search on 35,000 strings

If I perform search using /ab.*/ expression, then search takes less than a second

If I perform search using /.*ab/ expression, then search takes around 10 seconds

My question is: How can I make second search faster (less then 1.5 seconds)?

Thank You very much

+2  A: 

Use compiled regular expressions for better performance

http://en.csharp-online.net/CSharp_Regular_Expression_Recipes—Compiling_Regular_Expressions

copy paste the full url, looks like there is rendering problem with this link.

Binoj Antony
Wow, it solved part of my problem. Now same search takes 2.5 seconds
Daniil Harik
A: 

I got this crazy idea that you could also store the strings in reverse order and search those strings with /ba.*/ if the user enter /.*ab/.

Jonas Elfström
But how long would it take to reverse the 35,000 target strings? ;)
Alan Moore
A: 

Your second expression will match 'ab' and all characters before it (except the new line). You can try searching only /ab/, get the index of the match and as a result concat the part of the string preceeding the match with match.

Marqus
+3  A: 

It’s how regular expressions are processed that makes them perform so different. To explain that based on your examples:

  • /.*ab/   This expression consists on two sub-expressions, the .* and the literal ab. This would be processed as follows: In the normal greedy mode, where every quantor and thus the match is expanded to its maximum, the .* will first match the whole string. Then it will try to match the following ab. But as it is not possible (we’re already at the end of the string) backtracking will be used to find the last point where both sub-expressions match. So the match of .* is reduced by one character and again the ab is tested. If the whole expression cannot be matched, the match of .* again will be reduced by one character until the whole expression is matched. In the worst case there is no ab in the string and the algorithm will do n+1 backtracks and additional tests for ab until it can determine that a match is impossible.

  • /ab.*/   This expression consists of two sub-expressions too. But here the order is changed, first the literal ab and than the .*. This is processed as follows: The algorithm first tries to find the literal ab by comparing one character by another. If there is a match, it tries to find a match for .* that is obvious easy.

The main difference between those two regular expressions is, that the second has the static part at the beginning and the variable part at the end. This makes no backtracking necessary.

Try some regular expression analysis tool such as RegexBuddy to see the difference visually.

Gumbo
Great explanation, Gumbo!
Cerebrus