Hi,

I am having a very strange problem. I have a very large regular expression that searches for certain words in some text. It looks something like (?i)\b(a|b|c|d...)\b, where a, b, c, d and so on stand for words. I put it in a precompiled assembly to speed things up a bit, but the precompiled regex does not behave the same way as the non-compiled version of the same regex... o_0

For example, if the regex is (?i)\b(he|desk)\b and I pass "helpdesk" through it, the precompiled version returns "lp": the words "he" and "desk" get stripped out as if the boundary condition were not working at all. If I use exactly the same regular expression in a non-precompiled version, it works just fine... Does anyone know what I may be missing?

Thanks

(Sorry, I am using VB.Net and C#.)

A: 

You've given two different languages there. So maybe there's some interaction.

In any case I think some short but complete test programs might be in order - try and reproduce the problem in some independent test code to make it easier to reason about.
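
A short test program along these lines might look like the sketch below. It is only an illustration: the class name `RegexModeCheck`, the helper `Strip`, and the sample inputs are made up here, and it compares the interpreted engine with `RegexOptions.Compiled` rather than with an assembly produced by `Regex.CompileToAssembly`, so it narrows the problem down rather than reproducing the exact setup from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexModeCheck
{
    // Pattern from the question. The verbatim string (@) keeps \b as a
    // regex word boundary instead of a C# backspace escape.
    public const string Pattern = @"(?i)\b(he|desk)\b";

    // Removes every match of the pattern using the given engine options.
    public static string Strip(string input, RegexOptions options)
    {
        return new Regex(Pattern, options).Replace(input, "");
    }

    public static void Main()
    {
        // Hypothetical sample inputs; add the words that misbehave for you.
        string[] samples = { "helpdesk", "He sat at the desk", "desktop help" };

        foreach (string s in samples)
        {
            string interpreted = Strip(s, RegexOptions.None);
            string compiled = Strip(s, RegexOptions.Compiled);

            // Any line reporting "same: False" is a reproducible difference.
            Console.WriteLine("{0} | interpreted: '{1}' | compiled: '{2}' | same: {3}",
                s, interpreted, compiled, interpreted == compiled);
        }
    }
}
```

If both modes agree on every sample, the next step would be to run the same inputs through the type generated by `Regex.CompileToAssembly` and compare again.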

Maybe it would be more efficient to do the search without regular expressions?

Douglas Leeder
A: 

I have written two test apps, one in C# and one in VB.Net, and both exhibit the same behaviour. It seems that the precompiled version of the regular expression I am using ignores the boundary conditions on some of the words. I have tried not using regex, but I have a list of 3000+ words in the expression and after some tests regex appears to be the best solution. The only thing is that I do not want it in my main code and would prefer to have it in a precompiled assembly...

Thanks

Serge
Regular expressions are *not* a feature of C# or VB.NET. They are part of the .NET framework and as such C# and VB.NET share the same implementation.
Brian Rasmussen
I understand that, though I found that some of the characters get interpreted differently, possibly due to encoding, so to be sure I tried them both... Any idea if I can use the regex functionality in C++?
Serge
If you prefix the string literal with @ in C#, the characters will be encoded in the same way as in VB.Net. So you should use Regex(@"(?i)\b(he|desk)\b") instead of Regex("(?i)\b(he|desk)\b"), as the \b will get encoded as a backspace in the second example.
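
A minimal sketch of that difference, with a made-up class name (`EscapeDemo`) and made-up sample text, might look like this:

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeDemo
{
    // Regular literal: the C# compiler turns each \b into the backspace
    // character (U+0008) before the regex engine ever sees the pattern.
    public const string Regular = "(?i)\b(he|desk)\b";

    // Verbatim literal: the backslash survives, so the regex engine
    // receives \b and treats it as a word boundary.
    public const string Verbatim = @"(?i)\b(he|desk)\b";

    public static void Main()
    {
        string text = "he went to the desk";

        Console.WriteLine(Regular.IndexOf('\u0008') >= 0);   // True
        Console.WriteLine(Regex.IsMatch(text, Regular));     // False
        Console.WriteLine(Regex.IsMatch(text, Verbatim));    // True
        Console.WriteLine(Regex.IsMatch("helpdesk", Verbatim)); // False
    }
}
```

The regular literal only matches input that actually contains backspace characters around the word, which is why it finds nothing in ordinary text.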
Martin Brown
+1  A: 

Since you are searching for whole words, how about searching for \w+, and checking if the word is in a collection. A hash-based set or a hash-map would work well here. This approach would make it easier to update the list if the need should arise.
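
As a rough sketch of this idea (the class `WordFilter`, the method `StripListedWords`, and the two-word list are placeholders, not code from the question), it could look something like this:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WordFilter
{
    // Placeholder word list; the real one would hold the 3000+ words.
    // The comparer makes lookups case-insensitive, like (?i) in the regex.
    private static readonly HashSet<string> Words =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "he", "desk" };

    // Matches each run of word characters, i.e. each whole word.
    private static readonly Regex WordPattern = new Regex(@"\w+");

    // Removes every whole word found in the set and leaves the rest as is.
    public static string StripListedWords(string input)
    {
        return WordPattern.Replace(input, m => Words.Contains(m.Value) ? "" : m.Value);
    }

    public static void Main()
    {
        Console.WriteLine(StripListedWords("helpdesk"));           // helpdesk
        Console.WriteLine(StripListedWords("He sat at the desk")); // " sat at the " (without the quotes)
    }
}
```

Because `\w+` always consumes a whole word, "helpdesk" is looked up as one token and never split into "he" and "desk". Updating the list then means changing the set, not rebuilding a large alternation.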

MizardX
I actually did not think of using `\w+`, but the only problem is that I am already suffering from OOM bugs and would really like to stay away from keeping anything as big as that in memory...
Serge
A: 

Guys, I am five days into this and still have no solution... I have tried lots of different things, but it all boils down to the fact that the same RegEx string works differently once it is precompiled. When it is compiled to an assembly using Regex.CompileToAssembly it doesn't work, unlike the non-compiled version, which is just fine. I am lost, so I decided not to bother with compiling and just use it in code straight from a string constant... I haven't noticed a massive performance loss, if any. So the moral of the story is: beware of compiled RegEx, as it may not work the same way as the non-compiled one. Watch out and test rigorously.

Serge
The trouble is that I have tried to reproduce this issue with the code and regex you gave in the original post and have been unable to. As such, I guess the issue is caused by the length of the 3000+ word RegEx.
Martin Brown