I've got a working regex that scans a chunk of text for a list of keywords defined in a db. I dynamically create my regex from the db to get this:

\b(?:keywords|from|database|with|esc\@ped|characters|\@ss|gr\@ss)\b

Notice that special characters are escaped. This works for the vast majority of cases, EXCEPT where the first character of the keyword is a regex special character like @ or $. So in the above example, @ss will not be matched, but gr@ss and esc@ped will.

Any ideas how to get this regex to work for these special cases? I've tried both with and without escaping the special characters in the regex string, but to no avail.

Thanks in advance,

David

+2  A: 

When you get the keywords from the database, escape them with Regex.Escape before creating the Regex string.
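As a rough sketch of what that could look like (the class and method names here are made up for illustration and are not from the thread), each keyword is passed through Regex.Escape and the results are joined into one alternation group:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class KeywordPattern
{
    // Hypothetical helper: escapes each keyword and joins them with '|'
    // inside a non-capturing group.
    public static string BuildAlternation(string[] keywords)
    {
        return "(?:" + string.Join("|", keywords.Select(Regex.Escape)) + ")";
    }
}
```

For example, `KeywordPattern.BuildAlternation(new[] { "gr@ss", "c++", "$5" })` gives `(?:gr@ss|c\+\+|\$5)`. Only regex metacharacters such as `+` and `$` are escaped; `@` is left alone because it has no special meaning in a .NET pattern.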

hjb417
That does not escape @
asgerhallas
Really good call asgerhallas. Why is the @ being escaped in the 1st place?
hjb417
I was escaping a list of characters that seemed to be causing a problem - I also tried without escaping them. Regex.Escape escapes the reserved metacharacters only - but the regex still doesn't match strings beginning with characters like @
David Conlisk
+1  A: 

The @ is not a word character, so there is no word boundary (\b) between whitespace (or the start of the string) and a leading @.
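A small check can make this visible. This is only a sketch, and `WordBoundaryDemo.Found` is a made-up helper name, not something from the thread:

```csharp
using System;
using System.Text.RegularExpressions;

static class WordBoundaryDemo
{
    // Hypothetical helper: wraps a single escaped keyword in \b...\b,
    // the same way the original pattern wraps its alternation.
    public static bool Found(string text, string keyword)
    {
        return Regex.IsMatch(text, @"\b" + Regex.Escape(keyword) + @"\b");
    }
}
```

`Found("mow the gr@ss", "gr@ss")` is true, because \b matches between the space and the g. `Found("an @ss here", "@ss")` is false, because a space followed by @ is two non-word characters in a row. `Found("l@ss", "@ss")` is true, because \b does match between the l and the @, which is probably not wanted either.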

Use: (\s|^)(?:keywords|from|database|with|esc@ped|characters|@ss|gr@ss)(\s|$)

Tested with the following program:

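    // Requires: using System; using System.Text.RegularExpressions;
    static void Main(string[] args)
    {
        string pattern = "(\\s|^)(?:keywords|from|database|with|esc@ped|characters|@ss|gr@ss)(\\s|$)";
        var matches = Regex.Matches("@ss is gr@ss is esc@ped keywordsnospace keywords", pattern);
        foreach (Match match in matches)
        {
            // The keyword alternation is non-capturing, so print the whole match
            // minus the surrounding whitespace consumed by (\s|^) and (\s|$).
            Console.WriteLine(match.Value.Trim());
        }
    }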

Giving the result:

@ss

gr@ss

esc@ped

keywords

asgerhallas
That doesn't seem to work. It doesn't match "keywords" or "gr@ss" or "@ss". Any other ideas?
David Conlisk
Hmm. I just tried that. It worked. Two seconds, I'll try again.
asgerhallas
Sorry, it's: (\\s|^)(?:keywords|from|database|with|esc@ped|characters|@ss|gr@ss) Updating the answer.
asgerhallas
It still has one problem though, no boundary needed at the end now. Working on it :)
asgerhallas
Tim Pietzcker's answer is the correct one. I was not fast enough :(
asgerhallas
+4  A: 
new Regex(@"(?<=^|\W)(?:keywords|from|database|with|esc@ped|characters|@ss|gr@ss)(?=\W|$)")

will match. It checks whether there is a non-word character (or beginning/end of string) before/after the keyword to be matched. I chose \W over \s because of punctuation and other non-word characters that might constitute a word boundary.

Edit: Even better (thanks to Alan Moore! - both versions will produce the same results):

new Regex(@"(?<!\w)(?:keywords|from|database|with|esc@ped|characters|@ss|gr@ss)(?!\w)")

Both will fail to match @ss in l@ss, which is probably what you want.
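Since the keywords come from a database, the lookaround version could be assembled at runtime along these lines. This is a minimal sketch: `KeywordMatcher.FindKeywords` is a hypothetical name, and it assumes the keywords arrive as plain, unescaped strings:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class KeywordMatcher
{
    // Hypothetical helper: returns every keyword occurrence that is not
    // directly preceded or followed by a word character.
    public static List<string> FindKeywords(string text, IEnumerable<string> keywords)
    {
        // Escape each keyword so regex metacharacters are taken literally.
        string alternation = string.Join("|", keywords.Select(Regex.Escape));
        var regex = new Regex(@"(?<!\w)(?:" + alternation + @")(?!\w)");
        return regex.Matches(text).Cast<Match>().Select(m => m.Value).ToList();
    }
}
```

Calling `FindKeywords("@ss is gr@ss, not l@ss or keywordsnospace", new[] { "keywords", "@ss", "gr@ss" })` returns `@ss` and `gr@ss`. The `@ss` inside `l@ss` is skipped because it is preceded by a word character, and `keywords` is skipped because it is followed by one.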

Tim Pietzcker
Yes that seems to match the same as @"(\b|^)(?:keywords|from|database|with|esc@ped|characters|@ss|gr@ss)(\b|$)"
asgerhallas
No, it doesn't. Your regex matches `@ss` only at the beginning of the string.
Tim Pietzcker
I'm sorry. You're right. Your answer is the correct one. Dismiss mine please :)
asgerhallas
That's perfect, thanks very much Tim! That one had us stumped for quite a while.
David Conlisk
Can also be written as `(?<!\w)...(?!\w)` (which I think is more expressive as well as more concise).
Alan Moore
Absolutely! Thanks - I have included your version above since it really is much better.
Tim Pietzcker