I am using a simple regular expression (in C#) to find a whole word within a block of text.

The word may appear at the beginning, at the end, or in the middle of the text, or of a sentence within the text.

The expression I have been using, \bword\b, has been working fine. However, if the word includes a special character (which has been escaped), it no longer works. The boundary is essential so that we do not pick up words such as vb.net as a match for .net.

Two examples that fail are:

\bc\#\b

\b\.net\b

I could replace the word boundary with a list of explicit checks (start of text, preceding whitespace, etc.); however, this is complex and can be slow when used on a large number of words.

A: 

It isn't a match because the escaped characters (# and .) are not word characters, so there isn't a word boundary (\b) between them and the adjacent whitespace, etc. Perhaps look for whitespace, beginning-of-line, end-of-line, etc. specifically?

Marc Gravell
I have tried this and it does work, but it is considerably slower when run for a few hundred words against a few hundred paragraphs of text.
John
+3  A: 

The \b matches the boundary between word characters and non-word characters, but won't match the boundary between two non-word characters.

For example, in the case of C# there's a boundary between the C (a word character) and the # (a non-word character) but not between the # and whatever comes after it (space, punctuation, end-of-string etc).
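To make the boundary behaviour concrete, here is a small sketch that runs the two failing patterns from the question against a few inputs. The class name `BoundaryDemo`, the helper `Matches`, and the sample strings are illustrative assumptions, not part of the original question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BoundaryDemo
{
    // Hypothetical helper: true if the pattern matches anywhere in the input (case-insensitive).
    public static bool Matches(string pattern, string input) =>
        Regex.IsMatch(input, pattern, RegexOptions.IgnoreCase);

    public static void Main()
    {
        // '#' followed by a space: two non-word characters, so the trailing \b fails.
        Console.WriteLine(Matches(@"\bc\#\b", "I write c# daily"));    // False

        // '#' followed by a letter: a boundary exists there, so the pattern matches.
        Console.WriteLine(Matches(@"\bc\#\b", "c#x"));                 // True

        // '.' preceded by a space: no boundary before the dot.
        Console.WriteLine(Matches(@"\b\.net\b", "I like .net a lot")); // False

        // '.' preceded by a letter: a boundary exists, so vb.net matches (the unwanted case).
        Console.WriteLine(Matches(@"\b\.net\b", "I like vb.net"));     // True
    }
}
```

Note that with a leading non-word character, \b ends up requiring a word character just before the term, which is the opposite of what a whole-word search wants.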

You can work around this problem as follows:

  • Use (?:^|\W) instead of \b at the beginning of the expression.
    For example, (?:^|\W)\.NET\b
    This will match either the start-of-string or a non-word character before the . character.
  • Use (?:\W|$) instead of \b at the end of the expression.
    For example, \bC#(?:\W|$)
    This will match either a non-word character or the end-of-string after the # character.
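The two replacements above can be sketched in C# as follows. This is a minimal illustration, not a tuned implementation: the helper `FindTerm` and the class name are hypothetical, and it applies both replacements at once for simplicity, with a capture group added so the term can be separated from the neighbouring character that \W consumes.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ConsumingBoundaryDemo
{
    // Hypothetical helper: wraps the escaped term in (?:^|\W) and (?:\W|$)
    // and captures the term itself in group 1.
    public static Match FindTerm(string term, string input)
    {
        string pattern = @"(?:^|\W)(" + Regex.Escape(term) + @")(?:\W|$)";
        return Regex.Match(input, pattern, RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        Match m = FindTerm(".net", "I like .net a lot");
        Console.WriteLine(m.Success);                 // True
        Console.WriteLine("[" + m.Value + "]");       // [ .net ]  (includes the surrounding spaces)
        Console.WriteLine(m.Groups[1].Value);         // .net

        // The 'b' before the dot is a word character, so vb.net is rejected.
        Console.WriteLine(FindTerm(".net", "I like vb.net").Success); // False

        // Start-of-string and end-of-string satisfy the ^ and $ alternatives.
        Console.WriteLine(FindTerm("c#", "c#").Success);              // True
    }
}
```

Because \W consumes the neighbouring character, the overall match is wider than the term; read group 1 when only the term is needed.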
LukeH
This worked. However, as it was considerably slower over a few thousand calls, I only run it when something needs to be escaped, which is rare. Thanks.
John
@John: You only need to use these workarounds if the non-word character is at the start or end of your search term. If the term contains only word characters, or if the non-word characters are buried somewhere in the middle of the term (for example, `f@t`), then using `\b` will be fine.
LukeH
+2  A: 

I would suggest negative lookarounds:

(?<!\w)c#(?!\w)

(?<!\w)\.net(?!\w)

That should be quicker than matching anchors or non-word characters, like (?:^|\W), plus you don't have to deal with the extraneous characters when it's the \W that matches.
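As a rough sketch of how this might look in C#, the helper below escapes an arbitrary term with `Regex.Escape` and wraps it in the two lookarounds. The names `LookaroundDemo` and `BuildWholeTermRegex` are made up for illustration, and the options chosen (`IgnoreCase`, `Compiled`) are assumptions rather than requirements.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaroundDemo
{
    // Hypothetical helper: escapes the term and surrounds it with negative lookarounds.
    public static Regex BuildWholeTermRegex(string term) =>
        new Regex(@"(?<!\w)" + Regex.Escape(term) + @"(?!\w)",
                  RegexOptions.IgnoreCase | RegexOptions.Compiled);

    public static void Main()
    {
        Regex dotNet = BuildWholeTermRegex(".net");
        Console.WriteLine(dotNet.IsMatch("I like .net a lot"));  // True
        Console.WriteLine(dotNet.IsMatch("I like vb.net"));      // False
        Console.WriteLine(dotNet.Match(".NET rocks").Value);     // .NET

        Regex cSharp = BuildWholeTermRegex("c#");
        // Lookarounds consume nothing, so occurrences sharing a separator are both found.
        Console.WriteLine(cSharp.Matches("c# c#").Count);        // 2
        Console.WriteLine(cSharp.IsMatch("c#x"));                // False
    }
}
```

The match value is exactly the term, with no neighbouring characters to trim. Whether it is actually faster on a given workload is worth measuring rather than assuming.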

Alan Moore
Good answer, better than my solution!
LukeH