ansaurus

Question

Should I be able to quote a leading or trailing dollar sign ($) inside a word boundary in Java Regular Expression?

Answer 1

+1 A:

This is because \b matches word boundaries, and the position immediately before or after a $ character does not necessarily count as a word boundary.

A word boundary is a position between a \w character and a \W character (the start and end of the input count as \W for this purpose), and $ is not part of \w. Taking the string "bla$" as an example, the word boundaries are:

" b l a $ "
 ^----------- here

" b l a $ "
       ^----- here

" b l a $ "
         ^--- but not here
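The positions marked above can be checked with a short Java sketch. The class and method names (WordBoundaryDemo, boundaries, quotedTermMatches) are made up for illustration and are not part of the original question:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class WordBoundaryDemo {
    // Collects every index in the input at which \b matches.
    static List<Integer> boundaries(String input) {
        List<Integer> positions = new ArrayList<>();
        Matcher m = Pattern.compile("\\b").matcher(input);
        while (m.find()) {
            positions.add(m.start());
        }
        return positions;
    }

    // Wraps a literally quoted term in \b...\b and searches the text for it.
    static boolean quotedTermMatches(String term, String text) {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(term) + "\\b");
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        // Boundaries before 'b' (index 0) and between 'a' and '$' (index 3).
        System.out.println(boundaries("bla$"));                 // [0, 3]
        // No boundary after the trailing '$', so the second \b fails.
        System.out.println(quotedTermMatches("bla$", "bla$"));  // false
        System.out.println(quotedTermMatches("bla", "bla$"));   // true
    }
}
```

The second call shows why a term with a trailing $ cannot be found between two \b anchors when it sits at the end of the input: there is no word character on either side of the final position.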

Tomalak 2010-07-23 16:10:16

Answer 2

A:

Tomalak nailed it - it's about word boundary matching. I had figured it out and deleted the question, but Will's advice to keep it open for others is sound.

The \b was, in fact, the culprit.

One conclusion could be that for anything but the most rudimentary (i.e. ASCII) uses, the built-in convenience expressions from Java are effectively useless. E.g., \w only matches ASCII characters, \b is based on that, etc.
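The ASCII-only behaviour of \w is easy to reproduce. The sketch below assumes Java 7 or later, where the Pattern.UNICODE_CHARACTER_CLASS flag is available as an opt-in alternative; the class and method names are hypothetical:

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class AsciiWordDemo {
    // True if the whole string consists of \w characters under the given flags.
    static boolean isWord(String s, int flags) {
        return Pattern.compile("\\w+", flags).matcher(s).matches();
    }

    public static void main(String[] args) {
        String cafe = "caf\u00e9"; // "café", written with an escape to avoid encoding issues

        // Default flags: \w is [a-zA-Z_0-9], so the accented letter is rejected.
        System.out.println(isWord(cafe, 0));                                // false

        // Assumes Java 7+: the flag switches \w to its Unicode definition.
        System.out.println(isWord(cafe, Pattern.UNICODE_CHARACTER_CLASS));  // true
    }
}
```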

FWIW, my RegExp ended up being:

   (?:^|[\p{P}\p{Z}])(\QThe $earch Term\E)(?:[\p{P}\p{Z}]|$)

where The $earch Term is the text I'm trying to match.

The \p{} constructs are Unicode categories. Basically, I'm breaking my word on any character in the Punctuation (P) or Separator (Z) Unicode character categories. The start and end of the input are respected as well (with ^ and $). The boundary markers are wrapped in non-capturing groups (the (?:...) bits), while the actual search term is quoted with \Q and \E and placed in a capturing group.
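As a rough sketch of how that expression might be assembled in code: Pattern.quote wraps its argument in \Q...\E, so it should produce the same quoting as above. The class TermFinder and the method findTerm are hypothetical names, not something from the original code:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class TermFinder {
    // Returns the matched term (capturing group 1), or null if the term does not
    // occur between punctuation, separators, or the start/end of the input.
    static String findTerm(String term, String text) {
        String boundary = "[\\p{P}\\p{Z}]";
        Pattern p = Pattern.compile(
                "(?:^|" + boundary + ")"            // leading delimiter, not captured
                + "(" + Pattern.quote(term) + ")"   // literal term, captured as group 1
                + "(?:" + boundary + "|$)");        // trailing delimiter, not captured
        Matcher m = p.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String term = "The $earch Term";
        System.out.println(findTerm(term, "Find The $earch Term, please")); // The $earch Term
        System.out.println(findTerm(term, "xThe $earch Termx"));            // null
    }
}
```

One thing to keep in mind with this approach: the delimiter characters are consumed by the match, so two occurrences of the term separated by a single space would not both be found by repeated calls to find().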

Intellectual Tortoise 2010-08-13 18:58:32

`\b` works fine; you were just trying to use it in the wrong place. And `\b` *is* Unicode savvy; it uses `Character.isLetterOrDigit()`, not `\w`, to decide what's a word character and what isn't.

Alan Moore 2010-08-14 01:46:44
