I'm having trouble getting regular expressions with leading / trailing $'s to match in Java (1.6.20).

From this code:

System.out.println( "$40".matches("\\b\\Q$40\\E\\b") );
System.out.println( "$40".matches(".*\\Q$40\\E.*") );
System.out.println( "$40".matches("\\Q$40\\E") );
System.out.println( " ------ " );
System.out.println( "40$".matches("\\b\\Q40$\\E\\b") );
System.out.println( "40$".matches(".*\\Q40$\\E.*") );
System.out.println( "40$".matches("\\Q40$\\E") );
System.out.println( " ------ " );
System.out.println( "4$0".matches("\\b\\Q4$0\\E\\b") );
System.out.println( "40".matches("\\b\\Q40\\E\\b") );

I get these results:

false
true
true
 ------ 
false
true
true
 ------ 
true
true

The leading false results in the first two blocks seem to be the problem. That is, a leading or trailing $ (dollar sign) is not picked up properly when it sits next to the \b (word boundary) marker.

The true results in those blocks show that the quoted dollar sign itself is not at fault, since replacing the \b with .* or removing it altogether gives the desired result.

The last two "true" results show that the issue lies neither with an internal quoted $ nor with matching on word boundaries (\b) around a quoted expression "\Q ... \E".

Is this a Java bug or am I missing something?

+1  A: 

This is because \b matches word boundaries, and the position immediately before or after a $ character does not necessarily count as a word boundary.

A word boundary is a position between a \w character and a \W character (or the start or end of the input), and $ is not part of \w. For example, in the string "bla$" the word boundaries are:

" b l a $ "
 ^----------- here

" b l a $ "
       ^----- here

" b l a $ "
         ^--- but not here
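
The positions above can be checked directly. The following is a minimal sketch (the class and method names are made up for illustration) that lists every index at which \b matches in a string:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original question.
class WordBoundaryDemo {

    // Returns every index in the input at which \b (a zero-width match) succeeds.
    static List<Integer> boundaryPositions(String input) {
        List<Integer> positions = new ArrayList<Integer>();
        Matcher m = Pattern.compile("\\b").matcher(input);
        while (m.find()) {
            positions.add(m.start());
        }
        return positions;
    }

    public static void main(String[] args) {
        System.out.println(boundaryPositions("bla$")); // [0, 3]: before 'b' and between 'a' and '$'
        System.out.println(boundaryPositions("$40"));  // [1, 3]: no boundary before the leading '$'
        System.out.println(boundaryPositions("40$"));  // [0, 2]: no boundary after the trailing '$'
    }
}
```

Since there is no boundary at index 0 of "$40" or at the end of "40$", a pattern that demands \b at those positions cannot match, which explains the two false results in the question.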
Tomalak
A: 

Tomalak nailed it - it's about word boundary matching. I had figured it out and deleted the question, but Will's advice to keep it open for others is sound.

The \b was, in fact, the culprit.

One conclusion could be that for anything but the most rudimentary (i.e. ASCII) uses, the built-in convenience expressions from Java are effectively useless. E.g., \w only matches ASCII characters, \b is based on that, etc.

FWIW, my RegExp ended up being:

   (?:^|[\p{P}\p{Z}])(\QThe $earch Term\E)(?:[\p{P}\p{Z}]|$)

where The $earch Term is the text I'm trying to match.

The \p{} expressions are Unicode categories. Basically, I'm breaking my word on any character in the Punctuation (P) or Separator (Z) Unicode character categories. The start and end of the input are also respected (with ^ and $). The boundary markers are wrapped in non-capturing groups (the (?:...) bits), while the actual search term is quoted with \Q and \E and placed in a capturing group.
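
As a rough sketch of how that expression might be built for an arbitrary search term: the class and method names below are hypothetical, Pattern.quote() stands in for writing \Q...\E by hand, and find() is used rather than matches() because the term is being located inside a longer input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class illustrating the expression above.
class QuotedTermSearch {

    // Characters treated as term delimiters: Unicode Punctuation (P) or Separator (Z).
    private static final String BOUNDARY = "[\\p{P}\\p{Z}]";

    // Builds (?:^|[\p{P}\p{Z}])(\Qterm\E)(?:[\p{P}\p{Z}]|$) for the given term.
    static Pattern wholeTermPattern(String term) {
        return Pattern.compile(
            "(?:^|" + BOUNDARY + ")(" + Pattern.quote(term) + ")(?:" + BOUNDARY + "|$)");
    }

    // True if the term occurs in the input delimited by punctuation,
    // separators, or the start/end of the input.
    static boolean containsTerm(String input, String term) {
        return wholeTermPattern(term).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(containsTerm("Pay $40, please", "$40")); // true
        System.out.println(containsTerm("Pay $400", "$40"));        // false: followed by a digit
        System.out.println(containsTerm("$40", "$40"));             // true: start and end of input
        Matcher m = wholeTermPattern("$40").matcher("Pay $40, please");
        if (m.find()) {
            System.out.println(m.group(1)); // the quoted term, captured in group 1
        }
    }
}
```

Note that the delimiter characters are consumed by the match, so two occurrences separated by a single space would not both be found by repeated find() calls; lookarounds would avoid that, but they are not what the expression above uses.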

Intellectual Tortoise
`\b` works fine; you were just trying to use it in the wrong place. And `\b` *is* Unicode savvy; it uses `Character.isLetterOrDigit()`, not `\w`, to decide what's a word character and what isn't.
Alan Moore