I am using Java regexes in Java 1.6 (inter alia to parse numeric output) and cannot find a precise definition of \b ("word boundary"). I had assumed that "-12" would be an "integer word" (matched by \b\-?\d+\b) but it appears that this does not work. I'd be grateful to know of ways of matching space-separated numbers.

Example:

    Pattern pattern = Pattern.compile("\\s*\\b\\-?\\d+\\s*");
    String plus = " 12 ";
    System.out.println(""+pattern.matcher(plus).matches());
    String minus = " -12 ";
    System.out.println(""+pattern.matcher(minus).matches());
    pattern = Pattern.compile("\\s*\\-?\\d+\\s*");
    System.out.println(""+pattern.matcher(minus).matches());

returns

    true
    false
    true
A: 

I think it's the boundary (i.e. character following) of the last match or the beginning or end of the string.

Peter
A: 

I believe that your problem is due to the fact that - is not a word character. Thus, the only word boundary near the - falls after it (between the - and the first digit), so a \b placed in front of -? cannot match there. A word boundary matches before the first and after the last character of a string when that character is a word character, and at any position where a word character sits on one side and a non-word character on the other. Also note that a word boundary is a zero-width match.

One possible alternative is

    (?:(?:^|\s)-?)\d+\b

This will match any number that is preceded by a whitespace character (or by the beginning of the string) and an optional dash, and that ends at a word boundary. Note that the leading whitespace character, if any, is part of the match.
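As a rough sketch of how that pattern could be used with find(): the version below adds a capturing group around the number so the leading whitespace can be dropped. The capturing group, the class name SignedNumberFinder and the method findNumbers are my own additions for illustration, not part of the pattern above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SignedNumberFinder {
    // Same idea as the pattern above; group 1 (an addition for this sketch)
    // holds the optional dash and the digits without the leading whitespace.
    private static final Pattern NUMBER = Pattern.compile("(?:^|\\s)(-?\\d+)\\b");

    static List<String> findNumbers(String input) {
        List<String> found = new ArrayList<String>();
        Matcher m = NUMBER.matcher(input);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        // "abc-7" is skipped because its dash is not preceded by whitespace.
        System.out.println(findNumbers(" 12  -12 abc-7 ")); // prints [12, -12]
    }
}
```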

Sean Nyman
+5  A: 

A word boundary, in most regex dialects, is a position between \w and \W (non-word char), or at the beginning or end of a string if it begins or ends (respectively) with a word character ([0-9A-Za-z_]).

So, in the string "-12", it would match before the 1 or after the 2. The dash is not a word character.
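To see those positions for yourself, here is a small sketch that lists every index where \b matches. The class name WordBoundaryPositions and the helper boundaries are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordBoundaryPositions {
    // Collects the index of every zero-width \b match in the input.
    static List<Integer> boundaries(String input) {
        List<Integer> positions = new ArrayList<Integer>();
        Matcher m = Pattern.compile("\\b").matcher(input);
        while (m.find()) {
            positions.add(m.start());
        }
        return positions;
    }

    public static void main(String[] args) {
        // Index 0 (before the dash) is missing: the dash is not a word character.
        System.out.println(boundaries("-12")); // prints [1, 3]
    }
}
```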

brianary
Correctamundo. `\b` is a zero-width assertion that matches if there is `\w` on one side, and either there is `\W` on the other or the position is beginning or end of string. `\w` is arbitrarily defined to be "identifier" characters (alnums and underscore), not as anything especially useful for English.
hobbs
100% correct. Apologies for not just commenting on yours. I hit submit before I saw your answer.
Brent Nash
+1  A: 

Check out the documentation on boundary conditions:

http://java.sun.com/docs/books/tutorial/essential/regex/bounds.html

Check out this sample:

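    import java.util.Arrays;

    public class SplitDemo {
        public static void main(final String[] args) {
            String x = "I found the value -12 in my string.";
            // The regex only matches "12", so the "-" stays in the first piece.
            System.err.println(Arrays.toString(x.split("\\b-?\\d+\\b")));
        }
    }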

When you print it out, notice that the output is this:

    [I found the value -,  in my string.]

This means that the "-" character is not being picked up as being on the boundary of a word because it's not considered a word character. Looks like @brianary kinda beat me to the punch, so he gets an up-vote.
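If the goal is whitespace-separated numbers, one option (my own suggestion, not from the tutorial above) is to replace \b with lookarounds that require "no non-whitespace character" on either side. A minimal sketch, with hypothetical names WhitespaceDelimitedNumbers and findAll:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WhitespaceDelimitedNumbers {
    // The pattern from the sample above: \b cannot match in front of the dash.
    static final Pattern WORD_BOUNDED = Pattern.compile("\\b-?\\d+\\b");
    // Assumed alternative: the number must not touch any non-whitespace character.
    static final Pattern SPACE_BOUNDED = Pattern.compile("(?<!\\S)-?\\d+(?!\\S)");

    static List<String> findAll(Pattern p, String input) {
        List<String> found = new ArrayList<String>();
        Matcher m = p.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String x = "I found the value -12 in my string.";
        System.out.println(findAll(WORD_BOUNDED, x));  // prints [12]
        System.out.println(findAll(SPACE_BOUNDED, x)); // prints [-12]
    }
}
```

The lookaround version also rejects things like "abc-7" or "12px", which may or may not be what you want.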

Brent Nash
A: 

A word boundary is a position. It can be one of three positions.

  1. Before the first character in the string, if the first character is a word character.
  2. After the last character in the string, if the last character is a word character.
  3. Between two characters in the string, where one is a word character and the other is not a word character.

Word characters are alphanumeric characters and the underscore. A minus sign is a non-word character. Taken from Regex Tutorial.
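The three rules can be written out by hand, which may make them easier to follow. This is only a sketch: it assumes "word character" means ASCII letters, digits and underscore, and the names BoundaryRules, isWordChar and isBoundary are invented for the example.

```java
class BoundaryRules {
    // Assumption: a word character is an ASCII letter, digit or underscore.
    static boolean isWordChar(char c) {
        return c == '_'
                || (c >= '0' && c <= '9')
                || (c >= 'A' && c <= 'Z')
                || (c >= 'a' && c <= 'z');
    }

    // pos is a position between characters, from 0 to s.length() inclusive.
    static boolean isBoundary(String s, int pos) {
        if (s.isEmpty()) {
            return false;
        }
        if (pos == 0) {
            return isWordChar(s.charAt(0));                    // rule 1
        }
        if (pos == s.length()) {
            return isWordChar(s.charAt(pos - 1));              // rule 2
        }
        return isWordChar(s.charAt(pos - 1)) != isWordChar(s.charAt(pos)); // rule 3
    }

    public static void main(String[] args) {
        String s = "-12";
        for (int i = 0; i <= s.length(); i++) {
            System.out.println(i + ": " + isBoundary(s, i));
        }
    }
}
```

For "-12" this reports a boundary at positions 1 and 3 only, so there is none in front of the minus sign.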

WolfmanDragon
A: 

A word boundary is a position that is either preceded by a word character and not followed by one, or followed by a word character and not preceded by one.
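That definition translates fairly directly into lookarounds. The sketch below compares such a hand-written pattern with \b on ASCII input; the class name LookaroundBoundary and the helper positions are made up for illustration, and I am only claiming the two agree for plain ASCII text like this.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundBoundary {
    // Preceded by a word character and not followed by one,
    // or followed by a word character and not preceded by one.
    static final Pattern EXPLICIT =
            Pattern.compile("(?<=\\w)(?!\\w)|(?<!\\w)(?=\\w)");
    static final Pattern BUILT_IN = Pattern.compile("\\b");

    // Returns the index of every zero-width match of p in the input.
    static List<Integer> positions(Pattern p, String input) {
        List<Integer> result = new ArrayList<Integer>();
        Matcher m = p.matcher(input);
        while (m.find()) {
            result.add(m.start());
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "-12";
        System.out.println(positions(EXPLICIT, s)); // prints [1, 3]
        System.out.println(positions(BUILT_IN, s)); // prints [1, 3]
    }
}
```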

Alan Moore