Java regular expressions have "\B", which matches a non-word boundary.

http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html

If I have a 'char', how can I check whether it is a non-word boundary?

Thank you.

+5  A: 

The boundary has a special meaning. It is actually a zero-length match and therefore cannot be matched against a single character. It is used to determine the position between a non-word char and a word char. Also see http://regular-expressions.info/wordboundaries.html.
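To make the "zero-length match" point concrete, here is a minimal sketch that lists the positions at which `\b` and `\B` match in a short string. The class name `BoundaryPositions` and the helper `positions` are made up for illustration; only `Pattern` and `Matcher` come from the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryPositions {
    // Collects the start index of every match of the given regex.
    // For \b and \B each match is empty, so start() == end().
    static List<Integer> positions(String regex, String input) {
        List<Integer> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.start());
        }
        return result;
    }

    public static void main(String[] args) {
        // Index positions in "ab cd":  0 a 1 b 2 ' ' 3 c 4 d 5
        System.out.println(positions("\\b", "ab cd")); // [0, 2, 3, 5]
        System.out.println(positions("\\B", "ab cd")); // [1, 4]
    }
}
```

Every reported index is a gap between characters (or the start or end of the input), never a character itself.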

However, I understood the question to be more about whether the given char can possibly denote the start or end of a word boundary. From the javadoc which you linked (here is the latest version):

Predefined character classes

. Any character (may or may not match line terminators)
\d A digit: [0-9]
\D A non-digit: [^0-9]
\s A whitespace character: [ \t\n\x0B\f\r]
\S A non-whitespace character: [^\s]
\w A word character: [a-zA-Z_0-9]
\W A non-word character: [^\w]

So, a word character matches \w. A non-word character matches \W. So:

String string = String.valueOf(yourChar);
boolean nonWordCharacter = string.matches("\\W");
BalusC
Note: this doesn't tell you if it's a boundary, just that it's a non-word char. The concept of a boundary is relevant to an ordered collection and can not be reasonably applied to a single char.
jball
Further clarification, boundary is a context specific term, and examining only a char removes the context used for the `"\B"` regex.
jball
Indeed, the boundary has a special meaning. It has actually a zero-length match. Also see http://regular-expressions.info/wordboundaries.html This is actually used to determine the position between a non-word char and a word-char. I however understood that his question was more whether the given char can possibly denote the start or end of a word boundary.
BalusC
@BalusC, I'd add that last comment to your original question to emphasize the fact that `\b` and `\B` don't match a character but a position, since that is what michael is confused about.
Bart Kiers
@Bart: question updated.
BalusC
@BalusC, cheers!
Bart Kiers
+1  A: 
((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'))

or, if you want digits to also count as part of a word:

((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9'))
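Note that the expressions above are true for word characters, so the non-word test is their negation. As a rough sketch, the check can be wrapped in a method; the version below also accepts the underscore so that it lines up with the `[a-zA-Z_0-9]` definition of `\w` quoted from the javadoc. The names `WordChars`, `isWordChar` and `isNonWordChar` are hypothetical.

```java
class WordChars {
    // Mirrors the default (ASCII-only) definition of \w: [a-zA-Z_0-9].
    static boolean isWordChar(char c) {
        return (c >= 'a' && c <= 'z')
            || (c >= 'A' && c <= 'Z')
            || (c >= '0' && c <= '9')
            || c == '_';
    }

    // Mirrors \W: anything that is not a word character.
    static boolean isNonWordChar(char c) {
        return !isWordChar(c);
    }

    public static void main(String[] args) {
        System.out.println(isNonWordChar('a')); // false
        System.out.println(isNonWordChar('_')); // false
        System.out.println(isNonWordChar('@')); // true
    }
}
```

This avoids building a `String` and running a regex for every character, at the cost of hard-coding the ASCII ranges.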
zed_0xff
+1  A: 

A boundary is a position between two characters, so a character can never be a boundary.

If you want to match a character that is not surrounded by word boundaries, e.g. the character b in abc, then you can use

\B.\B

Remember to escape the backslashes in a Java string, as in

Pattern regex = Pattern.compile("\\B.\\B");
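As a small illustration of how that pattern behaves, the sketch below collects every match of `\B.\B` in a few inputs. The class `InnerChars` and the helper `findAll` are illustrative names, not part of any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InnerChars {
    // Returns the text of every match of the regex in the input, in order.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(findAll("\\B.\\B", "abc"));  // [b]
        System.out.println(findAll("\\B.\\B", "abcd")); // [b, c]
        System.out.println(findAll("\\B.\\B", "a b"));  // []
    }
}
```

In "a b" every character has a word boundary on at least one side, so nothing matches.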
Tim Pietzcker
In practice, it's fine to define boundaries as something that exists only between two characters. However, it's actually more liberal than that, at least in Java. See my answer.
polygenelubricants
A: 

The question is very peculiar, but it's true that a \w on its own is surrounded by \b. Similarly, a \W on its own is surrounded by \B. So for the purposes of the word boundary definition, ^ and $ act as if they were non-word characters.

    System.out.println("a".matches("^\\b\\w\\b$")); // true
    System.out.println("a".matches("^\\b\\w\\B$")); // false
    System.out.println("a".matches("^\\B\\w\\b$")); // false
    System.out.println("a".matches("^\\B\\w\\B$")); // false

    System.out.println("@".matches("^\\b\\W\\b$")); // false
    System.out.println("@".matches("^\\b\\W\\B$")); // false
    System.out.println("@".matches("^\\B\\W\\b$")); // false
    System.out.println("@".matches("^\\B\\W\\B$")); // true

    System.out.println("".matches("$$$$\\B\\B\\B\\B^^^")); // true

The last line may be surprising, but such is the nature of anchors.
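The rule implied above (the start and end of the input count as non-word) can be sketched by hand: a position is a non-word boundary when the characters on both sides are of the same kind. The code below is only an illustration under the assumption of ASCII input; `ManualBoundary` and its methods are hypothetical names, and the regex-based method is there purely to compare against `\B`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ManualBoundary {
    private static final Pattern NON_BOUNDARY = Pattern.compile("\\B");

    // Anything outside the string is treated as a non-word character.
    static boolean isWordChar(String s, int i) {
        if (i < 0 || i >= s.length()) {
            return false;
        }
        char c = s.charAt(i);
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
            || (c >= '0' && c <= '9') || c == '_';
    }

    // pos is the gap before s.charAt(pos); valid values are 0..s.length().
    static boolean isNonWordBoundary(String s, int pos) {
        return isWordChar(s, pos - 1) == isWordChar(s, pos);
    }

    // Same question answered by the regex engine, for comparison.
    static boolean regexNonWordBoundary(String s, int pos) {
        Matcher m = NON_BOUNDARY.matcher(s);
        return m.find(pos) && m.start() == pos;
    }

    public static void main(String[] args) {
        String s = "ab cd";
        for (int pos = 0; pos <= s.length(); pos++) {
            System.out.println(pos + ": " + isNonWordBoundary(s, pos)
                + " / " + regexNonWordBoundary(s, pos));
        }
    }
}
```

Note that the check takes a string and a position, not a single char, which is the context a boundary needs.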


polygenelubricants