ansaurus

Question

Why is non-breaking space not a whitespace character in Java?

Answer 1

+2 A:

It looks like the method name (isWhitespace) is inconsistent with its function (to detect separators). The "separator" functionality is fairly clear if you look at the full list of characters from the Javadoc page you quoted:

* It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', '\u2007', '\u202F').
* It is '\u0009', HORIZONTAL TABULATION.
* It is '\u000A', LINE FEED.
* It is '\u000B', VERTICAL TABULATION.
* It is '\u000C', FORM FEED.
* It is '\u000D', CARRIAGE RETURN.
* It is '\u001C', FILE SEPARATOR.
* It is '\u001D', GROUP SEPARATOR.
* It is '\u001E', RECORD SEPARATOR.
* It is '\u001F', UNIT SEPARATOR.

A non-breaking space is meant to provide visual space between words without letting line-breaking or hyphenation algorithms split them apart.
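The rules quoted above can be checked directly. This is a minimal sketch, not part of the original answer: the class name WhitespaceCheck and the helper isWs are made up for illustration, and only Character.isWhitespace and Character.isSpaceChar are real JDK methods.

```java
public class WhitespaceCheck {
    // Hypothetical helper: thin wrapper around the JDK method under discussion.
    static boolean isWs(char c) {
        return Character.isWhitespace(c);
    }

    // Hypothetical helper: the Unicode "space character" test, for comparison.
    static boolean isSpace(char c) {
        return Character.isSpaceChar(c);
    }

    public static void main(String[] args) {
        // Space, tab, FILE SEPARATOR, the three non-breaking spaces, EM SPACE.
        char[] samples = {' ', '\t', '\u001C', '\u00A0', '\u2007', '\u202F', '\u2003'};
        for (char c : samples) {
            System.out.printf("U+%04X isWhitespace=%b isSpaceChar=%b%n",
                    (int) c, isWs(c), isSpace(c));
        }
    }
}
```

The three non-breaking spaces report isWhitespace=false but isSpaceChar=true, while the tab and the FILE SEPARATOR control character report the opposite.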

Jason S 2009-06-29 21:14:39

Answer 2

+3 A:

I would argue that Java's implementation is more correct than .NET's. The non-breaking space is essentially a non-whitespace character that looks like one. That is, if you have the strings "foo" and "bar", and put any traditional whitespace character in between them, you would get a word break. A non-breaking space, however, does not break the two up.

Matt Poush 2009-06-29 21:16:21

A non-breaking space is still a word boundary. The "breaking" in "non-breaking space" refers to how it should be interpreted for purposes of **line**-breaking, not word breaks.

richardtallent 2009-06-29 22:20:34

Answer 3

+2 A:

Character.isWhitespace(char) is old. Really old. Many things done in the early days of Java followed conventions and implementations from C.

Now, more than a decade later, these things seem erroneous. Consider it evidence of how far things have come, even between the first days of Java and the first days of .NET.

Java strives to be 100% backward compatible. So even if the Java team thought it would be good to fix their initial mistake and add non-breaking spaces to the set of characters that returns true from Character.isWhitespace(char), they can't, because there almost certainly exists software that relies on the current implementation working exactly the way it does.

Steve McLeod 2009-06-29 21:50:32

Answer 4

+3 A:

The only time a non-breaking space should be treated specially is with code designed to perform word-wrapping of text.

For all other purposes, including word counts, trimming, and general-purpose splitting along word boundaries, a non-breaking space is still whitespace.

Any argument that a non-breaking space just "looks like" a space but isn't one conflicts with the whole point of Unicode, which represents characters based on their meaning, not how they are displayed.

Thus, IMHO, the Java implementation of String.trim() is not performing as expected, and the underlying Character.isWhitespace() function is at fault.

My guess is that the Java implementors wrote isWhitespace() based on the need to perform text-wrapping within controls. They should have named this function isWordWrappingBoundary() or something more clear, and used a less-restrictive whitespace test for trim().
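The word-count and splitting case mentioned above can be illustrated with regular expressions. This is only a sketch under assumptions: the class SplitDemo, its constants, and countWords are hypothetical names, and adding \p{Zs} to the separator set is one possible choice rather than anything the JDK prescribes.

```java
import java.util.regex.Pattern;

public class SplitDemo {
    // Java's default \s covers only space, tab, newline, VT, form feed, CR.
    static final Pattern ASCII_WS = Pattern.compile("\\s+");
    // Assumption for this sketch: also treat every Unicode space separator (Zs) as a boundary.
    static final Pattern ANY_SPACE = Pattern.compile("[\\s\\p{Zs}]+");

    // Hypothetical helper: counts the pieces produced by splitting on the given separator.
    static int countWords(String text, Pattern separator) {
        return separator.split(text).length;
    }

    public static void main(String[] args) {
        String text = "foo\u00A0bar baz";
        System.out.println(countWords(text, ASCII_WS));  // 2: the NBSP is not a boundary
        System.out.println(countWords(text, ANY_SPACE)); // 3: the NBSP splits foo and bar
    }
}
```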

richardtallent 2009-06-29 21:52:30

String.trim() is even more broken than that. It only strips characters at or below '\u0020' (the ASCII space and the ASCII control characters), and no other Unicode whitespace at all, breaking or not.
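A quick way to see this behaviour, as a sketch (the class TrimDemo and the helper trimmedLength are illustrative names only):

```java
public class TrimDemo {
    // Hypothetical helper: length of the string after String.trim().
    static int trimmedLength(String s) {
        return s.trim().length();
    }

    public static void main(String[] args) {
        // Control character, space, tab, CR, LF: all are <= U+0020 and get removed.
        System.out.println(trimmedLength("\u0001 \tfoo \r\n")); // 3
        // NO-BREAK SPACE is above U+0020, so it survives.
        System.out.println(trimmedLength("\u00A0foo\u00A0"));   // 5
        // EM SPACE also survives, even though Character.isWhitespace accepts it.
        System.out.println(trimmedLength("\u2003foo\u2003"));   // 5
    }
}
```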

Thilo 2009-06-30 01:30:54

Answer 5

A:

Since Java 5 there is also an isSpaceChar(int) method. Does that not do what you want?

Determines if the specified character (Unicode code point) is a Unicode space character. A character is considered to be a space character if and only if it is specified to be a space character by the Unicode standard. This method returns true if the character's general category type is any of the following: ...

Jesper 2009-09-17 10:58:04

It's not so much the existence of such a method that the OP was looking for; but rather a `trim`-type function that *uses* that method to determine what to strip.
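Such a function is short to write by hand. The following is one possible sketch, not a JDK API: the class UnicodeTrim and the methods isBlank and trimUnicode are hypothetical, and treating a code point as strippable when either Character.isWhitespace or Character.isSpaceChar accepts it is an assumption about what the OP wanted.

```java
public final class UnicodeTrim {
    private UnicodeTrim() {}

    // Assumption: strip anything either JDK test considers a space,
    // which covers the non-breaking spaces as well as tabs and newlines.
    static boolean isBlank(int codePoint) {
        return Character.isWhitespace(codePoint) || Character.isSpaceChar(codePoint);
    }

    // Hypothetical trim variant that walks code points from both ends.
    static String trimUnicode(String s) {
        int start = 0;
        int end = s.length();
        while (start < end) {
            int cp = s.codePointAt(start);
            if (!isBlank(cp)) break;
            start += Character.charCount(cp);
        }
        while (end > start) {
            int cp = s.codePointBefore(end);
            if (!isBlank(cp)) break;
            end -= Character.charCount(cp);
        }
        return s.substring(start, end);
    }

    public static void main(String[] args) {
        System.out.println("[" + trimUnicode("\u00A0 \tfoo bar\u2007\n") + "]"); // [foo bar]
    }
}
```

Interior spaces are left alone; only the leading and trailing runs are removed.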

Andrzej Doyle 2009-09-17 11:00:30
