While searching for a proper way to trim non-breaking spaces from parsed HTML, I first stumbled on Java's spartan definition of String.trim(), which is at least properly documented. I wanted to avoid explicitly listing the characters eligible for trimming, so I assumed that the Unicode-backed methods on the Character class would do the job for me.

That's when I discovered that Character.isWhitespace(char) explicitly excludes non-breaking spaces:

It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', '\u2007', '\u202F').

Why is that?

The corresponding .NET implementation is less discriminating.

+2  A: 

It looks like the method name (isWhitespace) is inconsistent with its function (to detect separators). The "separator" functionality is fairly clear if you look at the full list of characters from the Javadoc page you quoted:

* It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', '\u2007', '\u202F').
* It is '\u0009', HORIZONTAL TABULATION.
* It is '\u000A', LINE FEED.
* It is '\u000B', VERTICAL TABULATION.
* It is '\u000C', FORM FEED.
* It is '\u000D', CARRIAGE RETURN.
* It is '\u001C', FILE SEPARATOR.
* It is '\u001D', GROUP SEPARATOR.
* It is '\u001E', RECORD SEPARATOR.
* It is '\u001F', UNIT SEPARATOR.

A non-breaking space is meant to provide visual space between words without letting line-breaking or hyphenation algorithms split them apart.
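To see the rules above in action, here is a minimal sketch that prints what Character.isWhitespace and Character.isSpaceChar report for a few sample characters. The class name WhitespaceCheck and the helper describe are made up for illustration; only the two Character methods come from the JDK.

```java
final class WhitespaceCheck {
    // Hypothetical helper: formats the code point and the verdict of both JDK predicates.
    static String describe(char c) {
        return String.format("U+%04X isWhitespace=%b isSpaceChar=%b",
                (int) c, Character.isWhitespace(c), Character.isSpaceChar(c));
    }

    public static void main(String[] args) {
        // Plain space, tab, the three non-breaking spaces, an em space, and FILE SEPARATOR.
        char[] samples = {' ', '\t', '\u00A0', '\u2007', '\u202F', '\u2003', '\u001C'};
        for (char c : samples) {
            System.out.println(describe(c));
        }
    }
}
```

The three non-breaking spaces should come back with isWhitespace=false and isSpaceChar=true, while the tab and FILE SEPARATOR show the opposite pattern, because they are control characters rather than Unicode space separators.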

Jason S
+3  A: 

I would argue that Java's implementation is more correct than .NET's. The non-breaking space is essentially a non-whitespace character that looks like one. That is, if you have the strings "foo" and "bar", and put any traditional whitespace character in between them, you would get a word break. A non-breaking space, however, does not break the two up.

Matt Poush
A non-breaking space is still a word boundary. The "breaking" in "non-breaking space" refers to how it should be interpreted for purposes of **line**-breaking, not word breaks.
richardtallent
+2  A: 

Character.isWhitespace(char) is old. Really old. Many things done in the early days of Java followed conventions and implementations from C.

Now, more than a decade later, these things seem erroneous. Consider it evidence of how far things have come, even between the first days of Java and the first days of .NET.

Java strives to be 100% backward compatible. So even if the Java team thought it would be good to fix their initial mistake and make Character.isWhitespace(char) return true for non-breaking spaces, they can't, because there almost certainly exists software that relies on the current implementation working exactly the way it does.

Steve McLeod
+3  A: 

The only time a non-breaking space should be treated specially is with code designed to perform word-wrapping of text.

For all other purposes, including word counts, trimming, and general-purpose splitting along word boundaries, a non-breaking space is still whitespace.
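For the word-count case, one possible approach is to split on a regular expression that accepts Unicode separators as well as ASCII whitespace. This is only a sketch: WordCount and countWords are hypothetical names, and the pattern assumes that treating every character in Unicode category Z as a word gap is acceptable for your text.

```java
import java.util.regex.Pattern;

final class WordCount {
    // \s matches ASCII whitespace by default; \p{Z} adds the Unicode separator
    // categories, which include the non-breaking space U+00A0.
    private static final Pattern GAP = Pattern.compile("[\\s\\p{Z}]+");

    // Hypothetical helper: counts the non-empty pieces left after splitting on gaps.
    static int countWords(String text) {
        int count = 0;
        for (String piece : GAP.split(text)) {
            if (!piece.isEmpty()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // "foo" and "bar" are joined by a non-breaking space, "bar" and "baz" by a plain one.
        System.out.println(countWords("foo\u00A0bar baz"));
    }
}
```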

Any argument that a non-breaking space just "looks like" a space but isn't one conflicts with the whole point of Unicode, which represents characters based on their meaning, not how they are displayed.

Thus, IMHO, the Java implementation of String.trim() is not performing as expected, and the underlying Character.isWhitespace() function is at fault.

My guess is that the Java implementors wrote isWhitespace() based on the need to perform text-wrapping within controls. They should have named this function isWordWrappingBoundary() or something clearer, and used a less restrictive whitespace test for trim().

richardtallent
String.trim() is even more broken than that. It only trims characters up to '\u0020' (the ASCII control characters and the plain space), and no other Unicode whitespace at all, breaking or not.
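A quick way to check this is to trim a few strings and print the result with the invisible characters spelled out. The class TrimDemo and its visible helper are invented for this sketch; String.trim() is the only library behavior being exercised.

```java
final class TrimDemo {
    // Hypothetical helper: shows printable ASCII as-is and everything else as <U+XXXX>.
    static String visible(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c > 0x20 && c < 0x7F) {
                sb.append(c);
            } else {
                sb.append(String.format("<U+%04X>", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Control character, space and tab are all at or below U+0020, so they go.
        System.out.println(visible("\u0001 foo \t".trim()));   // foo
        // The non-breaking space survives trimming.
        System.out.println(visible("\u00A0foo\u00A0".trim())); // <U+00A0>foo<U+00A0>
        // So does the em space, even though Character.isWhitespace accepts it.
        System.out.println(visible("\u2003foo\u2003".trim())); // <U+2003>foo<U+2003>
    }
}
```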
Thilo
A: 

Since Java 5 there is also an isSpaceChar(int) method. Does that not do what you want?

Determines if the specified character (Unicode code point) is a Unicode space character. A character is considered to be a space character if and only if it is specified to be a space character by the Unicode standard. This method returns true if the character's general category type is any of the following: ...
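Note that isSpaceChar alone rejects tabs and line feeds, since those are control characters rather than Unicode space characters. A trim-style helper could therefore accept a code point if either isSpaceChar or isWhitespace does. The following is a minimal sketch under that assumption; UnicodeTrim, isTrimmable and trim are hypothetical names, not JDK methods.

```java
final class UnicodeTrim {
    // Assumption: a code point is trimmable if either JDK predicate accepts it,
    // which covers non-breaking spaces as well as tabs and line breaks.
    static boolean isTrimmable(int codePoint) {
        return Character.isWhitespace(codePoint) || Character.isSpaceChar(codePoint);
    }

    // Strips trimmable code points from both ends and leaves the interior untouched.
    static String trim(String s) {
        int start = 0;
        int end = s.length();
        while (start < end) {
            int cp = s.codePointAt(start);
            if (!isTrimmable(cp)) {
                break;
            }
            start += Character.charCount(cp);
        }
        while (end > start) {
            int cp = s.codePointBefore(end);
            if (!isTrimmable(cp)) {
                break;
            }
            end -= Character.charCount(cp);
        }
        return s.substring(start, end);
    }

    public static void main(String[] args) {
        // Leading NBSP, space and tab plus trailing figure space and newline are removed;
        // the NBSP between "foo" and "bar" stays.
        String trimmed = trim("\u00A0 \tfoo\u00A0bar\u2007\n");
        System.out.println(trimmed.length()); // 7
    }
}
```

Working with code points rather than chars keeps the helper safe if a supplementary character ever sits at either end of the string.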

Jesper
It's not so much the existence of such a method that the OP was looking for; but rather a `trim`-type function that *uses* that method to determine what to strip.
Andrzej Doyle