While searching for a proper way to trim non-breaking space from parsed HTML, I've first stumbled on java's spartan definition of String.trim()
which is at least properly documented. I wanted to avoid explicitly listing characters eligible for trimming, so I assumed that using Unicode backed methods on Character class would do the job for me.
That's when I discovered that Character.isWhitespace(char) explicitly excludes non-breaking spaces:
It is a Unicode space character (
SPACE_SEPARATOR
,LINE_SEPARATOR
, orPARAGRAPH_SEPARATOR
) but is not also a non-breaking space ('\u00A0'
,'\u2007'
,'\u202F'
).
Why is that?
The implementation of corresponding .NET equivalent is less discriminating.