Hi,

After extracting some HTML from a web page, I have some elements whose text ends in an unknown whitespace character that does not match "\\s":

<span>Monday </span>

In Java, to check what this character is, I am doing:

String s = getTheSpanContent();
char c = s.charAt(s.length() - 1);  // last character of the span text
int i = (int) c;                    // its numeric (UTF-16) value

and the value of i is: 160

Does anyone know what this is, and how I can match it?

Thanks

+2  A: 

That's \u00A0, also known as the non-breaking space. In HTML it is the character represented by &nbsp;. Apparently the page author used it instead of a normal space.

BalusC
OK, that makes sense. I can see &nbsp; in the source, but my parser was converting it to this character, which didn't match \\s. Cheers.
Richard
It wasn't clear from your question that you were looking for a regex pattern to match this particular character. Anyway, it's good to know that decimal 160 equals hex A0, so you know which hex code to use in a regex and to look up in Unicode charts :)
BalusC
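To illustrate the decimal-to-hex step mentioned above, here is a minimal sketch that prints an unknown character in the form used by Unicode charts. The class and method names (CharInspector, describe) are made up for this example; Character.getName is the standard JDK lookup for a code point's Unicode name.

```java
final class CharInspector {

    /**
     * Describes a char as "U+XXXX (decimal N, NAME)" so it can be
     * looked up in a Unicode chart or written as \\xXX in a regex.
     */
    static String describe(char c) {
        int code = c; // widen the char to its numeric value
        return String.format("U+%04X (decimal %d, %s)",
                code, code, Character.getName(code));
    }

    public static void main(String[] args) {
        // 160 is the value reported in the question.
        System.out.println(describe((char) 160));
    }
}
```

Running this on the last character of the span text should identify the character without having to convert the number by hand.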
+4  A: 

It's a non-breaking space. According to the Pattern Javadocs, \\s matches [ \t\n\x0B\f\r], so you'll have to explicitly add \xA0 to your regex if you want to match it.

Michael Myers
OK, thanks - this ties in with the HTML source. Joel's solution below also works.
Richard
So my regex for matching all whitespace, including non-breaking spaces, is "[\\s\\xA0]+", which appears to work. Cheers for the help.
Richard
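As a rough sketch of how the "[\\s\\xA0]+" pattern from the comment above might be used, the following collapses any run of ordinary or non-breaking whitespace into a single space and trims the ends. The class and method names (WhitespaceRegex, normalize) are hypothetical, chosen only for this example.

```java
import java.util.regex.Pattern;

final class WhitespaceRegex {

    // By default \s covers [ \t\n\x0B\f\r]; \xA0 adds the non-breaking space.
    private static final Pattern ANY_SPACE = Pattern.compile("[\\s\\xA0]+");

    /** Replaces each run of whitespace (including U+00A0) with one space, then trims. */
    static String normalize(String text) {
        return ANY_SPACE.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String spanText = "Monday" + (char) 160; // text ending in a non-breaking space
        System.out.println("[" + normalize(spanText) + "]");
    }
}
```

Compiling the pattern once and reusing it avoids recompiling the regex on every call, which matters if many elements are processed.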
+3  A: 

Try using this to match whitespace:

Character.isSpaceChar(c) || c <= ' ';
Joel
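A small sketch of how the check above could be applied without a regex, assuming the goal is to drop trailing whitespace of any kind from the extracted text. The names SpaceCheck, isAnySpace and trimEnd are invented for illustration. Note that Character.isWhitespace is not a substitute here, since it deliberately excludes non-breaking spaces.

```java
final class SpaceCheck {

    /** True for Unicode space separators (including U+00A0) and for ASCII control/space chars. */
    static boolean isAnySpace(char c) {
        return Character.isSpaceChar(c) || c <= ' ';
    }

    /** Removes trailing characters for which isAnySpace is true. */
    static String trimEnd(String s) {
        int end = s.length();
        while (end > 0 && isAnySpace(s.charAt(end - 1))) {
            end--;
        }
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        String spanText = "Monday" + (char) 160;
        System.out.println("[" + trimEnd(spanText) + "]");
    }
}
```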
+2  A: 

I made this spreadsheet once to show you some of the different definitions of whitespace used in Java. It's pretty fascinating (if you're weird like me).

http://spreadsheets.google.com/pub?key=pd8dAQyHbdewRsnE5x5GzKQ

Incidentally, Guava's CharMatcher.WHITESPACE tracks the latest Unicode 5.x standard, and is thus often a better choice than JDK methods.

http://guava-libraries.googlecode.com

Kevin Bourrillion