ansaurus

Question

Answer 1

+1 A:

Since you're comparing with a string literal, you need to make sure that you're saving your source file in the same encoding that javac is expecting. You can also specify what encoding your source files are in with the -encoding argument to javac.

That seems like the most likely "gotcha" in this scenario.

Note that I'm talking about the encoding of your Java source code, not the XML document.
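One way to take the source file encoding out of the picture is to write the non-ASCII character as a Unicode escape. A minimal sketch, assuming Java 7 or later for `StandardCharsets`; the class name `EscapedLiteralDemo` and the constant `COUNTRY` are made up for illustration:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names here are illustrative only.
class EscapedLiteralDemo {
    // The escape below is U+00F4 (o with circumflex). The compiler resolves
    // Unicode escapes itself, so this literal is the same no matter what
    // encoding the source file is saved in or what javac assumes.
    static final String COUNTRY = "C\u00f4te d'Ivoire";

    public static void main(String[] args) {
        byte[] utf8 = COUNTRY.getBytes(StandardCharsets.UTF_8);
        System.out.println(COUNTRY.length()); // 13 UTF-16 code units
        System.out.println(utf8.length);      // 14 bytes, the accented letter takes two
    }
}
```

If the literal is typed directly instead, compiling with something like `javac -encoding UTF-8 MyClass.java` (file name hypothetical) should keep javac and your editor in agreement.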

zerocrates 2010-05-08 03:14:05

Answer 2

+1 A:

Java strings are always UTF-16. Your XML parser should be converting the file's UTF-8 characters into UTF-16 while reading, and your own strings are already UTF-16 in memory, so you can compare them with an ordinary equals() call. If they aren't comparing equal when you think they should, the problem is likely something else.
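To illustrate, here is a rough sketch that parses a small UTF-8 XML document with the JDK's built-in DOM parser and compares the text with `equals()`. The `<country>` element, the class name and the `readCountry` helper are assumptions for the example, not taken from the question:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical demo class; element and method names are illustrative only.
class XmlCompareDemo {
    // Parses the bytes as XML and returns the text of the root element.
    static String readCountry(byte[] xmlBytes) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new ByteArrayInputStream(xmlBytes));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<country>C\u00f4te d'Ivoire</country>";
        // Simulate a UTF-8 file on disk.
        byte[] xmlBytes = xml.getBytes(StandardCharsets.UTF_8);

        String fromXml = readCountry(xmlBytes);
        // The parser has already decoded UTF-8 into a Java String.
        System.out.println("C\u00f4te d'Ivoire".equals(fromXml)); // true
    }
}
```

Passing the parser an `InputStream` rather than a `Reader` lets it detect the encoding from the XML declaration itself.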

Wyzard 2010-05-08 03:14:49

Answer 3

+2 A:

Java stores Strings internally as an array of chars, which are 16-bit unsigned values. This was based on an earlier Unicode standard that supported 64K characters.

Your String constant "Côte d'Ivoire" is in this format. If your character encoding on your XML document is correct then the String read from there will also be in the correct format. So possible errors are:

The XML document doesn't declare a character encoding;
The declared character encoding does not match the actual character encoding used.
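To see what such a mismatch does to the string, here is a small sketch that decodes the same UTF-8 bytes with different charsets. The class and method names are hypothetical, and `StandardCharsets` assumes Java 7 or later:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names here are illustrative only.
class WrongCharsetDemo {
    // Decodes raw bytes using the given charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String expected = "C\u00f4te d'Ivoire";
        byte[] utf8 = expected.getBytes(StandardCharsets.UTF_8);

        String asUtf8 = decode(utf8, StandardCharsets.UTF_8);
        String asLatin1 = decode(utf8, StandardCharsets.ISO_8859_1);
        String asAscii = decode(utf8, StandardCharsets.US_ASCII);

        System.out.println(expected.equals(asUtf8));   // true
        System.out.println(expected.equals(asLatin1)); // false, two chars where one was expected
        System.out.println(expected.equals(asAscii));  // false, non-ASCII bytes get replaced
    }
}
```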

Perhaps the XML string is being treated as US-ASCII instead of UTF-8. I would output both and eyeball them. If they look the same, compare them character by character to see where the comparison fails. You may also want to compare the UTF-8 encoding of your constant String to what's in the XML document:

byte[] bytes = "Côte d'Ivoire".getBytes("UTF-8");
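The character-by-character comparison could look something like this. It is only a sketch; `firstDifference` and `describe` are helper names I made up:

```java
// Hypothetical demo class; the helper names are illustrative only.
class StringDiffDemo {
    // Returns the index of the first differing char, or -1 if the strings are equal.
    static int firstDifference(String a, String b) {
        int n = Math.min(a.length(), b.length());
        for (int i = 0; i < n; i++) {
            if (a.charAt(i) != b.charAt(i)) {
                return i;
            }
        }
        // Same prefix: they differ only if one string is longer.
        return a.length() == b.length() ? -1 : n;
    }

    // Renders every char as U+XXXX so invisible differences show up.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            sb.append(String.format("U+%04X ", (int) s.charAt(i)));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String constant = "C\u00f4te d'Ivoire";
        String fromXml = "Co\u0302te d'Ivoire"; // stand-in for the parsed value

        System.out.println(firstDifference(constant, fromXml)); // 1
        System.out.println(describe(constant));
        System.out.println(describe(fromXml));
    }
}
```

Printing the code points in hex is usually more telling than printing the strings, since a console with the wrong encoding can hide or invent differences.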

It gets more complicated when you start getting into "supplementary characters". These are characters whose code points (in Unicode parlance) lie beyond the originally intended 64K range, so each one takes two chars (a surrogate pair) in a Java String. See Supplementary Characters in the Java Platform. This isn't an issue with any of the characters you're using but it's worth noting for completeness.
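For completeness, a short sketch of how a supplementary character behaves in a String. The class and helper names are hypothetical; U+1D11E (musical G clef) is just a convenient example character:

```java
// Hypothetical demo class; the names here are illustrative only.
class SupplementaryDemo {
    // Builds a String holding a single code point.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String clef = fromCodePoint(0x1D11E);

        System.out.println(clef.length());                         // 2 chars (a surrogate pair)
        System.out.println(clef.codePointCount(0, clef.length())); // 1 code point
        System.out.println(Integer.toHexString(clef.codePointAt(0))); // 1d11e
    }
}
```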

cletus 2010-05-08 03:14:49

Correct! My XML document has no tag specifying a character encoding. What default character encoding is assumed?

cppdev 2010-05-08 03:19:51

@cppdev UTF-8 is assumed *generally*. See http://www.opentag.com/xfaq_enc.htm

cletus 2010-05-08 03:23:57

@cppdev but what you need to do is see where the comparison fails. Compare character by character and byte by byte to see where the difference lies. From there you/we should be able to figure out the why.

cletus 2010-05-08 03:26:06

@cppdev What is `getXMLNodeString()`? Does it return a `String` or something else? Possibly a text node? That's not a standard function. Maybe you should post the code to that.

cletus 2010-05-08 03:31:59

@cletus - it gets *even more* complicated when you take into account composed and decomposed forms of accented characters and ligatures.
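A sketch of the composed versus decomposed case, using `java.text.Normalizer` (available since Java 6). The class and the `equalsNormalized` helper are hypothetical names, and picking NFC over the other forms is an assumption:

```java
import java.text.Normalizer;

// Hypothetical demo class; the names here are illustrative only.
class NormalizeDemo {
    // Compares two strings after bringing both to the composed form (NFC).
    static boolean equalsNormalized(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        String composed = "C\u00f4te d'Ivoire";    // single precomposed letter
        String decomposed = "Co\u0302te d'Ivoire"; // plain o plus combining circumflex

        System.out.println(composed.equals(decomposed));          // false
        System.out.println(equalsNormalized(composed, decomposed)); // true
    }
}
```

Both strings render identically on screen, which is why eyeballing them is not enough.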

Stephen C 2010-05-08 04:38:44


Comparing utf-8 strings in java
