In my Java program, I am retrieving some data from an XML document. The XML contains a few international characters and is encoded in UTF-8. I read it using an XML parser. Once I retrieve a particular international string from the parser, I need to compare it with a set of predefined strings. The problem is that when I use String.equals on an international string, the comparison fails.

How do I compare strings containing international characters in Java? I am using SAXParser and XMLReader to read the strings from the XML.

Here's the code that compares the strings:

 String country;
 country = getXMLNodeString();

 if(country.equals("Côte d'Ivoire"))
 {    

 } 

  getXMLNodeString()
  {

        /* Get a SAXParser from the SAXParserFactory. */
        SAXParserFactory spf = SAXParserFactory.newInstance();
        SAXParser sp = spf.newSAXParser();

        /* Get the XMLReader of the SAXParser we created. */
        XMLReader xr = sp.getXMLReader();
        /* Create a new ContentHandler and apply it to the XML-Reader*/
        XmlParser xmlParser = new XmlParser();  //my class to parse xml
        xr.setContentHandler(xmlParser);  

        /* Parse the xml-data from our URL. */
        xr.parse(new InputSource(url.openStream()));
        /* Parsing has finished. */


       //return string here
  }
+1  A: 

Since you're comparing with a string literal, you need to make sure that you're saving your source file in the same encoding that javac is expecting. You can also specify what encoding your source files are in with the -encoding argument to javac.

That seems like the most likely "gotcha" in this scenario.

Note that I'm talking about the encoding of your Java source code, not the XML document.
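
One way to take the source-file encoding out of the picture entirely is to write the non-ASCII character as a Unicode escape, which is plain ASCII and so is read the same way whatever encoding javac assumes. A minimal sketch of that idea follows; the class name `CountryLiteral`, the constant, and the helper method are hypothetical names made up for illustration, not part of the question's code.

```java
public class CountryLiteral {
    // The escape below stands for U+00F4 (LATIN SMALL LETTER O WITH CIRCUMFLEX).
    // Because the escape itself is pure ASCII, the literal does not depend on
    // the encoding the source file was saved in.
    static final String COTE_D_IVOIRE = "C\u00F4te d'Ivoire";

    /** Hypothetical helper: true if the given name equals the constant. */
    static boolean isCoteDIvoire(String country) {
        return COTE_D_IVOIRE.equals(country);
    }

    public static void main(String[] args) {
        System.out.println(COTE_D_IVOIRE.length());        // 13
        System.out.println((int) COTE_D_IVOIRE.charAt(1)); // 244, i.e. 0xF4
    }
}
```

If the second line printed two values such as 195 and 180 for a literal typed directly as "ô", that would suggest the source file was saved as UTF-8 but compiled as a single-byte encoding, which is the case the -encoding argument addresses.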

zerocrates
+1  A: 

Java strings are always UTF-16. Your XML parser should be converting the file's UTF-8 characters into UTF-16 while reading, and your own strings are already UTF-16 in memory, so you can compare them with an ordinary equals() call. If they aren't comparing equal when you think they should, the problem is likely something else.
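
To check this claim in isolation, the sketch below parses a small UTF-8 document held in memory and compares the result with equals(). The class `SaxEqualsDemo`, the `<country>` element, and the `readText` helper are assumptions made for this example. It also illustrates one possible "something else": SAX is allowed to deliver a text node through several characters() calls, so a handler that overwrites its string instead of appending can end up with only a fragment.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxEqualsDemo {
    /** Parses the given XML bytes and returns all character data concatenated. */
    static String readText(byte[] xmlBytes) {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // characters() may be called more than once per text node, so append.
                text.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new ByteArrayInputStream(xmlBytes)), handler);
        } catch (Exception e) {
            // Sketch only: real code would handle SAXException and IOException separately.
            throw new IllegalStateException("XML parsing failed", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<country>C\u00F4te d'Ivoire</country>";
        String country = readText(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(country.equals("C\u00F4te d'Ivoire")); // true
    }
}
```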

Wyzard
+2  A: 

Java stores Strings internally as an array of chars, which are 16 bit unsigned values. This was based on an earlier Unicode standard that supported 64K characters.

Your String constant "Côte d'Ivoire" is in this format. If your character encoding on your XML document is correct then the String read from there will also be in the correct format. So possible errors are:

  1. The XML document doesn't declare a character encoding;

  2. The declared character encoding does not match the actual character encoding used.

Perhaps the XML string is being treated as US-ASCII instead of UTF-8. I would output both strings and eyeball them. If they look the same, compare them character by character to see where the comparison fails. You may also want to compare the UTF-8 encoding of your constant String to what's in the XML document:

byte[] bytes = "Côte d'Ivoire".getBytes("UTF-8");
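
The character-by-character and byte-by-byte comparison suggested above could be sketched roughly as follows. The class `StringDump` and its three helpers are hypothetical names, and the "misdecoded" string in main is a made-up bad input (UTF-8 bytes decoded as ISO-8859-1) used only to show what a mismatch looks like; the actual failure in the question may be different.

```java
import java.nio.charset.StandardCharsets;

public class StringDump {
    /** Returns each UTF-16 code unit as four hex digits, separated by spaces. */
    static String hexChars(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    /** Returns the UTF-8 encoding of the string as two-digit hex bytes. */
    static String hexUtf8(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Index of the first differing char, or -1 if the strings are equal. */
    static int firstDifference(String a, String b) {
        int n = Math.min(a.length(), b.length());
        for (int i = 0; i < n; i++) {
            if (a.charAt(i) != b.charAt(i)) return i;
        }
        return a.length() == b.length() ? -1 : n;
    }

    public static void main(String[] args) {
        String expected = "C\u00F4te";
        // Hypothetical bad input: the UTF-8 bytes C3 B4 decoded as ISO-8859-1
        // become the two chars U+00C3 U+00B4 instead of the single char U+00F4.
        String misdecoded = new String(
                expected.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
        System.out.println(hexChars(expected));    // 0043 00F4 0074 0065
        System.out.println(hexChars(misdecoded));  // 0043 00C3 00B4 0074 0065
        System.out.println(hexUtf8(expected));     // 43 C3 B4 74 65
        System.out.println(firstDifference(expected, misdecoded)); // 1
    }
}
```

Running the same dump on the string returned by the parser and on the constant should show at which index, and with which values, the two diverge.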

It gets more complicated when you start getting into "supplementary characters". These are characters beyond the originally intended 64K ("code points" in Unicode parlance). See Supplementary Characters in the Java Platform. This isn't an issue with any of the characters you're using but it's worth noting for completeness.
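
For completeness, here is a small sketch of what a supplementary character looks like from Java's point of view. The class `SupplementaryDemo` and the helper `codePointLength` are hypothetical names, and U+1D11E (MUSICAL SYMBOL G CLEF) is just an arbitrary example of a code point above U+FFFF.

```java
public class SupplementaryDemo {
    /** Number of Unicode code points in the string, as opposed to UTF-16 chars. */
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1D11E lies outside the first 64K code points, so UTF-16 stores it
        // as a surrogate pair of two chars.
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println(clef.length());         // 2
        System.out.println(codePointLength(clef)); // 1
    }
}
```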

cletus
Correct! My XML document has no declaration specifying the character encoding. What default character encoding is assumed?
cppdev
@cppdev UTF-8 is assumed *generally*. See http://www.opentag.com/xfaq_enc.htm
cletus
@cppdev but what you need to do is see where the comparison fails. Compare character by character and byte by byte to see where the difference lies. From there you/we should be able to figure out the why.
cletus
@cppdev What is `getXMLNodeString()`? Does it return a `String` or something else? Possibly a text node? That's not a standard function. Maybe you should post the code to that.
cletus
@cletus - it gets *even more* complicated when you take into account composed and decomposed forms of accented characters and ligatures.
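
A minimal sketch of the composed versus decomposed case, using java.text.Normalizer (available since Java 6). The class `NormalizedCompare` and the helper `equalsNormalized` are hypothetical names. The choice of NFC is an assumption for this example; ligatures such as U+FB01 are only folded by the compatibility forms NFKC and NFKD, which this sketch does not use.

```java
import java.text.Normalizer;

public class NormalizedCompare {
    /** Compares two strings after converting both to NFC (canonical composition). */
    static boolean equalsNormalized(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        // Composed: the accented letter is the single code point U+00F4.
        String composed = "C\u00F4te d'Ivoire";
        // Decomposed: plain 'o' followed by U+0302 COMBINING CIRCUMFLEX ACCENT.
        String decomposed = "Co\u0302te d'Ivoire";

        System.out.println(composed.equals(decomposed));            // false
        System.out.println(equalsNormalized(composed, decomposed)); // true
    }
}
```

The two strings render identically on screen, which is why eyeballing the output is not enough to rule this case out.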
Stephen C