Hi,

I have to parse the content I get from the web and it can contain special characters. In this case the content string appears like the following:

<?xml version="1.0" encoding="UTF-8"?>
<products>
  <product>
    <id>1</id>
    <price>2.14</price>
    <title>test &#382; test</title>

When the content above is passed to the characters() method of a class that extends org.xml.sax.helpers.DefaultHandler:

public class ProductsXMLHandler extends DefaultHandler {
    ...

    @Override
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        String elementValue = new String(ch, start, length);
        ...
    }

I noticed that the text test &#382; test arrives split into three chunks: 'test ', '&#382;' and ' test', so elementValue never equals test &#382; test, which is the result I expect. Does anyone know how to solve this problem?

Is it necessary to recode the source string:

<?xml version="1.0" encoding="UTF-8"?>
<products>
  <product>
    <id>1</id>
    <price>2.14</price>
    <title>test &#382; test</title>

before it is passed to the XML handler class?

Thank you!

+2  A: 

Do you mean that characters is being called three times? If so, you just need to make your code handle that - the parser is perfectly at liberty to do this. You shouldn't assume that you'll get all character data in one call.

From the documentation for DocumentHandler.characters():

SAX parsers may return all contiguous character data in a single chunk, or they may split it into several chunks; however, all of the characters in any single event must come from the same external entity, so that the Locator provides useful information.
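To see this behaviour for yourself, here is a minimal sketch that records every chunk the parser reports. The class name ChunkLogger and the helper chunksOf are made up for illustration, and the one-element document is an assumed, well-formed stand-in for the truncated XML in the question. How many chunks you get depends on the parser implementation, so the sketch only relies on the chunks joining back into the full text.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class: collects each characters() call as a separate string.
public class ChunkLogger extends DefaultHandler {
    // Every chunk reported by the parser, in order of arrival.
    final List<String> chunks = new ArrayList<>();

    @Override
    public void characters(char[] ch, int start, int length) {
        chunks.add(new String(ch, start, length));
    }

    // Parses the given XML text and returns the chunks the parser delivered.
    static List<String> chunksOf(String xml) throws Exception {
        ChunkLogger handler = new ChunkLogger();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.chunks;
    }

    public static void main(String[] args) throws Exception {
        List<String> chunks = chunksOf("<title>test &#382; test</title>");
        // The chunk count is parser-dependent; the joined text is not.
        System.out.println(chunks.size() + " chunk(s): " + chunks);
        System.out.println("joined: " + String.join("", chunks));
    }
}
```

Note that the parser resolves the character reference &#382; itself, so the joined text contains the actual character (U+017E) and the source string does not need to be recoded beforehand.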

Jon Skeet
+3  A: 

As Jon Skeet said in his answer, characters is called multiple times. What you should do is the following:

  • in startElement, create a StringBuffer and record (in a boolean, for example) whether you are in the tag you are looking for.
  • in characters, if you are in that tag (the boolean set earlier is true), append the characters to the StringBuffer.
  • in endElement, if you are leaving that tag (same boolean check as earlier), take the content of the StringBuffer and voilà! There is your complete string. Don't forget to empty the StringBuffer after that.
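The steps above could be sketched roughly as follows. This is only an illustration under a few assumptions: the class name TitleHandler, the field title and the helper parseTitle are invented for the example, the element of interest is assumed to be title as in the question, and a StringBuilder stands in for the StringBuffer (either works here, since the handler is used from a single thread).

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: accumulates the text of <title> across characters() calls.
public class TitleHandler extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    private boolean inTitle = false;
    String title; // complete text of the last <title> element seen

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attributes) {
        if ("title".equals(qName)) {
            inTitle = true;
            buffer.setLength(0); // start from an empty buffer for this element
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times per element; just keep appending.
        if (inTitle) {
            buffer.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("title".equals(qName)) {
            title = buffer.toString(); // all chunks joined
            inTitle = false;
        }
    }

    // Convenience wrapper for the example: parse XML text, return the title.
    static String parseTitle(String xml) throws Exception {
        TitleHandler handler = new TitleHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.title;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<products><product><id>1</id><price>2.14</price>"
                + "<title>test &#382; test</title></product></products>";
        System.out.println(parseTitle(xml));
    }
}
```

The string is only read in endElement, so it does not matter whether the parser delivers the text in one call or in several.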
Valentin Rocher
+2  A: 

I don't think you can do anything about it; this is how the SAX API is specified. Specifically, from http://java.sun.com/javase/6/docs/api/org/xml/sax/ContentHandler.html#characters(char[],%20int,%20int)

The Parser will call this method to report each chunk of character data. SAX parsers may return all contiguous character data in a single chunk, or they may split it into several chunks; however, all of the characters in any single event must come from the same external entity so that the Locator provides useful information.

(My emphasis)

Jack Leow