When using a SAX parser, parsing fails when there is a " in the node content. How can I resolve this? Do I need to convert all " characters?

In other words, anytime I have a quote in a node:

 <node>characters in node containing "quotes"</node>

That node gets butchered into multiple character arrays when the Handler is parsing it. Is this normal behaviour? Why should quotes cause such a problem?

Here is the code I am using:

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

 ...


HttpGet httpget = new HttpGet(GATEWAY_URL + "/" + question.getId());
httpget.setHeader("User-Agent", PayloadService.userAgent);
httpget.setHeader("Content-Type", "application/xml");

HttpResponse response = PayloadService.getHttpclient().execute(httpget);
HttpEntity entity = response.getEntity();

if (entity != null)
{
    SAXParserFactory spf = SAXParserFactory.newInstance();
    SAXParser sp = spf.newSAXParser();
    XMLReader xr = sp.getXMLReader();

    ConvoHandler convoHandler = new ConvoHandler();
    xr.setContentHandler(convoHandler);
    xr.parse(new InputSource(entity.getContent()));

    entity.consumeContent();

    messageList = convoHandler.getMessageList();
}

A: 

The error is in your handler class referenced in your most recent comment.

A common error in writing a ContentHandler is to assume the characters method is only going to be called once with all the character data. It can in fact be called multiple times with chunks of the character data, which you have to collect. The chopping up into multiple character arrays is normal behavior.

You probably need to initialize a collector (for example a StringBuilder or a StringBuffer) in your startElement method, append to it in your characters method, and then use the collected text in your endElement method, which is where the message.setText call shown in your comment should go.
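
A minimal sketch of that collect-then-use pattern, in case it helps. The class name CollectingHandler, the element name "node" and the parse helper are made up for illustration; your ConvoHandler would call message.setText where this sketch adds to a list.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: gathers the full text of every <node> element.
class CollectingHandler extends DefaultHandler {
    private final List<String> texts = new ArrayList<>();
    private StringBuilder buffer; // non-null only while inside a <node>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if ("node".equals(qName)) {
            buffer = new StringBuilder(); // start a fresh collector
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times per element, so append rather than assign.
        if (buffer != null) {
            buffer.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("node".equals(qName)) {
            texts.add(buffer.toString()); // the complete text is available only here
            buffer = null;
        }
    }

    public List<String> getTexts() {
        return texts;
    }

    // Convenience helper (an assumption for this sketch): parse an XML string.
    static List<String> parse(String xml) throws Exception {
        CollectingHandler handler = new CollectingHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.getTexts();
    }
}
```

With this in place the quotes need no special treatment: however the parser chunks the text, the pieces are joined again before endElement uses them.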

Don Roby
Thank you very much. I was not aware of this - I will refactor my code accordingly. Do you know if there is any rule about when you need to collect? It sounds like this must be done for any text field, but not for boolean or number values. Is this true? Or should a collector be used for every node parsed?
In XML, it's really all text (at least from the viewpoint of SAX parsing). Data representing booleans and numbers is less likely to be split into multiple chunks, as the values are smaller and don't include as much variation in content, but in theory it can still be split.
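
To illustrate why numbers should be collected too, here is a small sketch that drives a handler by hand and delivers a numeric value in two chunks. The class CountHandler, the element name "count" and the demo method are hypothetical; a real parser decides for itself where (and whether) to split.

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: reads an integer from a <count> element.
class CountHandler extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    private int count;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if ("count".equals(qName)) {
            buffer.setLength(0); // reset the collector
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("count".equals(qName)) {
            // Convert only once the whole text has arrived.
            count = Integer.parseInt(buffer.toString().trim());
        }
    }

    int getCount() {
        return count;
    }

    // Simulates a parser that splits "1234" into "12" and "34".
    static int demo() {
        CountHandler h = new CountHandler();
        h.startElement("", "count", "count", null); // attributes are unused here
        h.characters("12".toCharArray(), 0, 2);
        h.characters("34".toCharArray(), 0, 2);
        h.endElement("", "count", "count");
        return h.getCount();
    }
}
```

A handler that converted the text inside characters would see 12 and then 34 in this scenario; collecting first yields 1234.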
Don Roby