I have a contact who is having trouble with SAX when parsing RSS and Atom files. According to him, text coming from the Item elements appears to be truncated at an apostrophe or, sometimes, at an accented character. There also seems to be an encoding problem.

I've given SAX a try myself and I see some truncation too, but I haven't been able to dig any further. I'd appreciate suggestions from anyone who has tackled this before.

This is the code that's being used in the ContentHandler:

public void characters(char[] ch, int start, int length) throws SAXException {
    // Note: the third argument is a length, not an end index.
    link = new String(ch, start, length);
}

Edit: The encoding problem might come from storing the data in a byte array, since Java strings are Unicode internally.

+1  A: 

How are you passing the input to SAX? As InputStream (recommended) or Reader? So, starting from your byte[], try using the ByteArrayInputStream.
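A minimal sketch of that suggestion, assuming the feed is already sitting in a byte[]. The class name ByteArrayFeedDemo, the parseText helper and the sample document are made up for illustration and are not from the asker's code. The point is that the parser is handed raw bytes, so it can take the encoding from the XML declaration itself.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class ByteArrayFeedDemo {

    // Hypothetical helper: parses raw feed bytes and returns all character data found.
    static String parseText(byte[] feedBytes)
            throws ParserConfigurationException, SAXException, IOException {
        final StringBuilder text = new StringBuilder();
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // Append every chunk; the text may arrive in several calls.
                text.append(ch, start, length);
            }
        });
        // Bytes in, not a Reader: the parser decodes them using the declared encoding.
        reader.parse(new InputSource(new ByteArrayInputStream(feedBytes)));
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // Sample document (an assumption), encoded as ISO-8859-1 to match its declaration.
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                + "<title>Caf\u00e9 d'\u00e9t\u00e9</title>";
        byte[] bytes = xml.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(parseText(bytes));
    }
}
```

If the bytes were decoded into a String with the wrong charset before reaching the parser, the accented characters would already be damaged at that point, which is why passing the bytes through untouched tends to be the safer route.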

Egon Willighagen
Egon, I've taken a peek in the Channel class and an XMLReader is being used. The ContentHandler is set and then the parse() method is called. That seems about it.
James P.
You could have a look at my code: http://cdk.git.sourceforge.net/git/gitweb.cgi?p=cdk/cdk;a=blob;f=src/main/org/openscience/cdk/io/CMLReader.java;h=490743955939b8a003c95769c3261b06eb341842;hb=HEAD
Egon Willighagen
BTW, what XML parser are you using? The code I just linked allows for three different XML parsers, defaulting to the one that comes with (newer) Java versions, then Aelfred, and lastly Xerces.
Egon Willighagen
Sorry it took a while to reply. I had a look at your code and it appears that the default parser is being used.
James P.
+4  A: 

The characters() method is not guaranteed to give you the complete character content of a text element in one pass - the full text may span buffer boundaries. You need to buffer the characters yourself between the start and end element events.

e.g.

StringBuilder builder;

public void startElement(String uri, String localName, String qName, Attributes atts) {
   builder = new StringBuilder();
}

public void characters(char[] ch, int start, int length) {
   builder.append(ch,start,length);
}

public void endElement(String uri, String localName, String qName) {
  String theFullText = builder.toString();
}
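
For something that can be run as-is, here is a sketch of the same buffering idea wrapped in a complete handler. The class name TitleCollector, the choice of collecting only title elements and the sample feed are assumptions for the sake of the example. The buffer is only active while inside the element of interest, so text from other elements is ignored.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class TitleCollector extends DefaultHandler {
    private final List<String> titles = new ArrayList<>();
    private StringBuilder buffer; // non-null only while inside a <title> element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("title".equals(qName)) {
            buffer = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (buffer != null) {
            // May be called several times for one element, so append rather than assign.
            buffer.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("title".equals(qName) && buffer != null) {
            titles.add(buffer.toString());
            buffer = null;
        }
    }

    List<String> getTitles() {
        return titles;
    }

    // Hypothetical convenience method: parses a document held in a String.
    static List<String> parse(String xml) throws Exception {
        TitleCollector handler = new TitleCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.getTitles();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<rss><channel><item>"
                + "<title>Tom &amp; Jerry's caf\u00e9</title>"
                + "</item></channel></rss>";
        System.out.println(parse(xml));
    }
}
```

However the parser decides to split the title into characters() calls, the collected string is the whole title, apostrophe and accent included.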
Alex Vigdor
Shouldn't the appends to the StringBuilder be synchronized? Or should a StringBuffer be used instead?
ruchirhhi
No synchronization required - SAX parsing is single threaded, and generally one would use a separate ContentHandler for each document parsed. If you want to reuse ContentHandlers you would be best off using a ThreadLocal or other pooling mechanism - it would be very difficult to write a ContentHandler that could simultaneously handle multiple parsing streams in separate threads, as how would it keep track of which event came from which document?
Alex Vigdor
+2  A: 

XML entities generate special events in SAX. You can catch them with a LexicalHandler, though it's generally not necessary. But this explains why you can't assume that you will receive only one characters event per tag. Use a buffer as explained in the other answers.

For instance hello&amp;world will generate the sequence

  • startElement
  • characters hello
  • startEntity
  • characters &
  • endEntity
  • characters world

Have a look at the Auxiliary SAX interfaces if you want some more examples. Other special events include external entities, comments, CDATA sections, etc.
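
To see the events for yourself, here is a small sketch that registers a LexicalHandler and logs what arrives. The class name EventLogger and the sample input are assumptions. Whether startEntity and endEntity are actually reported for a predefined entity such as &amp;amp;, and how the text is split across characters() calls, can vary between parsers and settings, so the sketch just prints what it observes rather than assuming a fixed sequence.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// DefaultHandler2 implements both ContentHandler and LexicalHandler.
class EventLogger extends DefaultHandler2 {
    final List<String> events = new ArrayList<>();
    final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        events.add("startElement " + qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        events.add("characters " + new String(ch, start, length));
        text.append(ch, start, length); // buffering keeps the text whole regardless of splits
    }

    @Override
    public void startEntity(String name) {
        events.add("startEntity " + name);
    }

    @Override
    public void endEntity(String name) {
        events.add("endEntity " + name);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("endElement " + qName);
    }

    // Hypothetical helper: parses the given XML and returns the populated logger.
    static EventLogger parse(String xml) throws Exception {
        EventLogger handler = new EventLogger();
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(handler);
        // Standard SAX2 property used to register a LexicalHandler.
        reader.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler;
    }

    public static void main(String[] args) throws Exception {
        EventLogger log = parse("<doc>hello&amp;world</doc>");
        for (String event : log.events) {
            System.out.println(event);
        }
        System.out.println("buffered text: " + log.text);
    }
}
```

Whatever the event list looks like on a given parser, the buffered text comes out complete, which is the practical takeaway.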

ewernli