And it's not '&'

I'm using the SAXParser object to parse the actual XML.

This is normally done by passing a URL to the XMLReader.Parse method. Because my XML is coming from a POST request to a webservice, I am saving that result as a String and then employing StringReader / InputSource to feed this string back to the XMLReader.Parse method.

However, something strange is happening at the 2001st character of the XMLstring.
The 'characters' method of the document handler is being called TWICE in between the startElement and endElement methods, effectively breaking my string (in this case a project title) into two pieces. Because I am instantiating objects in my characters method, I am getting two objects instead of one.

This line, about 2000 chars into the string, fires 'characters' two times, breaking between "LOWER" and "LEVEL":

    <title>SUMC-BOOKSTORE, LOWER LEVEL RENOVATIONS</title>

When I bypass the StringReader / InputSource workaround and feed a flat XML file to XMLReader.Parse, it works absolutely fine.

Something about StringReader and/or InputSource is somehow screwing this up.

Here is my method that takes an XML string and parses it through the SAXParser.

    public void parseXML(String XMLstring) {
        try {
            SAXParserFactory spf = SAXParserFactory.newInstance();
            SAXParser sp = spf.newSAXParser();
            XMLReader xr = sp.getXMLReader();
            xr.setContentHandler(this);

            // Something is happening in the StringReader or InputSource
            // that cuts the XML element in half at the 2001 character mark.

            StringReader sr = new StringReader(XMLstring);
            InputSource is = new InputSource(sr);
            xr.parse(is);
        } catch (IOException e) {
            Log.e("CMS1", e.toString());
        } catch (SAXException e) {
            Log.e("CMS2", e.toString());
        } catch (ParserConfigurationException e) {
            Log.e("CMS3", e.toString());
        }
    }

I would greatly appreciate any ideas on how to not have 'characters' firing off twice when I get to this point in the XML String.

Or, show me how to use a POST request and still pass off the URL to the Parse function.

THANK YOU.

+1  A: 

It is legitimate for the characters method to fire multiple times between startElement and endElement in a SAXParser. If your implementation isn't handling it, most likely the ContentHandler being used has an incorrectly coded characters method.

From the code snippet, I think the misbehaving characters method is elsewhere in your code, as you're passing 'this' as ContentHandler. Post that code, and maybe we can help fix it.

See the Javadoc, noting the phrase

SAX parsers may return all contiguous character data in a single chunk, or they may split it into several chunks

This Javadoc is for ContentHandler. It appears you're using DocumentHandler, which has been deprecated in favor of ContentHandler. But the Javadoc for DocumentHandler contains identical language.
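
To see the chunking for yourself, here is a minimal sketch that records every slice handed to characters. The class name ChunkDemo and the sample input (the title from the question with an `&amp;` added) are made up for illustration, and how many chunks you get depends on the parser implementation, so the only thing you can rely on is that the chunks joined together give the full text.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the original post.
class ChunkDemo {
    // Parses xml and returns each chunk that characters() delivered, in order.
    static List<String> chunks(String xml) throws Exception {
        final List<String> seen = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // Only the slice [start, start + length) belongs to this call.
                seen.add(new String(ch, start, length));
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return seen;
    }

    public static void main(String[] args) throws Exception {
        // Made-up input: an entity reference is one common place for a split.
        String xml = "<title>SUMC-BOOKSTORE, LOWER &amp; UPPER LEVEL RENOVATIONS</title>";
        List<String> parts = chunks(xml);
        System.out.println(parts.size() + " chunk(s): " + parts);
        System.out.println("joined: " + String.join("", parts));
    }
}
```

Whatever the chunk count turns out to be on your platform, a handler that treats each call as a complete value will break sooner or later.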

Don Roby
Thanks donroby. Considering that the code only produces poor results when the StringReader and InputSource objects are used, it appears to me that the problem lies there. When I bypass this implementation, it processes correctly, albeit unsatisfactorily for production. Consider also that regardless of the sort order used on the XML data, the problem occurs 2001 characters into the XML. Thanks!
FauxReal
When you implement things incorrectly, sometimes they work in spite of your error. The problem lies in your code regardless of the fact that it sometimes seems to work.
Don Roby
+3  A: 

As donroby said, it's perfectly legitimate for the parser to call the characters method more than once between startElement and endElement. However, that isn't "misbehaving" at all, and you shouldn't try to finagle things so that it doesn't happen. Your parser seems to be using a 2000-character buffer, but there are other reasons it might break a text node into parts.

What you should do is to accumulate data in the characters method and process it later, in the endElement method when you are sure you have accumulated all of the character data for the node.
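
A rough sketch of that pattern, assuming the element of interest is `<title>` as in the question. The class name TitleCollector, the parseTitles helper, and the use of a StringBuilder are my own choices for illustration, not the asker's code. It also assumes a non-namespace-aware parser from the standard JDK, where the element name arrives in qName; on other platforms you may need to check localName instead.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the full text of every <title> element.
class TitleCollector extends DefaultHandler {
    final List<String> titles = new ArrayList<>();
    private StringBuilder text; // non-null only while inside a <title>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("title".equals(qName)) {
            text = new StringBuilder(); // start a fresh accumulator
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (text != null) {
            text.append(ch, start, length); // may run several times per element
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("title".equals(qName)) {
            titles.add(text.toString()); // all chunks have arrived by now
            text = null;                 // detach the accumulator
        }
    }

    static List<String> parseTitles(String xml) throws Exception {
        TitleCollector handler = new TitleCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.titles;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<projects><title>SUMC-BOOKSTORE, LOWER LEVEL RENOVATIONS</title></projects>";
        System.out.println(parseTitles(xml));
    }
}
```

Any object creation that used to live in characters would move into endElement, where the accumulated string is known to be complete.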

Paul Clapham
+1. Yes, the usual handling is to create or attach an accumulator of some sort in the startElement method, accumulate into it in the characters method, and then to use and dispose or detach it in the endElement method.
Don Roby
A: 

Thank you both so much for your responses. With your help I was able to solve the problem.

I was doing the actual processing inside the "characters" method, which is what I learned from an online tutorial.

By moving the processing to the endElement method, I was able to simply concatenate chars together into a string regardless of how many times 'characters' fired.

I accomplished this rather simply by setting up a boolean betweenTags and turning this true during startElement and false at the end of endElement.

Inside characters, I've added

    if (betweenTags) accumulation += chars;

The accumulation string is set to "" at the end of startElement.
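
One caveat for anyone copying the line above: the post doesn't show what `chars` is. If it is the raw `char[]` that SAX passes to characters, then `+=` would append the array's identity string rather than its text, and the array may hold more than the current chunk anyway. A small sketch of the conversion, assuming the standard `characters(char[] ch, int start, int length)` signature (the class and method names here are hypothetical):

```java
// Hypothetical demo class, not part of the original post.
class CharsToStringDemo {
    // Mimics what a characters(char[] ch, int start, int length) callback
    // receives: a shared buffer plus the bounds of the slice for this call.
    static String slice(char[] ch, int start, int length) {
        return new String(ch, start, length);
    }

    public static void main(String[] args) {
        char[] buffer = "LOWER LEVEL".toCharArray();
        String accumulation = "";
        accumulation += slice(buffer, 0, 6); // first call delivers "LOWER "
        accumulation += slice(buffer, 6, 5); // second call delivers "LEVEL"
        System.out.println(accumulation);    // prints "LOWER LEVEL"
    }
}
```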

Works great now, no broken elements.

THANKS!

FauxReal
You're welcome! If you now accept an answer it'll improve someone's reputation and your acceptance ratio.
Don Roby
oh! Okay thanks!
FauxReal