I'm using Java's DocumentBuilder.parse(InputStream) to parse an XML document. Occasionally, I get malformed XML documents in that there is extra junk after the final > that causes a SAXException: Content is not allowed in trailing section. (In the cases I've seen, the junk is simply one or more null bytes.)

I don't care what's after the final >. Is there an easy way to parse an entire XML document in Java and have it ignore any trailing junk?

Note that by "ignore" I don't simply mean to catch and ignore the exception: I mean to ignore the trailing junk, throw no exception, and return the Document object, since the XML up to and including the final > is valid.

A: 

No. A document that contains trailing characters is not an XML document. Fix the sender.

bkail
I have no control over the sender. And your "answer" is not in the spirit of "Be liberal in what you accept and strict in what you emit."
Paul J. Lucas
You asked "is there an easy way to parse an entire XML document in Java and have it ignore any trailing junk?" The answer is "no, there is not", and I gave the reason. Maybe you're looking for http://home.ccil.org/~cowan/XML/tagsoup/ ? Maybe you know that your XML doesn't have CDATA and you can implement a primitive inputStream wrapper? I'm not sure what answer you're looking for.
bkail
Every XML parser keeps track of every element and knows when said element has been "closed" by parsing the > of its closing tag. That means that every XML parser also knows the final > when it sees it because the first element has been balanced by its closing tag. At that point, I want the parser simply to stop. You're making this more complicated than it is.
Paul J. Lucas
I'm not trying to make this complicated. I understand that what you want is conceptually simple, but it doesn't exist. Your only options are to use a non-compliant (or non-XML) parser, modify an existing XML parser to do what you want, or preprocess the input.
bkail
Hopefully the downvote can be removed now that someone else has given the same answer.
bkail
They may have given the same "base answer," but at least they offered ways to actually solve the problem whereas your original answer did not other than the terse and unhelpful "fix the sender."
Paul J. Lucas
The other answer suggests you either: (1) preprocess the input, or (2) catch exceptions. You explicitly stated that #2 was not an option. You dismissed #1 when I suggested it in a comment, so I didn't bother to update my answer. Oh well.
bkail
+1  A: 

Since your sender is presenting you with invalid XML, it needs to be corrected before it hits the parser if you want to avoid this exception. If you can't correct the sender, you'll need a preprocessing step of some sort.

If the situation is simply that you've got extra null bytes after the closing tag, as indicated by one of your responses to another answer, this might be something you can accomplish easily by wrapping your input stream in a FilterInputStream that you implement to skip null bytes.
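
A minimal sketch of such a wrapper follows. The class name NullSkippingInputStream is made up for illustration, and the sketch assumes the document uses an encoding such as UTF-8 or ISO-8859-1 in which a zero byte never occurs inside legitimate content. It would corrupt UTF-16 or UTF-32 input, where zero bytes are part of ordinary characters.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper (not part of the JDK): drops every 0x00 byte it reads.
// Assumption: the XML is in an encoding where a zero byte is never valid content.
class NullSkippingInputStream extends FilterInputStream {

    NullSkippingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b;
        do {
            b = super.read();      // -1 signals end of stream
        } while (b == 0);          // skip null bytes
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        while (true) {
            int n = super.read(buf, off, len);
            if (n <= 0) {
                return n;          // end of stream (or nothing read)
            }
            // Compact the chunk in place, keeping only non-null bytes.
            int kept = 0;
            for (int i = 0; i < n; i++) {
                byte b = buf[off + i];
                if (b != 0) {
                    buf[off + kept++] = b;
                }
            }
            if (kept > 0) {
                return kept;
            }
            // The whole chunk was null bytes; read again rather than return 0.
        }
    }
}
```

You would then call something like builder.parse(new NullSkippingInputStream(rawStream)), where builder is your DocumentBuilder and rawStream is the stream you pass in today. Note that this filters null bytes everywhere in the stream, not only after the closing tag.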

If the problem is more complex than just null characters, you'll of course need a more complex filter, which might be difficult.

If you're using a ContentHandler, you can have it record when the root element's end tag has been handled. The calling code's handler for the exception can then simply ignore the exception if the end has been signalled. At that point anything that had to be done by the parser has likely been done anyway! But this solution doesn't seem to apply to your situation.
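
For anyone who is on SAX, here is a rough sketch of that idea. The names RootAwareHandler and parseLeniently are invented for this example, and collecting element names merely stands in for whatever your real handler does. It assumes the parser reports the trailing junk as a fatal error surfaced as a SAXParseException, which is what the JDK's bundled parser does as far as I know.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: tracks nesting depth so the caller can tell
// whether the root element had already been closed when parsing failed.
class RootAwareHandler extends DefaultHandler {
    private int depth = 0;
    private boolean rootClosed = false;
    private final List<String> elements = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        depth++;
        elements.add(qName);       // placeholder for real processing
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (--depth == 0) {
            rootClosed = true;     // the final > of the root element was seen
        }
    }

    boolean isRootClosed() {
        return rootClosed;
    }

    List<String> getElements() {
        return elements;
    }

    // Parses the stream; swallows a parse error only if it occurred
    // after the root element was completely closed.
    static RootAwareHandler parseLeniently(InputStream in)
            throws ParserConfigurationException, SAXException, IOException {
        RootAwareHandler handler = new RootAwareHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        } catch (SAXParseException e) {
            if (!handler.isRootClosed()) {
                throw e;           // a genuine error inside the document
            }
            // Otherwise the error came from the trailing junk; ignore it.
        }
        return handler;
    }
}
```

Errors inside the document still propagate, since the flag is only set once the root element's end tag has been seen. Note that endDocument() is never called on this path, so any cleanup you do there would need to be triggered by the caller.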

Don Roby