Hi all
I'm parsing an XML document using SAX in Java.
The XML describes research publications in different fields.
Among others, there are elements like "abstract" that briefly describe what the research paper is about. Basic HTML formatting is allowed in that field, but I don't want SAX to treat the HTML tags (like i, b, u, sub, sup and so on) as real XML tags and fire startElement() and endElement() events for those elements.

Is there a way to tell SAX to ignore a predefined set of XML tags and pass their XML markup as is to the characters() method?

A: 

I suspect not, without some work. I would perhaps slot in different SAX handlers as you encounter different elements, and push/pop them off a stack. So when you encounter an <abstract> element, you slot in a new handler that the SAX parser delegates to, and that is intelligent enough to process your HTML elements as you require. Not a trivial solution, I'm afraid.
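
To make the handler-swapping idea concrete, here is a minimal sketch, not a definitive implementation. It assumes the element is named "abstract", that the parser is not namespace-aware (so qName carries the tag name), and it uses hypothetical class and method names (AbstractCapture, PublicationHandler, AbstractHandler, extractAbstracts). With a single level of nesting, one field holding the parent handler stands in for the stack. The nested handler writes each start/end event back out as markup text.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical names throughout; only the org.xml.sax and javax.xml.parsers calls are real API.
class AbstractCapture {

    /** Top-level handler: hands control to AbstractHandler whenever <abstract> opens. */
    static class PublicationHandler extends DefaultHandler {
        private final XMLReader reader;
        final List<String> abstracts = new ArrayList<>();

        PublicationHandler(XMLReader reader) {
            this.reader = reader;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("abstract".equals(qName)) {
                // XMLReader permits replacing the content handler in the middle of a parse.
                reader.setContentHandler(new AbstractHandler(reader, this));
            }
        }
    }

    /** Nested handler: turns every event inside <abstract> back into markup text. */
    static class AbstractHandler extends DefaultHandler {
        private final XMLReader reader;
        private final PublicationHandler parent;
        private final StringBuilder text = new StringBuilder();
        private int depth = 0; // number of tags currently open inside <abstract>

        AbstractHandler(XMLReader reader, PublicationHandler parent) {
            this.reader = reader;
            this.parent = parent;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            text.append('<').append(qName);
            for (int i = 0; i < atts.getLength(); i++) {
                text.append(' ').append(atts.getQName(i))
                    .append("=\"").append(escape(atts.getValue(i))).append('"');
            }
            text.append('>');
            depth++;
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (depth == 0) {
                // This is the closing </abstract>: store the text and restore the parent handler.
                parent.abstracts.add(text.toString());
                reader.setContentHandler(parent);
            } else {
                text.append("</").append(qName).append('>');
                depth--;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // The parser has already resolved entities, so escape again to keep the markup valid.
            text.append(escape(new String(ch, start, length)));
        }
    }

    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    /** Parses the given XML string and returns the markup found inside each <abstract>. */
    static List<String> extractAbstracts(String xml) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            PublicationHandler handler = new PublicationHandler(reader);
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return handler.abstracts;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }
}
```

Calling AbstractCapture.extractAbstracts(xml) returns one string per abstract with the inline tags kept as text. Note that the result is re-serialized rather than copied from the input, so an empty element such as `<br/>` would come back as `<br></br>`.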

Brian Agnew
Even that way, I'd have to convert the data passed to startElement() back into XML. I think this would waste time: SAX would parse the XML into Java objects and I would convert them back to XML. Anyway, I accept this answer because the question was whether such a way exists, so the answer "no" is valid :)
jutky
If you really want to skip tags, you will need random-access capability, which is not available in SAX or StAX. Try DOM, JDOM, vtd-xml, etc.
vtd-xml-author