Hello,

I am using the SAX parser to parse an XML file. It works fine, but I don't want to parse the content of an <info> tag, as it contains HTML that I want to save to a string. Can anyone tell me whether there is a way to do this?

Thanks

A: 

This is pseudocode. Adapt before use. Use at your own risk.

This will not take care of <info> tags nested inside the outer info tag.

init:
  ignore = false;

startElement:
  if (!ignore) {
    if (element.name == "info") {
      ignore = true;
    } else {
      process normally
    }
  }

endElement:
  if (ignore) {
    if (element.name == "info") {
      ignore = false;
    }
  } else {
    process normally
  }
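
For reference, the pseudocode above might translate to Java's SAX API roughly as follows. This is only a sketch: the class name SkipInfoHandler, the seen list, and the helper elementsOutsideInfo are made up for illustration. A depth counter replaces the boolean flag so that nested <info> tags are skipped as well, which is a change from the pseudocode. It still assumes the content of <info> is well-formed XML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: records the names of elements found outside <info>.
class SkipInfoHandler extends DefaultHandler {
    final List<String> seen = new ArrayList<>();
    private int infoDepth = 0; // greater than 0 while inside an <info> element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("info".equals(qName)) {
            infoDepth++;              // entering <info> (possibly nested)
        } else if (infoDepth == 0) {
            seen.add(qName);          // stands in for "process normally"
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("info".equals(qName)) {
            infoDepth--;              // leaving one level of <info>
        }
        // Other end tags would be processed normally here when infoDepth == 0.
    }

    // Convenience wrapper (assumption): parse a string, return the recorded names.
    static List<String> elementsOutsideInfo(String xml) {
        try {
            SkipInfoHandler handler = new SkipInfoHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.seen;
        } catch (Exception e) {
            throw new IllegalStateException("XML could not be parsed", e);
        }
    }
}
```

The default SAXParserFactory is not namespace-aware, so the tag name arrives in qName; with a namespace-aware parser you would probably compare localName instead.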
Carl Smotricz
But he will still get SAX events for the HTML part, right? So an unclosed `<P>` will spoil everything.
Adrian
True enough. This assumes HTML that's valid XML.
Carl Smotricz
If this is a concern, then my alternative solution would be to grab the XML as a String, use a regular expression to strip out all runs between and including the info tags, then send the remainder for normal XML parsing. Regular expressions are generally considered unsuitable for parsing XML or HTML, but as long as info tags are not nested and don't appear in text strings, it should be OK.
Carl Smotricz
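
A minimal sketch of that regex approach in Java, assuming the tags are written exactly as <info> and </info> (no attributes, no nesting). The class name InfoStripper and both method names are hypothetical. One method saves the HTML, the other returns the remainder for normal XML parsing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InfoStripper {
    // Non-greedy match; DOTALL lets the HTML span several lines.
    private static final Pattern INFO_BLOCK =
            Pattern.compile("<info>(.*?)</info>", Pattern.DOTALL);

    // Returns the raw text found between each <info> and </info> pair.
    static List<String> infoContents(String xml) {
        List<String> contents = new ArrayList<>();
        Matcher m = INFO_BLOCK.matcher(xml);
        while (m.find()) {
            contents.add(m.group(1));
        }
        return contents;
    }

    // Returns the document with every <info>...</info> run removed.
    static String stripInfo(String xml) {
        return INFO_BLOCK.matcher(xml).replaceAll("");
    }
}
```

The output of stripInfo can then be handed to the SAX parser, since the possibly malformed HTML is no longer in it.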
+2  A: 

Tough question. The best approach might be to preprocess the stream, escaping the part between <info> and </info> yourself. You could, for example, write a wrapper around the input stream that transforms your input on the fly, so that the SAX parser only ever sees valid XML.

Adrian
Preprocessing looks like a nice idea, thanks! I'll put in a CDATA tag just after it gets the info tag.
Kyle
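
One way the CDATA idea could look in Java, as a rough sketch. It works on the whole document as a String rather than on the stream, and it assumes plain, non-nested <info> and </info> tags. The class InfoCdataWrapper and its methods are invented for this example.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class InfoCdataWrapper {
    private static final Pattern INFO_BLOCK =
            Pattern.compile("<info>(.*?)</info>", Pattern.DOTALL);

    // Wraps the body of every <info> element in a CDATA section.
    static String wrapInfoInCdata(String xml) {
        Matcher m = INFO_BLOCK.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // A literal "]]>" would close the CDATA section early, so split it in two.
            String body = m.group(1).replace("]]>", "]]]]><![CDATA[>");
            m.appendReplacement(out,
                    Matcher.quoteReplacement("<info><![CDATA[" + body + "]]></info>"));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Preprocesses, parses with SAX, and returns the text of the <info> elements
    // (concatenated if there is more than one).
    static String readInfo(String xml) {
        final StringBuilder html = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private boolean inInfo = false;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if ("info".equals(qName)) inInfo = true;
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("info".equals(qName)) inInfo = false;
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inInfo) html.append(ch, start, length); // CDATA arrives as plain characters
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(wrapInfoInCdata(xml))), handler);
        } catch (Exception e) {
            throw new IllegalStateException("XML could not be parsed", e);
        }
        return html.toString();
    }
}
```

Because the HTML sits inside CDATA, the parser reports it through characters() and never tries to interpret the tags, so an unclosed <P> no longer matters.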
A: 

Is your XML very large? If not, you can load it all into a string and then use XPath queries to access the nodes of interest.
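
A small sketch of that idea using the javax.xml.xpath API that ships with Java. The class name XPathLookup and the element names in the usage note are assumptions. XPath needs well-formed input, so this only helps if the HTML inside <info> is valid XML or has been escaped beforehand.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class XPathLookup {
    // Evaluates an XPath expression against XML held in a String and
    // returns the string value of the first matching node.
    static String evaluate(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }
}
```

For example, XPathLookup.evaluate(xml, "/root/title") would return the text of a title element under root. Note that the string value of a node contains only its text, so querying /root/info this way drops the HTML markup.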

sylvanaar