ansaurus

Question

Answer 1

A:

This is pseudocode. Adapt before use. Use at your own risk.

This will not take care of <info> tags nested inside the outer info tag.

init:
  ignore = false;

startElement:
  if (!ignore) {
    if (element.name == "info") {
      ignore = true;
    } else {
      process normally
    }
  }

endElement:
  if (ignore) {
    if (element.name == "info") {
      ignore = false;
    }
  } else {
    process normally
  }
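
A minimal Java sketch of the pseudocode above, assuming a non-namespace-aware parser (so qName carries the element name) and content inside info that is well-formed XML. The class name InfoSkippingHandler, the seen and text fields, and the parse helper are hypothetical names chosen for illustration. It also drops characters() events while ignoring, which the pseudocode does not mention, and it keeps the same limitation with nested info tags.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: records element names and text seen outside <info>.
class InfoSkippingHandler extends DefaultHandler {
    private boolean ignore = false;                 // true while inside <info>
    final List<String> seen = new ArrayList<>();    // element names processed normally
    final StringBuilder text = new StringBuilder(); // character data processed normally

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (ignore) {
            return;                                 // swallow everything inside <info>
        }
        if ("info".equals(qName)) {
            ignore = true;
        } else {
            seen.add(qName);                        // stands in for "process normally"
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (!ignore) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (ignore && "info".equals(qName)) {
            ignore = false;                         // a nested </info> would end the skip too early
        }
        // when not ignoring, normal end-of-element processing would go here
    }

    // Convenience helper (an assumption, not part of SAX): parse a string with this handler.
    static InfoSkippingHandler parse(String xml) {
        try {
            InfoSkippingHandler handler = new InfoSkippingHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler;
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }
}
```

Calling InfoSkippingHandler.parse(xml) and reading seen and text afterwards shows which parts of the document were processed.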

Carl Smotricz 2009-12-10 13:43:25

But he will still get SAX events for the HTML part, right? So an unclosed `<P>` will spoil everything.

Adrian 2009-12-10 13:45:08

True enough. This assumes HTML that's valid XML.

Carl Smotricz 2009-12-10 13:48:05

If this is a concern, my alternative solution would be to read the XML into a String, use a regular expression to strip out every run between and including the info tags, and then send the remainder for normal XML parsing. Regular expressions are generally considered unsuitable for parsing XML or HTML, but as long as info tags are not nested and don't appear in text strings, it should be OK.
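
A rough sketch of that regex approach in Java, under the same assumptions (info tags are not nested, are not self-closing, and do not appear in text strings). The names InfoStripper and stripInfo are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper: removes every <info>...</info> run before XML parsing.
class InfoStripper {
    // (?s) lets '.' match newlines; '.*?' is reluctant, so each run ends
    // at the first closing tag and separate runs are removed independently.
    private static final Pattern INFO_RUN =
            Pattern.compile("(?s)<info\\b[^>]*>.*?</info\\s*>");

    static String stripInfo(String xml) {
        return INFO_RUN.matcher(xml).replaceAll("");
    }
}
```

The returned string can then be handed to the SAX parser through a StringReader wrapped in an InputSource.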

Carl Smotricz 2009-12-10 13:50:13

Answer 2

+2 A:

Tough question. The best approach might be to preprocess the stream, escaping the part between <info> and </info> yourself. You could, for example, write a wrapper around the input stream that transforms your input on the fly, so that the SAX parser only ever sees valid XML.
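
One possible way to do the escaping is to wrap the content of each info element in a CDATA section. The sketch below is a simplification: it works on the whole document as a String rather than transforming a stream on the fly, and it assumes info tags are not nested. InfoCdataWrapper and wrapInfoInCdata are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical preprocessor: makes the content of <info> opaque to the XML parser.
class InfoCdataWrapper {
    // Groups: 1 = opening tag, 2 = raw content, 3 = closing tag.
    private static final Pattern INFO =
            Pattern.compile("(?s)(<info\\b[^>]*>)(.*?)(</info\\s*>)");

    static String wrapInfoInCdata(String xml) {
        Matcher m = INFO.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // "]]>" would end the CDATA section early, so split it across two sections.
            String body = m.group(2).replace("]]>", "]]]]><![CDATA[>");
            String wrapped = m.group(1) + "<![CDATA[" + body + "]]>" + m.group(3);
            // quoteReplacement keeps '$' and '\' in the content literal.
            m.appendReplacement(out, Matcher.quoteReplacement(wrapped));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

With this in place the parser reports the HTML inside info as plain character data instead of element events, so an unclosed tag there no longer breaks parsing.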

Adrian 2009-12-10 13:43:43

Preprocessing looks like a nice idea, thanks! I'll insert a CDATA section right after the opening info tag.

Kyle 2009-12-10 19:42:00

Answer 3

A:

Is your XML very large? If not, you can load it all into a string and then use XPath queries to access the nodes of interest.
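
A small sketch of that idea using the javax.xml.xpath API from the JDK, assuming the string holds well-formed XML. The XPathLookup class and the example expression are hypothetical.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper: evaluates an XPath expression against XML held in a String.
class XPathLookup {
    static String evaluate(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Returns the string value of the first node matching the expression.
            return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("bad expression or malformed XML", e);
        }
    }
}
```

For example, XPathLookup.evaluate(xml, "/doc/title") reads the title element of a doc root while leaving any info element untouched.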

sylvanaar 2009-12-10 14:03:46

Sax parser: Ignoring HTML