I'm using libxml2 to parse HTML:

static htmlSAXHandler simpleSAXHandlerStruct = {
    NULL,                       /* internalSubset */
    NULL,                       /* isStandalone   */
    NULL,                       /* hasInternalSubset */
    NULL,                       /* hasExternalSubset */
    NULL,                       /* resolveEntity */
    NULL,                       /* getEntity */
    NULL,                       /* entityDecl */
    NULL,                       /* notationDecl */
    NULL,                       /* attributeDecl */
    NULL,                       /* elementDecl */
    NULL,                       /* unparsedEntityDecl */
    NULL,                       /* setDocumentLocator */
    NULL,                       /* startDocument */
    NULL,                       /* endDocument */
    NULL,                       /* startElement */
    NULL,                       /* endElement */
    NULL,                       /* reference */
    charactersFoundSAX,         /* characters */
    NULL,                       /* ignorableWhitespace */
    NULL,                       /* processingInstruction */
    NULL,                       /* comment */
    NULL,                       /* warning */
    errorEncounteredSAX,        /* error */
    NULL,                       /* fatalError: unused, error() gets all the errors */
    NULL,                       /* getParameterEntity */
    NULL,                       /* cdataBlock */
    NULL,                       /* externalSubset */
    XML_SAX2_MAGIC,             /* initialized */
    NULL,                       /* _private */
    startElementSAXP,           /* startElementNs */
    endElementSAXP,             /* endElementNs */
    NULL,                       /* serror */
};

The charactersFoundSAX and errorEncounteredSAX functions do get called, but the startElementSAXP and endElementSAXP functions never get called.

If I parse XML instead of HTML (and rename every 'html' definition to its 'xml' counterpart, e.g. htmlSAXHandler to xmlSAXHandler), the functions do get called correctly.

Why is that?

A: 

HTML is not namespace-aware, so libxml2's HTML parser only invokes the SAX1 startElement/endElement callbacks. Filling in just the startElementNs/endElementNs slots therefore results in the behaviour you observed: those functions are never called.

Simple fix: Fill in the startElement/endElement slots.

You can easily use wrappers that match the two different signatures and forward to a single underlying function, so the same logic runs in both XML and HTML mode.
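
A minimal sketch of such wrappers follows. The names (ParseState, handleStartElement, startElementSAX1, startElementSAX2 and so on) are hypothetical and not from the question. To keep the snippet compilable on its own, xmlChar is declared locally as a stand-in; real code would include <libxml/HTMLparser.h> instead. The callback parameter lists are meant to mirror libxml2's startElementSAXFunc/endElementSAXFunc and startElementNsSAX2Func/endElementNsSAX2Func types.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in so the sketch compiles without libxml2; drop this typedef
   and include <libxml/HTMLparser.h> in real code. */
typedef unsigned char xmlChar;

/* Hypothetical user state, passed to every callback as ctx. */
typedef struct {
    int open_elements;        /* elements currently open */
    int total_elements;       /* elements seen so far */
    const xmlChar *last_name; /* name passed to the most recent callback */
} ParseState;

/* The one underlying function for element starts. */
static void handleStartElement(void *ctx, const xmlChar *name)
{
    ParseState *st = ctx;
    assert(st != NULL);
    st->open_elements++;
    st->total_elements++;
    st->last_name = name;
}

/* The one underlying function for element ends. */
static void handleEndElement(void *ctx, const xmlChar *name)
{
    ParseState *st = ctx;
    assert(st != NULL);
    st->open_elements--;
    st->last_name = name;
}

/* SAX1 signatures: these go into the startElement/endElement slots,
   which are the ones the HTML parser calls. */
static void startElementSAX1(void *ctx, const xmlChar *name,
                             const xmlChar **atts)
{
    (void)atts; /* attributes are ignored in this sketch */
    handleStartElement(ctx, name);
}

static void endElementSAX1(void *ctx, const xmlChar *name)
{
    handleEndElement(ctx, name);
}

/* SAX2 signatures: these go into the startElementNs/endElementNs slots,
   used by the namespace-aware XML parser. */
static void startElementSAX2(void *ctx, const xmlChar *localname,
                             const xmlChar *prefix, const xmlChar *URI,
                             int nb_namespaces, const xmlChar **namespaces,
                             int nb_attributes, int nb_defaulted,
                             const xmlChar **attributes)
{
    (void)prefix; (void)URI; (void)nb_namespaces; (void)namespaces;
    (void)nb_attributes; (void)nb_defaulted; (void)attributes;
    handleStartElement(ctx, localname);
}

static void endElementSAX2(void *ctx, const xmlChar *localname,
                           const xmlChar *prefix, const xmlChar *URI)
{
    (void)prefix; (void)URI;
    handleEndElement(ctx, localname);
}
```

With these in place, the handler struct from the question would put startElementSAX1 and endElementSAX1 in the startElement and endElement slots, and keep the SAX2 versions in startElementNs and endElementNs. Note that the two interfaces lay out attributes differently (name/value pairs in SAX1, five pointers per attribute in SAX2), so a shared function that needs attributes would have to convert them in the wrappers.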

hroptatyr
As discussed in the comments, this works. Thanks!