I've encountered what I think is strange behavior when using the SAX parser, and I wanted to know if it's normal.

I'm sending this XML through the SAX parser:

<site url="http://example.com/?a=b&amp;b=c" />

The "&amp;" gets converted to "&#38;" by the time the startElement callback is called. Is it supposed to do that? If so, I would like to understand why.

I've pasted an example demonstrating the issue here:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <libxml/parser.h>

/* Print every attribute name and value of each element. */
static void start_element(void *ctx, const xmlChar *name, const xmlChar **atts)
{
  int i = 0;
  /* atts is NULL when the element has no attributes. */
  while (atts != NULL && atts[i] != NULL) {
    printf("%s\n", atts[i]);
    i++;
  }
}

int main(int argc, char *argv[]) {
  xmlSAXHandlerPtr handler = calloc(1, sizeof(xmlSAXHandler));
  handler->startElement = start_element;

  const char *xml = "<site url=\"http://example.com/?a=b&amp;b=c\" />";

  xmlSAXUserParseMemory(handler, NULL, xml, (int)strlen(xml));

  free(handler);
  return 0;
}

Thank you!

PS: this message is actually extracted from the LibXML2 list... and I am not the initial author of this mail, but I noticed the problem using Nokogiri and Aaron (the maintainer of Nokogiri) actually posted this message himself.

+4  A: 

This message describes the same problem (which I had as well) and the response says to

ask the parser to replace entities values

In practice, that means setting the XML_PARSE_NOENT option when you set up your parser context, like this:

xmlParserCtxtPtr context = xmlCreatePushParserCtxt(&yourSAXHandlerStruct, self, NULL, 0, NULL);
xmlCtxtUseOptions(context, XML_PARSE_NOENT);
Don