Hi,

I am trying to scrape some content from an HTML page. I'm using libxml2 and htmlReadMemory to get an xmlDocPtr. The HTML is simple, but it has a problem. It's basically the following:

<tr><td><tr><td>Some content</td></tr></td></tr>

libxml doesn't like the nested tr and td elements. It keeps giving me the following errors:

HTML parser error : Unexpected end tag : td
      </TD>
           ^
HTML parser error : Unexpected end tag : tr
    </TR>

I am using the following option: HTML_PARSE_RECOVER.

At this point nothing I do gets libxml to parse the HTML. I can't change the HTML because I have no access to it.

Does anyone have any clue how I can get libxml to parse this sort of HTML?

Thanks

+1  A: 

What's the exact call you're using to parse? I'd suggest combining these options if you don't want any errors/warnings:

HTML_PARSE_RECOVER|HTML_PARSE_NOERROR|HTML_PARSE_NOWARNING
bosmacs
I do this: theDoc = htmlReadMemory([inData bytes], [inData length], NULL, enc, HTML_PARSE_RECOVER | HTML_PARSE_NOWARNING | HTML_PARSE_NOBLANKS);
Felix Khazin
Does using HTML_PARSE_NOERROR still parse the document even if there are errors in the HTML?
Felix Khazin
Actually, I put in HTML_PARSE_NOERROR and now it's working. Thanks for that!
Felix Khazin
I believe libxml will still correctly parse the document in most cases, but it probably depends on how badly mangled the HTML is.
bosmacs