ansaurus

Question

Answer 1

A:

If the content is XHTML, you can use libxml2 to parse it (since it is in fact XML). If it's regular HTML on the other hand, you'd have to use an SGML parser instead.

You 2010-08-14 14:43:22

It says it's `<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">`

Mark Payton 2010-08-14 14:47:59

An XML parser should be sufficient in that case, assuming it's actually *valid* XHTML.

You 2010-08-14 14:59:15
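
To illustrate the caveat in the comment above: a strict XML parser rejects markup that is not well-formed, which is why a page that only declares an XHTML doctype may still fail to parse. The following is a minimal sketch, not part of the original thread. It assumes Python's standard-library `xml.etree.ElementTree` as a stand-in for any strict XML parser (it is not libxml2), the helper name `parses_as_xml` is hypothetical, and it only checks well-formedness, not validity against the DTD.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(markup: str) -> bool:
    """Return True if a strict XML parser accepts the markup (hypothetical helper)."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        # Raised for unclosed or mismatched tags, among other well-formedness errors.
        return False


# Well-formed XHTML-style chunk: every element is closed.
well_formed = "<div><p>Hello<br/>world</p></div>"

# Typical "tag soup": <br> and <p> are never closed.
tag_soup = "<div><p>Hello<br>world</div>"

print(parses_as_xml(well_formed))
print(parses_as_xml(tag_soup))
```

If the second kind of input is what the page actually contains, an XML parser alone will not be enough.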

Answer 2

A:

Well, it seems it's not valid XHTML. Is there maybe some way to tidy HTML chunks?

Mark Payton 2010-08-14 17:35:25

Answer 3

+1 A:

libxml2 has an HTML parser that tolerates malformed/broken HTML. Please check the link here.

Praveen S 2010-08-15 09:12:17
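
As a rough illustration of the answer above, here is a minimal sketch of parsing a broken HTML chunk with libxml2's HTML parser. It is not from the original thread and makes some assumptions: it uses Python with the third-party `lxml` package (a binding built on libxml2), and the sample markup and variable names are made up for the example. In C, the corresponding entry point would likely be `htmlReadMemory` from `libxml/HTMLparser.h`.

```python
from lxml import html  # third-party binding around libxml2 (assumed to be installed)

# Made-up chunk of broken HTML: neither <p> nor <b> is ever closed.
broken_chunk = "<p>First paragraph<p>Second <b>bold paragraph"

# fragment_fromstring parses a chunk rather than a whole document;
# create_parent="div" wraps whatever was recovered in a single <div> element.
root = html.fragment_fromstring(broken_chunk, create_parent="div")

# The recovering parser closes the dangling elements, so the tree can be walked normally.
paragraphs = [p.text_content() for p in root.iter("p")]

print(paragraphs)
print(html.tostring(root, encoding="unicode"))
```

Serializing the tree again with `html.tostring` gives a tidied version of the chunk, which may be enough if the goal is only to clean up the markup.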


libxml2 HTML chunk parsing
