In my code I convert a styled XLS document to HTML using OpenOffice. I then parse the tables using xml_parser_create. The problem is that OpenOffice creates old-school HTML with unclosed <BR> and <HR> tags, omits the doctype, and doesn't quote attributes (e.g. <TABLE WIDTH=4>).

The PHP parsers I know of don't like this and yield XML formatting errors. My current solution is to run some regexes over the file before I parse it, but this is neither nice nor fast.

Do you know a (hopefully included) PHP parser that doesn't care about these kinds of mistakes? Or perhaps a fast way to fix 'broken' HTML?

+1  A: 

There is SimpleHTML

For repairing broken HTML, you could use Tidy.
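
A minimal sketch of what a Tidy repair pass might look like, assuming the optional tidy extension is installed. The sample markup is made up to resemble what the question describes, and the two options shown (output-xhtml, numeric-entities) are one plausible choice, not a tested recipe for OpenOffice output:

```php
<?php
// Sample input mimicking the markup described in the question
// (an assumption, not real OpenOffice output).
$html = '<TABLE WIDTH=4><TR><TD>a<BR>b</TD></TR></TABLE><HR>';

// Ask Tidy for XHTML so empty elements get closed and attributes get
// quoted; numeric entities avoid named entities such as &nbsp; that
// an XML parser does not know.
$config = ['output-xhtml' => true, 'numeric-entities' => true];

// The tidy extension is not always compiled in, so guard the call.
$clean = extension_loaded('tidy')
    ? tidy_repair_string($html, $config, 'utf8')
    : null;

if (is_string($clean)) {
    echo $clean, "\n";
} else {
    echo "tidy extension not available or repair failed\n";
}
```

The repaired string could then be handed to the existing xml_parser_create-based code instead of the regex-fixed file.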

As an alternative you can use the native XMLReader. Because it acts as a cursor moving forward through the document stream and stopping at each node on the way, it will not break on invalid XML documents.

See http://www.ibm.com/developerworks/library/x-pullparsingphp.html
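
To illustrate the cursor model, here is a small sketch that pulls table cells out of a document with XMLReader. The input is a made-up, well-formed fragment; how the reader reacts to unclosed tags is deliberately not shown here and would need to be checked against the real files:

```php
<?php
// Well-formed sample input (an assumption for illustration).
$xml = '<table><tr><td>a</td><td>b</td></tr></table>';

$reader = new XMLReader();
$reader->XML($xml);

$cells = [];
// read() advances the cursor one node at a time, so the whole tree
// never has to be built in memory.
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'td') {
        // readString() returns the text content of the current element.
        $cells[] = $reader->readString();
    }
}
$reader->close();

print_r($cells);
```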

Gordon
+4  A: 

A solution to "fix" broken HTML could be to use HTMLPurifier (quoting):

HTML Purifier is a standards-compliant HTML filter library written in PHP.
HTML Purifier will not only remove all malicious code (better known as XSS) with a thoroughly audited, secure yet permissive whitelist, it will also make sure your documents are standards compliant


An alternative idea might be to try loading your HTML with DOMDocument::loadHTML (quoting):

The function parses the HTML contained in the string source. Unlike loading XML, HTML does not have to be well-formed to load.

And if you're trying to load HTML from a file, see DOMDocument::loadHTMLFile.
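
As a rough sketch of the DOMDocument route: the snippet below loads a made-up fragment with the same defects the question lists (unclosed <BR>/<HR>, unquoted attribute, no doctype), reads the cells, and then re-serializes the tree as XML so that existing SAX handlers could still be used. The sample input and the variable names are assumptions for illustration only:

```php
<?php
// Sample input mimicking the markup described in the question (assumed).
$html = '<TABLE WIDTH=4><TR><TD>a<BR>b</TD><TD>c</TD></TR></TABLE><HR>';

// Collect libxml warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Option 1: read the table directly from the DOM.
$cells = [];
foreach ($doc->getElementsByTagName('td') as $td) {
    $cells[] = $td->textContent;
}

// Option 2: serialize the repaired tree as XML and feed it to the
// SAX-style parser the question already uses.
$xml = $doc->saveXML($doc->documentElement);

$tdCount = 0;
$parser = xml_parser_create();
xml_set_element_handler(
    $parser,
    // Case folding is on by default, so element names arrive upper-cased.
    function ($parser, $name, $attrs) use (&$tdCount) {
        if ($name === 'TD') {
            $tdCount++;
        }
    },
    function ($parser, $name) {
        // No work needed on closing tags in this sketch.
    }
);
$ok = xml_parse($parser, $xml, true);

print_r($cells);
```

The second option keeps the SAX code intact at the cost of building the DOM once, so memory use on large files would need to be measured.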

Pascal MARTIN
+1 for introducing HTMLPurifier. One may look at http://simplehtmldom.sourceforge.net/ too.
takpar
The purifier is nice, but it feels like overkill for this problem. The same goes for the DOM parser. Isn't it true that it will require a lot more time and RAM than a simple SAX parser?
Thomas Ahle
Maybe it will require more RAM, and possibly more time, but it will do more than a simple SAX parse, which would only read the data and not repair it. And I'd say a SAX parser will only be able to read valid XML, while HTMLPurifier and `DOMDocument::loadHTML` are both able to read "broken" HTML.
Pascal MARTIN
Because my errors are always generated by the same engine, and are thus fairly predictable, I've coded the parser using simple regexes. I know about http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#1732454, and I am very thankful to you for pointing me to these two great tools.
Thomas Ahle
If you can "predict" the errors, I guess that's OK :-) You're welcome :-)
Pascal MARTIN
+1  A: 

Any particular reason you're still using the PHP 4 XML API?

If you can get away with using PHP 5's XML API, there are two possibilities.

First, try the built-in HTML parser. It's really not very good (it tends to choke on poorly formatted HTML), but it might do the trick. Have a look at DOMDocument::loadHTML.

Second option - you could try the HTML parser based on the HTML5 parser specification:

http://code.google.com/p/html5lib/

This tends to work better than the built-in PHP HTML parser. It loads the HTML into a DOMDocument object.

BlackAura
I'd rather not use a DOM parser, as the document is quite big. (And I've already written tons of code for the SAX parser.)
Thomas Ahle