In my code I convert a styled XLS document to HTML using OpenOffice.
I then parse the tables using xml_parser_create.
The problem is that OpenOffice creates old-school HTML with unclosed <BR> and <HR> tags; it doesn't emit a doctype and doesn't quote attributes, e.g. <TABLE WIDTH=4>.
The PHP parsers I know of don't like this and yield XML formatting errors. My current solution is to run some regexes over the file before I parse it, but this is neither nice nor fast.
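To illustrate the failure, here is a minimal sketch. The $html string is a made-up fragment in the style described above, not my real export, and the exact error message will depend on the underlying XML library:

```php
<?php
// Hypothetical fragment: uppercase tags, an unquoted attribute,
// and an unclosed <BR>, as in the OpenOffice output described above.
$html = '<TABLE WIDTH=4><TR><TD>a<BR>b</TD></TR></TABLE>';

$parser = xml_parser_create();

// The third argument (true) marks this as the final and only chunk of data.
$ok = xml_parse($parser, $html, true);

if (!$ok) {
    // Report why the strict XML parser rejected the input.
    printf(
        "XML error at line %d: %s\n",
        xml_get_current_line_number($parser),
        xml_error_string(xml_get_error_code($parser))
    );
}
```

The parser stops at the first well-formedness violation, so it never reaches the table cells I actually want.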
Do you know of a (hopefully built-in) PHP parser that doesn't care about these kinds of mistakes? Or perhaps a fast way to fix 'broken' HTML?