I have a bunch of legacy documents that are HTML-like. That is, they look like HTML, but have additional made-up tags that aren't part of HTML:

<strong>This is an example of a <pseud-template>fake tag</pseud-template></strong>

I need to parse these files. PHP is the only tool available. The documents don't come close to being well-formed XML.

My original thought was to use the loadHTML methods on PHP's DOMDocument. However, these methods choke on the made-up HTML tags, and will refuse to parse the string/file.

$oDom = new DomDocument();
$oDom->loadHTML("<strong>This is an example of a <pseud-template>fake tag</pseud-template></strong>");
//gives us
DOMDocument::loadHTML() [function.loadHTML]: Tag pseud-template invalid in Entity, line: 1 occured in ....

The only solution I've been able to come up with is to pre-process the files with string replacement functions that will remove the invalid tags and replace them with a valid HTML tag (maybe a span with an id of the tag name).

Is there a more elegant solution? A way to let DOMDocument know about additional tags to consider as valid? Is there a different, robust HTML parsing class/object out there for PHP?

(if it's not obvious, I don't consider regular expressions a valid solution here)

Update: The information in the fake tags is part of the goal here, so something like Tidy isn't an option. Also, I'm after something that does some level, if not all, of the well-formedness cleanup for me, which is why I was looking at DOMDocument's loadHTML method in the first place.

+2  A: 

I wonder if passing the "bad" HTML through HTML Tidy might help as a first pass? It might be worth a look: if you can get the document to be well formed, maybe you could load it as a regular XML file with DomDocument.

Paul Dixon
Apologies, I should have been more specific, part of what I need to parse out of the file is what's found in the fake tags.
Alan Storm
I suggested HTMLTidy as a preprocessing step to try and get you well formed XML, then you can parse it with DomDocument and read the whole DOM, with any luck :)
Paul Dixon
Doesn't tidy strip out bogus markup as well as all the reformatting it does?
Alan Storm
A: 

Have you tried tidyHTML (tidy.sourceforge.net)? I think it's available for PHP and it's a fairly decent parser.

Apologies, I should have been more specific, part of what I need to parse out of the file is what's found in the fake tags.
Alan Storm
+1  A: 

@Twan You don't need a DTD for DOMDocument to parse custom XML. Just use DOMDocument->load(), and as long as the XML is well-formed, it can read it.
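For illustration, here is a minimal sketch of that point, assuming the markup has already been made well-formed. It uses loadXML() on a string rather than load() on a file, and the sample string is the one from the question; the variable names are only for this example.

```php
<?php
// Assumption: the input is already well-formed XML.
$xml = '<strong>This is an example of a <pseud-template>fake tag</pseud-template></strong>';

$doc = new DOMDocument();
$doc->loadXML($xml); // no DTD needed; XML places no restriction on element names

// Collect the text of every made-up element.
$texts = [];
foreach ($doc->getElementsByTagName('pseud-template') as $node) {
    $texts[] = $node->textContent;
}

print_r($texts);
```

The catch is the one described in the question: this only works once the document is well-formed, which the legacy files are not.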

Once you get the files to be well-formed, that's when you can start looking at XML parsers; before that you're S.O.L. Lok Alejo said, you could look at HTML TIDY, but it looks like that's specific to HTML, and I don't know how it would go with your custom elements.

I don't consider regular expressions a valid solution here

Until you've got well-formedness, that might be your only option. Once you get the documents to that stage, then you're in the clear with the DOM functions.

nickf
When you load an HTML file with DOMDocument, it appears to do some level of cleanup re: well-formedness, BUT requires all your tags to be legit HTML tags. I'm looking for something that does the former, but not the latter.
Alan Storm
A: 

Take a look at the Parser in the PHP Fit port. The code is clean and was originally designed for loading the dirty HTML saved by Word. It's configured to pull tables out, but can easily be adapted.

You can see the source here: http://gerd.exit0.net/pat/PHPFIT/PHPFIT-0.1.0/Parser.phps

The unit test will show you how to use it: http://gerd.exit0.net/pat/PHPFIT/PHPFIT-0.1.0/test/parser.phps

Ged Byrne
A: 

My quick and dirty solution to this problem was to run a loop that matches my list of custom tags with a regular expression. The regexp doesn't catch tags that have another inner custom tag inside them.

When there is a match, a function to process that tag is called and returns the "processed HTML". If that custom tag was inside another custom tag, then the parent becomes childless, because actual HTML was inserted in place of the child, and it will be matched by the regexp and processed at the next iteration of the loop.

The loop ends when there are no childless custom tags to be matched. Overall it's iterative (a while loop) and not recursive.
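A rough sketch of what such a loop could look like. This is not the original code: the tag names, the handler functions and expandCustomTags() are all hypothetical, and it assumes the custom tags carry no attributes.

```php
<?php
// Hypothetical custom tags, each mapped to a function that returns "processed HTML".
$handlers = [
    'pseud-template' => function (string $inner): string {
        return '<span class="template">' . $inner . '</span>';
    },
    'pseud-outer' => function (string $inner): string {
        return '<div class="outer">' . $inner . '</div>';
    },
];

function expandCustomTags(string $html, array $handlers): string
{
    $quoted = [];
    foreach (array_keys($handlers) as $name) {
        $quoted[] = preg_quote($name, '~');
    }
    $names = implode('|', $quoted);

    // Matches only "childless" custom tags: the body may not contain
    // the opening tag of any other custom tag.
    $pattern = '~<(' . $names . ')>((?:(?!<(?:' . $names . ')>).)*?)</\1>~s';

    do {
        $result = preg_replace_callback(
            $pattern,
            function (array $m) use ($handlers): string {
                return $handlers[$m[1]]($m[2]);
            },
            $html,
            -1,
            $count
        );
        if ($result === null) {
            throw new RuntimeException('Regex error while expanding custom tags');
        }
        $html = $result;
    } while ($count > 0); // stop once no childless custom tag is left

    return $html;
}

$out = expandCustomTags(
    '<pseud-outer>x <pseud-template>y</pseud-template> z</pseud-outer>',
    $handlers
);
echo $out, "\n";
```

On the sample input the inner tag is replaced on the first pass and the outer one on the second, so nesting is handled without recursion.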

Gilles
A: 

@Alan Storm

Your comment on my other answer got me to thinking:

When you load an HTML file with DOMDocument, it appears to do some level of cleanup re: well-formedness, BUT requires all your tags to be legit HTML tags. I'm looking for something that does the former, but not the latter. (Alan Storm)

Run a regex (sorry!) over the tags, and when it finds one which isn't a valid HTML element, replace it with a valid element that you know doesn't exist in any of the documents (blink comes to mind...), and give it an attribute value with the name of the illegal element, so that you can switch it back afterwards. E.g.:

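$code = str_replace("<pseudo-tag>", "<blink rel=\"pseudo-tag\">", $code);
// and then back again...
// (explicit / delimiters, so that < and > are matched literally)
$code = preg_replace('/<blink rel="(.*?)">/', '<$1>', $code);

Obviously that code is incomplete (it leaves the closing tags alone, for one), but you get the general idea?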
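A slightly fuller sketch of the same idea, under a few assumptions: the list of fake tag names is known up front ($customTags is hypothetical), and a span with a data-tag attribute stands in for the blink/rel pair. Rather than switching the tags back afterwards, it reads them out of the DOM by that attribute.

```php
<?php
$customTags = ['pseud-template']; // hypothetical list of known fake tags
$html = '<strong>This is an example of a <pseud-template>fake tag</pseud-template></strong>';

foreach ($customTags as $tag) {
    $quoted = preg_quote($tag, '~');
    // Opening tags: <pseud-template ...> becomes <span data-tag="pseud-template" ...>
    $html = preg_replace('~<' . $quoted . '(?=[\s/>])~i', '<span data-tag="' . $tag . '"', $html);
    // Closing tags: </pseud-template> becomes </span>
    $html = preg_replace('~</' . $quoted . '\s*>~i', '</span>', $html);
}

$doc = new DOMDocument();
$doc->loadHTML($html); // only real HTML tags are left at this point

// Find the stand-in elements and recover the original tag name from the attribute.
$xpath = new DOMXPath($doc);
$found = [];
foreach ($xpath->query('//span[@data-tag]') as $node) {
    $found[] = [$node->getAttribute('data-tag'), $node->textContent];
}

print_r($found);
```

This still relies on regular expressions for the tag swap, so it shares their usual limits (for example, a fake tag name appearing inside a comment or attribute value would also be rewritten).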

nickf
+1  A: 

You can suppress warnings with libxml_use_internal_errors while loading the document. E.g.:

libxml_use_internal_errors(true);
$doc = new DomDocument();
$doc->loadHTML("<strong>This is an example of a <pseud-template>fake tag</pseud-template></strong>");
libxml_use_internal_errors(false);

If, for some reason, you need access to the warnings, use libxml_get_errors.
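As a sketch of how the two fit together (variable names are only for illustration): collect the parser messages instead of letting them surface as warnings, then read the fake tag straight out of the DOM. Which messages get reported may depend on the libxml2 version in use.

```php
<?php
$html = '<strong>This is an example of a <pseud-template>fake tag</pseud-template></strong>';

// Collect libxml errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

$errors = libxml_get_errors(); // array of LibXMLError objects
libxml_clear_errors();
libxml_use_internal_errors($previous); // restore the earlier setting

foreach ($errors as $error) {
    echo trim($error->message), ' (line ', $error->line, ")\n";
}

// The unknown element is kept in the tree and can be queried like any other.
$fakeTagTexts = [];
foreach ($doc->getElementsByTagName('pseud-template') as $node) {
    $fakeTagTexts[] = $node->textContent;
}

print_r($fakeTagTexts);
```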

troelskn
You should have waited a few weeks, you could have gotten the "correct answer two years later badge!"
Alan Storm
arh .. now why didn't I know that :)
troelskn