Hello,

I'm trying to write a simple function to close missing HTML tags using PHP preg_replace.

I thought it would be relatively straightforward, but for some reason it hasn't been.

What I'm basically trying to do is close a missing tag in the following row:

<tr>
<th class="ProfileIndent0">
<p>Global pharmaceuticals</p>
<td>197.2</td>
<td>94</td>
</tr>

The approach I've been taking is to use a negative lookbehind to find opening td tags that follow an opening th tag but are not preceded by its closing th tag.

For example :

$text = preg_replace('!<th(\s\S*){0,1}?>(.*)((?<!<\/th>)[\s]*<td>)!U','<th$1>$2</th>',$text);

I've written the regular expression pattern countless different ways, to no avail. The problem has been that I cannot seem to match on solely the one open td with the missing /th preceding it - but rather it seems to match on several of the open td tags.

Here's the complete input text:

<CO_TEXT text_type_id="6">
        <TEXT_DATA><![CDATA[<table class="ProfileChart"> <tr> <th class="TableHead" colspan="21">2008 Sales</th> </tr>

<tr> <th class="ProfileIndent0"></th> <th class="ProfileHead">$ mil.</th> <th class="ProfileHead">% of total</th> </tr>

<tr> <th class="ProfileIndent0"> <p>Global pharmaceuticals</p> <td>197.2</td> <td>94</td> </tr>

<tr> <th class="ProfileIndent0">Impax pharmaceuticals</th> <td>12.9</td> <td>6</td> </tr>

<tr> <th class="ProfileTotal">Total</th> <td class="ProfileDataTotal">210.1</td> <td class="ProfileDataTotal">100</td> </tr> </table><h3>Selected Generic Products</h3><ul class="prodoplist"><li>Anagrelide hydrochloride (generic Agrylin, thrombocytosis)</li><li>Bupropion hydr ochloride (generic Wellbutrin SR, depression)</li><li>Colestipol hydrochloride (generic Colestid, high cholesterol)</li><li>Dantrolene sodium (generic  Dantrium, spasticity)</li><li>Metformin Hcl (generic Glucophage XR, diabetes)</li><li>Nadolol/Bendroflumethiazide (generic Corzide, hypertension)</li
><li>Oxybutynin chloride (generic Ditropan XL, urinary incontinence, with Teva)</li><li>Oxycodone hydrochloride (generic OxyContin controlled release,  pain)</li><li>Pilocarpine hydrochlorine (generic Salagen, dry mouth caused by radiation therapy)</li></ul>]]></TEXT_DATA> </CO_TEXT>

Is there something going on with negative lookbehinds in PHP that I'm not aware of, or have I just not hit on the right matching pattern?

Any help would be much appreciated.

Thanks, John

A: 

The problem has been that I cannot seem to match on solely the one open td with the missing </th> preceding it - but rather it seems to match on several of the open td tags.

Sounds like you want the 'non-greedy' or 'lazy' match expressions. Use '*?' and '+?' instead of '*' and '+', and it will grab as few characters as it can to get a match, rather than as many as it can.
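
One caveat with the pattern in the question: it ends in the U (PCRE_UNGREEDY) modifier, which already makes every quantifier lazy by default, so adding ? under U turns that quantifier back to greedy. A minimal sketch of that behaviour; the $row string and the variable names are made up for illustration:

```php
<?php
// Hypothetical sample: two cells on one line.
$row = '<td>197.2</td> <td>94</td>';

// Default: ".*" is greedy and runs to the last </td>.
preg_match('!<td>.*</td>!', $row, $greedy);

// ".*?" is lazy and stops at the first </td>.
preg_match('!<td>.*?</td>!', $row, $lazy);

// With the U modifier the meaning is swapped: ".*?" is greedy again.
preg_match('!<td>.*?</td>!U', $row, $swapped);

echo $greedy[0], "\n";  // <td>197.2</td> <td>94</td>
echo $lazy[0], "\n";    // <td>197.2</td>
echo $swapped[0], "\n"; // <td>197.2</td> <td>94</td>
```

So when experimenting with ? it is probably easiest to drop the U modifier first, otherwise each added ? has the opposite effect.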

Tim Sylvester
Thanks Alan. I tried adding a ? in the appropriate places, but it didn't seem to make a difference.
John
+2  A: 

Hi,

While writing my comment on your question, I was thinking "there's definitely got to be another solution that doesn't involve some kind of regex that will become impossible to maintain"...

Maybe I've found a way; take a look at DOMDocument::loadHTML and DOMDocument::saveHTML.

The manual of the first one states (quoting):

Unlike loading XML, HTML does not have to be well-formed to load.

And the manual of the second one says:

Creates an HTML document from the DOM representation.


Trying those with the invalid HTML string you provided gives this example:

$str = <<<STRING
<tr>
<th class="ProfileIndent0">
<p>Global pharmaceuticals</p>
<td>197.2</td>
<td>94</td>
</tr>
STRING;

$doc = new DOMDocument();
$doc->loadHTML($str);
echo $doc->saveHTML();

And, when running it (from the command line, to avoid any trouble with escaping HTML to get it displayed properly), I get:

$ php ./temp.php
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><tr>
<th class="ProfileIndent0">
<p>Global pharmaceuticals</p>
</th>
<td>197.2</td>
<td>94</td>
</tr></body></html>

Which, re-formatted, gives:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" 
    "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
    <body>
        <tr>
            <th class="ProfileIndent0">
                <p>Global pharmaceuticals</p>
            </th>
            <td>197.2</td>
            <td>94</td>
        </tr>
    </body>
</html>

Not perfect yet, I admit (it did not add any <table> tags, for instance), but, at least, the tags are now closed as they should be...

There might be some problems with the DOCTYPE and <html> tags; you might not want those... Take a look at some of the comments under the manual page: they might help you ;-)
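
One possible way to drop the DOCTYPE and the <html>/<body> wrapper is to serialize only the children of <body>. This is a minimal sketch, assuming a PHP version where saveHTML() accepts a node argument (5.3.6 or later); the $fragment variable is made up for illustration:

```php
<?php
$str = <<<STRING
<tr>
<th class="ProfileIndent0">
<p>Global pharmaceuticals</p>
<td>197.2</td>
<td>94</td>
</tr>
STRING;

$doc = new DOMDocument();
$doc->loadHTML($str);

// The parser wraps the input in <html><body>; take the <body> element.
$body = $doc->getElementsByTagName('body')->item(0);

// Serialize each child of <body> on its own, so the DOCTYPE, <html>
// and <body> wrappers never reach the output.
$fragment = '';
foreach ($body->childNodes as $node) {
    $fragment .= $doc->saveHTML($node);
}

echo $fragment;
```

The repaired row should come back without the extra wrapper, though the exact whitespace may differ between libxml versions.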



EDIT after a bit more thought :

Your "full" example generates some warnings; maybe you can tidy your "HTML" a bit before feeding it to loadHTML...

Warning: DOMDocument::loadHTML(): Tag co_text invalid in Entity, 
    line: 1 in /home/squale/developpement/tests/temp/temp.php on line 18
Warning: DOMDocument::loadHTML(): Tag text_data invalid in Entity, 
    line: 2 in /home/squale/developpement/tests/temp/temp.php on line 18
Warning: DOMDocument::loadHTML(): htmlParseStartTag: invalid element name in Entity, 
    line: 2 in /home/squale/developpement/tests/temp/temp.php on line 18
Warning: DOMDocument::loadHTML(): Unexpected end tag : table in Entity, 
    line: 10 in /home/squale/developpement/tests/temp/temp.php on line 18

At worst, you could mask those errors, either by calling the error_reporting function before and after loadHTML, or by using the @ operator...
I wouldn't generally recommend either, however: they should be reserved for extreme cases -- maybe this one ^^
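
A gentler alternative to those two options (not something I used above, so treat it as a suggestion) is libxml_use_internal_errors(), which collects the parser's complaints instead of emitting PHP warnings. A minimal sketch; the shortened $html sample and the $messages array are made up for illustration:

```php
<?php
// Hypothetical shortened input with a non-HTML wrapper tag.
$html = '<co_text><tr><th>Global<td>197.2</td></tr></co_text>';

// Collect libxml errors internally instead of raising PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// Each entry is a LibXMLError object; keep the messages for inspection.
$messages = array();
foreach (libxml_get_errors() as $error) {
    $messages[] = trim($error->message) . ' (line ' . $error->line . ')';
}

// Empty the error buffer and restore the previous setting.
libxml_clear_errors();
libxml_use_internal_errors($previous);

$output = $doc->saveHTML();
echo $output;
```

That way nothing is hidden globally, and $messages can still be logged if you want to see what the parser disliked.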

Still, the result is not looking too bad, actually:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" 
    "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<body>
    <co_text text_type_id="6">
        <text_data>
            <tr>
                <th class="TableHead" colspan="21">2008 Sales</th> 
            </tr>
            <tr>
                <th class="ProfileIndent0"></th> 
                <th class="ProfileHead">$ mil.</th> 
                <th class="ProfileHead">% of total</th> 
            </tr>
            <tr>
                <th class="ProfileIndent0"> <p>Global pharmaceuticals</p> </th>
                <td>197.2</td> 
                <td>94</td> 
            </tr>
            <tr>
                <th class="ProfileIndent0">Impax pharmaceuticals</th> 
                <td>12.9</td> 
                <td>6</td> 
            </tr>
            <tr>
                <th class="ProfileTotal">Total</th> 
                <td class="ProfileDataTotal">210.1</td> 
                <td class="ProfileDataTotal">100</td> 
            </tr>
            <h3>Selected Generic Products</h3>
            <ul class="prodoplist">
                <li>Anagrelide hydrochloride (generic Agrylin, thrombocytosis)</li>
                <li>Bupropion hydr ochloride (generic Wellbutrin SR, depression)</li>
                <li>Colestipol hydrochloride (generic Colestid, high cholesterol)</li>
                <li>Dantrolene sodium (generic  Dantrium, spasticity)</li>
                <li>Metformin Hcl (generic Glucophage XR, diabetes)</li>
                <li>Nadolol/Bendroflumethiazide (generic Corzide, hypertension)</li>
                <li>Oxybutynin chloride (generic Ditropan XL, urinary incontinence, with Teva)</li>
                <li>Oxycodone hydrochloride (generic OxyContin controlled release,  pain)</li>
                <li>Pilocarpine hydrochlorine (generic Salagen, dry mouth caused by radiation therapy)</li>
            </ul>
        ]]&gt;
        </text_data>
    </co_text>
</body>
</html>


To conclude, as others already suggested, a real HTML tidier/purifier might be able to help ;-)

Pascal MARTIN
+1 - I would be inclined to tidy the output with a proper tidier, rather than reinvent the wheel on a regex expression that would be, as someone else mentioned, difficult to maintain.
EvilChookie
Thanks very much Pascal. I had previously attempted running the string through the PHP Tidy functions, but strangely Tidy wrongly tries to close the th tag by incorrectly wrapping the entire row with the th tag all the way down past the unordered list.
John
A: 

You might also be able to use something like HTMLTidy or HTML Purifier to automatically fix your HTML.

Tim Sylvester
A: 

This regex is working for me:

$text = preg_replace('@<th([^>]*)>(.*<\/td>)(<\/th>)?@','<th$1>$2</th>',$text);

Note that it works for single-line rows only. That is, it works for:

<tr><th><td>some</td></tr>

but not for:

<tr><th>
<td>some</td>
</tr>

I really don't know how to make it work with the "s" modifier. If someone could explain it to me, I'd appreciate it.
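
As far as I can tell, the "s" modifier (PCRE_DOTALL) lets . match newlines, so a greedy .* then runs on to the last </td> in the whole string instead of the last one on the line. A minimal sketch comparing the variants, with a lazy .*? as one possible adjustment; the two-row $rows sample and the variable names are made up for illustration:

```php
<?php
// Hypothetical sample: two multi-line rows, each with an unclosed <th>.
$rows = "<tr><th>\n<td>a</td>\n</tr>\n<tr><th>\n<td>b</td>\n</tr>";
$replacement = '<th$1>$2</th>';

// No "s": "." stops at newlines, so nothing matches across lines.
$plain = preg_replace('@<th([^>]*)>(.*</td>)(</th>)?@',
    $replacement, $rows, -1, $plainCount);

// "s" with greedy ".*": one match from the first <th> to the LAST </td>.
$greedy = preg_replace('@<th([^>]*)>(.*</td>)(</th>)?@s',
    $replacement, $rows, -1, $greedyCount);

// "s" with lazy ".*?": each match stops at the first </td> after its <th>.
$lazy = preg_replace('@<th([^>]*)>(.*?</td>)(</th>)?@s',
    $replacement, $rows, -1, $lazyCount);

echo $plainCount, ' ', $greedyCount, ' ', $lazyCount, "\n"; // 0 1 2
echo $lazy, "\n";
```

Note that this only shows what the modifier does: a row whose <th> is already closed before the first <td> would still gain a stray </th> from this pattern.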

Here is my example:

<?php
$html = '<CO_TEXT text_type_id="6">
        <TEXT_DATA><![CDATA[<table class="ProfileChart"> <tr> <th class="TableHead" colspan="21">2008 Sales</th> </tr>

<tr> <th class="ProfileIndent0"></th> <th class="ProfileHead">$ mil.</th> <th class="ProfileHead">% of total</th> </tr>

<tr> <th class="ProfileIndent0"> <p>Global pharmaceuticals</p> <td>197.2</td> <td>94</td> </tr>

<tr> <th class="ProfileIndent0">Impax pharmaceuticals</th> <td>12.9</td> <td>6</td> </tr>

<tr> <th class="ProfileTotal">Total</th> <td class="ProfileDataTotal">210.1</td> <td class="ProfileDataTotal">100</td> </tr> </table><h3>Selected Generic Products</h3><ul class="prodoplist"><li>Anagrelide hydrochloride (generic Agrylin, thrombocytosis)</li><li>Bupropion hydr ochloride (generic Wellbutrin SR, depression)</li><li>Colestipol hydrochloride (generic Colestid, high cholesterol)</li><li>Dantrolene sodium (generic  Dantrium, spasticity)</li><li>Metformin Hcl (generic Glucophage XR, diabetes)</li><li>Nadolol/Bendroflumethiazide (generic Corzide, hypertension)</li
><li>Oxybutynin chloride (generic Ditropan XL, urinary incontinence, with Teva)</li><li>Oxycodone hydrochloride (generic OxyContin controlled release,  pain)</li><li>Pilocarpine hydrochlorine (generic Salagen, dry mouth caused by radiation therapy)</li></ul>]]></TEXT_DATA> </CO_TEXT>';

$text = preg_replace('@<th([^>]*)>(.*<\/td>)(<\/th>)?@s','<th$1>$2</th>',$html);
echo $text;
?>

output:

<CO_TEXT text_type_id="6">
        <TEXT_DATA><![CDATA[<table class="ProfileChart"> <tr> <th class="TableHead" colspan="21">2008 Sales</th> </tr>

<tr> <th class="ProfileIndent0"></th> <th class="ProfileHead">$ mil.</th> <th class="ProfileHead">% of total</th> </tr>

<tr> <th class="ProfileIndent0"> <p>Global pharmaceuticals</p> <td>197.2</td> <td>94</td> </tr>

<tr> <th class="ProfileIndent0">Impax pharmaceuticals</th> <td>12.9</td> <td>6</td> </tr>

<tr> <th class="ProfileTotal">Total</th> <td class="ProfileDataTotal">210.1</td> <td class="ProfileDataTotal">100</td></th> </tr> </table><h3>Selected Generic Products</h3><ul class="prodoplist"><li>Anagrelide hydrochloride (generic Agrylin, thrombocytosis)</li><li>Bupropion hydr ochloride (generic Wellbutrin SR, depression)</li><li>Colestipol hydrochloride (generic Colestid, high cholesterol)</li><li>Dantrolene sodium (generic  Dantrium, spasticity)</li><li>Metformin Hcl (generic Glucophage XR, diabetes)</li><li>Nadolol/Bendroflumethiazide (generic Corzide, hypertension)</li
><li>Oxybutynin chloride (generic Ditropan XL, urinary incontinence, with Teva)</li><li>Oxycodone hydrochloride (generic OxyContin controlled release,  pain)</li><li>Pilocarpine hydrochlorine (generic Salagen, dry mouth caused by radiation therapy)</li></ul>]]></TEXT_DATA> </CO_TEXT>
inakiabt