ansaurus

Question

Help with PHP Regular Expressions using a Negative Look Behind

Answer 1

A:

The problem has been that I cannot seem to match solely the one open td with the missing </th> preceding it; instead, it seems to match several of the open td tags.

Sounds like you want the 'non-greedy' or 'lazy' match expressions. Use '*?' and '+?' instead of '*' and '+', and it will grab as few characters as it can to get a match, rather than as many as it can.
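To illustrate the difference, here is a minimal sketch on a made-up row (the string in $row is an assumption for the demo, not the asker's data). Note that a lazy quantifier only shortens where a match ends; it does not change where the match starts, which may be why it is not enough on its own for the missing </th> case.

```php
<?php
// Illustrative input only: two adjacent cells on one line.
$row = '<td>197.2</td><td>94</td>';

// Greedy: ".*" grabs as much as possible, so the match runs to the LAST </td>.
preg_match('@<td>.*</td>@', $row, $greedy);

// Lazy: ".*?" grabs as little as possible, so the match stops at the FIRST </td>.
preg_match('@<td>.*?</td>@', $row, $lazy);

echo $greedy[0], "\n"; // <td>197.2</td><td>94</td>
echo $lazy[0], "\n";   // <td>197.2</td>
```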

Tim Sylvester 2009-08-03 22:50:11

Thanks Alan. I tried adding a ? in the appropriate places, but it didn't seem to make a difference.

John 2009-08-04 14:43:19

Answer 2

+2 A:

Hi,

Writing my comment to your question, I was thinking "there's definitely got to be another solution that doesn't involve some kind of regex that will become impossible to maintain"...

Maybe I've found a way ; take a look at DOMDocument::loadHTML and DOMDocument::saveHTML.

The manual of the first one states (quoting) :

Unlike loading XML, HTML does not have to be well-formed to load.

And the manual of the second one says :

Creates an HTML document from the DOM representation.

Trying those with the non-valid-HTML string you provided gives this example :

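<?php
// Invalid HTML fragment from the question: the <th> is never closed.
$str = <<<STRING
<tr>
<th class="ProfileIndent0">
<p>Global pharmaceuticals</p>
<td>197.2</td>
<td>94</td>
</tr>
STRING;

// loadHTML() accepts malformed markup; saveHTML() serializes the repaired DOM.
$doc = new DOMDocument();
$doc->loadHTML($str);
echo $doc->saveHTML();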

And, when running it (from the command-line, to avoid any trouble with escaping HTML to get it displayed properly), I get :

$ php ./temp.php
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><tr>
<th class="ProfileIndent0">
<p>Global pharmaceuticals</p>
</th>
<td>197.2</td>
<td>94</td>
</tr></body></html>

Which, re-formatted, gives :

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
    "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
    <body>
        <tr>
            <th class="ProfileIndent0">
                <p>Global pharmaceuticals</p>
            </th>
            <td>197.2</td>
            <td>94</td>
        </tr>
    </body>
</html>

Not perfect yet, I admit (it did not add any <table> tags, for instance), but, at least, the tags are now closed as they should...

There might be some problems with the DOCTYPE and <html> tags ; you might not want those... Take a look at some comments under the manual page : they might help you ;-)
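One possible way to drop the DOCTYPE and the <html>/<body> wrapper is to serialize only the children of <body>. The sketch below assumes PHP 5.3.6 or later (where saveHTML() accepts a node argument); the helper name repair_fragment is made up for this example and is not part of the DOM extension.

```php
<?php
// Hypothetical helper: load a broken fragment, let the parser repair it,
// then return only what ended up inside <body>.
function repair_fragment(string $fragment): string
{
    $doc = new DOMDocument();
    $doc->loadHTML($fragment);

    // The parser wraps the fragment in <html><body>...</body></html>.
    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // Serialize each child of <body> on its own, skipping the wrapper.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$str = '<tr><th class="ProfileIndent0"><p>Global pharmaceuticals</p><td>197.2</td><td>94</td></tr>';
$fixed = repair_fragment($str);
echo $fixed, "\n";
```

The exact markup that comes back depends on the libxml version PHP is built against, so it is worth checking the result on your own data.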

EDIT after a bit more thought :

Your "full" example generates some warnings ; maybe you can tidy your "HTML" a bit before feeding it to loadHTML...

Warning: DOMDocument::loadHTML(): Tag co_text invalid in Entity, 
    line: 1 in /home/squale/developpement/tests/temp/temp.php on line 18
Warning: DOMDocument::loadHTML(): Tag text_data invalid in Entity, 
    line: 2 in /home/squale/developpement/tests/temp/temp.php on line 18
Warning: DOMDocument::loadHTML(): htmlParseStartTag: invalid element name in Entity, 
    line: 2 in /home/squale/developpement/tests/temp/temp.php on line 18
Warning: DOMDocument::loadHTML(): Unexpected end tag : table in Entity, 
    line: 10 in /home/squale/developpement/tests/temp/temp.php on line 18

At worst, you could mask those errors, either by calling the error_reporting function before and after loadHTML, or by using the @ operator...
I wouldn't generally recommend those, however : using those should be in extreme cases -- maybe this one ^^
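A third option, not mentioned above, is libxml_use_internal_errors(): it makes libxml collect the parse errors instead of emitting PHP warnings, so they can still be inspected or logged. A minimal sketch, with a shortened stand-in for the asker's markup (the $html string is an assumption for the demo):

```php
<?php
// Shortened stand-in for the "full" example; co_text is a non-HTML tag.
$html = '<co_text><tr><th>Total<td>210.1</td></tr></co_text>';

// Collect libxml errors internally instead of raising PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// Fetch whatever the parser complained about, then reset the error buffer
// and restore the previous error-handling mode.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Log the collected problems rather than hiding them completely.
foreach ($errors as $error) {
    error_log(sprintf('line %d: %s', $error->line, trim($error->message)));
}

$result = $doc->saveHTML();
echo $result;
```

Which errors are reported, if any, depends on the libxml version, so the list may differ from the warnings shown above.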

Still, the result is not looking too bad, actually :

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
    "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<body>
    <co_text text_type_id="6">
        <text_data>
            <tr>
                <th class="TableHead" colspan="21">2008 Sales</th> 
            </tr>
            <tr>
                <th class="ProfileIndent0"></th> 
                <th class="ProfileHead">$ mil.</th> 
                <th class="ProfileHead">% of total</th> 
            </tr>
            <tr>
                <th class="ProfileIndent0"> <p>Global pharmaceuticals</p> </th>
                <td>197.2</td> 
                <td>94</td> 
            </tr>
            <tr>
                <th class="ProfileIndent0">Impax pharmaceuticals</th> 
                <td>12.9</td> 
                <td>6</td> 
            </tr>
            <tr>
                <th class="ProfileTotal">Total</th> 
                <td class="ProfileDataTotal">210.1</td> 
                <td class="ProfileDataTotal">100</td> 
            </tr>
            <h3>Selected Generic Products</h3>
            <ul class="prodoplist">
                <li>Anagrelide hydrochloride (generic Agrylin, thrombocytosis)</li>
                <li>Bupropion hydr ochloride (generic Wellbutrin SR, depression)</li>
                <li>Colestipol hydrochloride (generic Colestid, high cholesterol)</li>
                <li>Dantrolene sodium (generic  Dantrium, spasticity)</li>
                <li>Metformin Hcl (generic Glucophage XR, diabetes)</li>
                <li>Nadolol/Bendroflumethiazide (generic Corzide, hypertension)</li>
                <li>Oxybutynin chloride (generic Ditropan XL, urinary incontinence, with Teva)</li>
                <li>Oxycodone hydrochloride (generic OxyContin controlled release,  pain)</li>
                <li>Pilocarpine hydrochlorine (generic Salagen, dry mouth caused by radiation therapy)</li>
            </ul>
        ]]>
        </text_data>
    </co_text>
</body>
</html>

To conclude, as others already suggested, a real HTML tidier/purifier might be able to help ;-)

Pascal MARTIN 2009-08-03 22:51:01

+1 - I would be inclined to tidy the output with a proper tidier, rather than reinvent the wheel on a regex expression that would be, as someone else mentioned, difficult to maintain.

EvilChookie 2009-08-03 22:53:32

Thanks very much Pascal. I had previously attempted running the string through the PHP Tidy functions, but strangely Tidy wrongly tries to close the th tag by incorrectly wrapping the entire row with the th tag all the way down past the unordered list.

John 2009-08-04 14:53:47

Answer 3

A:

You might also be able to use something like HTMLTidy or HTML Purifier to automatically fix your HTML.

Tim Sylvester 2009-08-03 22:53:38

Answer 4

A:

This regex is working for me:

$text = preg_replace('@<th([^>]*)>(.*<\/td>)(<\/th>)?@','<th$1>$2</th>',$text);

Note that it works for single-line rows only. I mean, it works for:

<tr><th><td>some</td></tr>

but not for:

<tr><th>
<td>some</td>
</tr>

I really don't know how to make it work with the "s" modifier. If someone could explain it to me, I would appreciate it.

Here is my example:

<?php
$html = '<CO_TEXT text_type_id="6">
        <TEXT_DATA><![CDATA[<table class="ProfileChart"> <tr> <th class="TableHead" colspan="21">2008 Sales</th> </tr>

<tr> <th class="ProfileIndent0"></th> <th class="ProfileHead">$ mil.</th> <th class="ProfileHead">% of total</th> </tr>

<tr> <th class="ProfileIndent0"> <p>Global pharmaceuticals</p> <td>197.2</td> <td>94</td> </tr>

<tr> <th class="ProfileIndent0">Impax pharmaceuticals</th> <td>12.9</td> <td>6</td> </tr>

<tr> <th class="ProfileTotal">Total</th> <td class="ProfileDataTotal">210.1</td> <td class="ProfileDataTotal">100</td> </tr> </table><h3>Selected Generic Products</h3><ul class="prodoplist"><li>Anagrelide hydrochloride (generic Agrylin, thrombocytosis)</li><li>Bupropion hydr ochloride (generic Wellbutrin SR, depression)</li><li>Colestipol hydrochloride (generic Colestid, high cholesterol)</li><li>Dantrolene sodium (generic  Dantrium, spasticity)</li><li>Metformin Hcl (generic Glucophage XR, diabetes)</li><li>Nadolol/Bendroflumethiazide (generic Corzide, hypertension)</li
><li>Oxybutynin chloride (generic Ditropan XL, urinary incontinence, with Teva)</li><li>Oxycodone hydrochloride (generic OxyContin controlled release,  pain)</li><li>Pilocarpine hydrochlorine (generic Salagen, dry mouth caused by radiation therapy)</li></ul>]]></TEXT_DATA> </CO_TEXT>';

$text = preg_replace('@<th([^>]*)>(.*<\/td>)(<\/th>)?@s','<th$1>$2</th>',$html);
echo $text;
?>

output:

<CO_TEXT text_type_id="6">
        <TEXT_DATA><![CDATA[<table class="ProfileChart"> <tr> <th class="TableHead" colspan="21">2008 Sales</th> </tr>

<tr> <th class="ProfileIndent0"></th> <th class="ProfileHead">$ mil.</th> <th class="ProfileHead">% of total</th> </tr>

<tr> <th class="ProfileIndent0"> <p>Global pharmaceuticals</p> <td>197.2</td> <td>94</td> </tr>

<tr> <th class="ProfileIndent0">Impax pharmaceuticals</th> <td>12.9</td> <td>6</td> </tr>

<tr> <th class="ProfileTotal">Total</th> <td class="ProfileDataTotal">210.1</td> <td class="ProfileDataTotal">100</td></th> </tr> </table><h3>Selected Generic Products</h3><ul class="prodoplist"><li>Anagrelide hydrochloride (generic Agrylin, thrombocytosis)</li><li>Bupropion hydr ochloride (generic Wellbutrin SR, depression)</li><li>Colestipol hydrochloride (generic Colestid, high cholesterol)</li><li>Dantrolene sodium (generic  Dantrium, spasticity)</li><li>Metformin Hcl (generic Glucophage XR, diabetes)</li><li>Nadolol/Bendroflumethiazide (generic Corzide, hypertension)</li
><li>Oxybutynin chloride (generic Ditropan XL, urinary incontinence, with Teva)</li><li>Oxycodone hydrochloride (generic OxyContin controlled release,  pain)</li><li>Pilocarpine hydrochlorine (generic Salagen, dry mouth caused by radiation therapy)</li></ul>]]></TEXT_DATA> </CO_TEXT>
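
On the "s" modifier: it makes "." match newlines as well, so the greedy ".*" runs from the first <th to the last </td> in the whole string, which is why the output above gains a single </th> after the final cell. One possible alternative is sketched below. It uses negative lookaheads rather than a look-behind, and the function name close_unclosed_th is made up for this example. It only targets the pattern in this question (a <th> followed by a <td> with no </th> in between) and is not a general HTML repair.

```php
<?php
// Hypothetical helper, not from the original answer.
function close_unclosed_th(string $html): string
{
    // <th\b[^>]*>      an opening <th ...> tag
    // (?:(?!...).)*?    any characters, provided none of them starts
    //                   </th>, <th, <td or </tr>
    // (?=<td\b)         succeed only if the next thing is a <td
    // i = case-insensitive, s = "." also matches newlines
    $pattern = '~(<th\b[^>]*>(?:(?!</th>|<t[dh]\b|</tr>).)*?)(?=<td\b)~is';

    // preg_replace() returns null on failure; fall back to the input then.
    return preg_replace($pattern, '$1</th>', $html) ?? $html;
}

// The multi-line row that the original pattern could not handle.
$multiLine = "<tr><th>\n<td>some</td>\n</tr>";
$fixedRow = close_unclosed_th($multiLine);
echo $fixedRow, "\n";

// A row that is already well-formed should come back unchanged.
$closed = '<tr><th>Total</th><td>210.1</td></tr>';
$sameRow = close_unclosed_th($closed);
echo $sameRow, "\n";
```

Because the lookaheads stop at </th>, a header cell that is already closed never reaches the <td check, so it is left alone.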

inakiabt 2009-08-04 18:44:56
