I have a PHP script that removes empty paragraphs from an HTML file. By empty paragraphs I mean <p> elements whose textContent is empty.

HTML File with Empty Paragraphs:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<!--
This page is used with remove_empty_paragraphs.php script.
This page contains empty paragraphs. The script removes the empty paragraphs and
writes a new HTML file.
-->
<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
        <title></title>
    </head>
    <body>
        <p>This is a paragraph.</p>
        <!-- Below is an empty paragraph. -->
        <p><span></span></p>
        <p>This is another paragraph.</p>
        <!-- Below is another empty paragraph. -->
        <p class=MsoNormal><b></b></p>
        <p style=''></p>
        <p>
            <span lang=EN-US style='font-size:5.0pt;color:navy;mso-ansi-language:EN-US'></span>
        </p>
    </body>
</html>

First Attempt:

$html = new DOMDocument("1.0", "UTF-8");
$html->loadHTMLFile("HTML File with Empty Paragraphs.html");
$pars = $html->getElementsByTagName("p");

/* removeChild foreach-loop */
foreach ($pars as $par) {
    if ($par->textContent == "") {
        $par->parentNode->removeChild($par);
    }
}

$html->saveHTMLFile("HTML File WithOut Empty Paragraphs.html");

This manages to:

  • remove empty paragraphs without the style-attribute,

but fails to:

  • remove empty paragraphs with the style-attribute.

So I inserted a removeStyleAttribute foreach-loop before the removeChild foreach-loop. (I do not mind removing the style-attributes of nonempty paragraphs.)

Second Attempt:

$html = new DOMDocument("1.0", "UTF-8");
$html->loadHTMLFile("HTML File with Empty Paragraphs.html");
$pars = $html->getElementsByTagName("p");    

/* removeStyleAttribute foreach-loop */
foreach ($pars as $par) {
    if ($par->hasAttribute("style")) {
        $par->removeAttribute("style");
    }
}

/* removeChild foreach-loop */
foreach ($pars as $par) {
    if ($par->textContent == "") {
        $par->parentNode->removeChild($par);
    }
}

$html->saveHTMLFile("HTML File WithOut Empty Paragraphs.html");

This succeeds in:

  • removing the style-attributes from empty paragraphs which have the style attribute.
  • removing empty paragraphs that do not have the style-attributes.

But fails to:

  • remove those empty paragraphs from which the style-attributes were removed.

So I have to have two removeChild foreach-loops, one after the other.

Third Attempt:

$html = new DOMDocument("1.0", "UTF-8");
$html->loadHTMLFile("HTML File with Empty Paragraphs.html");
$pars = $html->getElementsByTagName("p");

/* removeStyleAttribute foreach-loop */
foreach ($pars as $par) {
    if ($par->hasAttribute("style")) {
        $par->removeAttribute("style");
    }
}

/* First removeChild foreach-loop */
foreach ($pars as $par) {
    if ($par->textContent == "") {
        $par->parentNode->removeChild($par);
    }
}

/* Second removeChild foreach-loop, identical to the first removeChild foreach-loop */
foreach ($pars as $par) {
    if ($par->textContent == "") {
        $par->parentNode->removeChild($par);
    }
}

$html->saveHTMLFile("HTML File WithOut Empty Paragraphs.html");

This works perfectly, but it is odd to have two identical loops, one right after the other.

I also tried to use only one loop for everything.

Fourth Attempt:

$html = new DOMDocument("1.0", "UTF-8");
$html->loadHTMLFile("HTML File with Empty Paragraphs.html");
$pars = $html->getElementsByTagName("p");  

foreach ($pars as $par) {
    if ($par->textContent == "") {
        if ($par->hasAttribute("style")){
            $par->removeAttribute("style");
        }
        $par->parentNode->removeChild($par);
    }
}

$html->saveHTMLFile("HTML File WithOut Empty Paragraphs.html");

This manages to:

  • remove empty paragraphs without the style-attribute,

but fails to:

  • remove the style-attribute from empty paragraphs that have it.
  • remove empty paragraphs with the style attribute.

A:

As Tomalak says, it might have something to do with whitespace. Try disabling "preserveWhiteSpace":

$html->preserveWhiteSpace = false;

Hmm, I'm new here. How do I post this as a comment rather than as an answer?

Birk
@Birk, none of the paragraphs had any whitespace. I am using a small test file I made in NetBeans.
Geoffrey Van Wyk

A:

The list returned by getElementsByTagName is live: removing a node from the document also removes it from the list. foreach does not know the list changed, so it moves on to the next index, which now holds the item two places past the one just removed because the remaining items shifted up. As a result, some of the <p> elements were simply skipped.
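
To see the live behaviour in isolation, here is a minimal sketch. The three-paragraph markup is made up for illustration and is not the file from the question; it only shows that the same list object reports a smaller length right after one removal.

```php
<?php
// Sketch only: shows that the node list shrinks as soon as a node is removed.
libxml_use_internal_errors(true); // keep any parser warnings quiet

$doc = new DOMDocument();
// Hypothetical sample markup: two empty paragraphs and one with text.
$doc->loadHTML('<html><body><p></p><p></p><p>text</p></body></html>');
$pars = $doc->getElementsByTagName('p');

$lengthBefore = $pars->length;           // 3 paragraphs in the sample markup

$first = $pars->item(0);
$first->parentNode->removeChild($first); // remove the first empty <p>

$lengthAfter = $pars->length;            // the same list object now reports 2

echo $lengthBefore, ' -> ', $lengthAfter, "\n"; // prints "3 -> 2"
```

After the removal, index 0 holds what used to be the second paragraph, which is exactly the item a foreach would step over.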

Solution: use a for loop (with $pars->item(X) and $pars->length) instead of a foreach, but only increment if a node was not deleted. (Or always increment and backtrack if one was deleted.)
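
A rough sketch of that loop, assuming the same DOMDocument setup as in the question. The function name removeEmptyParagraphs and the sample markup are made up for illustration; the emptiness test is the plain textContent comparison from the question, without trim().

```php
<?php
// Hypothetical helper: removes <p> elements whose textContent is empty.
// The index only advances when nothing was removed, because a removal
// shifts the next item of the live list into the current index.
function removeEmptyParagraphs(DOMDocument $doc): int
{
    $pars = $doc->getElementsByTagName('p');
    $removed = 0;

    for ($i = 0; $i < $pars->length; ) {
        $par = $pars->item($i);
        if ($par->textContent === '') {
            $par->parentNode->removeChild($par);
            $removed++;      // stay on the same index
        } else {
            $i++;            // nothing removed, move on
        }
    }

    return $removed;
}

libxml_use_internal_errors(true); // keep any parser warnings quiet

$doc = new DOMDocument();
// Hypothetical sample markup with three empty paragraphs in a row.
$doc->loadHTML(
    '<html><body><p>a</p><p></p><p><span></span></p><p style=""></p><p>b</p></body></html>'
);

$removed = removeEmptyParagraphs($doc);
$remaining = $doc->getElementsByTagName('p')->length;

echo "removed: $removed, remaining: $remaining\n";
```

Because the loop re-reads $pars->length on every pass, consecutive empty paragraphs are all visited, and a single pass replaces the two identical foreach loops of the third attempt.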

Separately: the last <p> (the one with the long <span>) wasn't deleted because of the whitespace around the <span>. Pass textContent through trim() before comparing it to get rid of that whitespace.
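
A small sketch of the whitespace issue. The paragraph is built by hand here so that the whitespace text nodes are explicit; it mirrors the structure of the last <p> in the question but is not parsed from that file. The helper name isBlankParagraph is made up for illustration.

```php
<?php
// Hypothetical helper: treats a paragraph as empty when its text
// consists of nothing but whitespace that trim() strips by default.
function isBlankParagraph(DOMElement $par): bool
{
    return trim($par->textContent) === '';
}

$doc = new DOMDocument();

// <p>, newline and indentation, <span></span>, newline, </p>
$par = $doc->createElement('p');
$par->appendChild($doc->createTextNode("\n    "));
$par->appendChild($doc->createElement('span'));
$par->appendChild($doc->createTextNode("\n"));
$doc->appendChild($par);

$plainCheck = ($par->textContent === ''); // false: the text is "\n    \n"
$trimCheck  = isBlankParagraph($par);     // true: only whitespace is left

var_dump($plainCheck, $trimCheck);
```

Swapping the plain comparison for the trimmed one in the removal loop is enough to catch paragraphs like this.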

See also my reply at http://forums.devnetwork.net/viewtopic.php?f=1&t=121114&p=623974.

tasairis