I have a PHP script that removes empty paragraphs from an HTML file. By "empty paragraphs" I mean <p> elements
whose textContent is an empty string.
HTML File with Empty Paragraphs:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<!--
This page is used with remove_empty_paragraphs.php script.
This page contains empty paragraphs. The script removes the empty paragraphs and
writes a new HTML file.
-->
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title></title>
</head>
<body>
<p>This is a paragraph.</p>
<!-- Below is an empty paragraph. -->
<p><span></span></p>
<p>This is another paragraph.</p>
<!-- Below is another empty paragraph. -->
<p class=MsoNormal><b></b></p>
<p style=''></p>
<p>
<span lang=EN-US style='font-size:5.0pt;color:navy;mso-ansi-language:EN-US'></span>
</p>
</body>
</html>
First Attempt:
$html = new DOMDocument("1.0", "UTF-8");
$html->loadHTMLFile("HTML File with Empty Paragraphs.html");
$pars = $html->getElementsByTagName("p");
/* removeChild foreach-loop */
foreach ($pars as $par) {
    if ($par->textContent == "") {
        $par->parentNode->removeChild($par);
    }
}
$html->saveHTMLFile("HTML File WithOut Empty Paragraphs.html");
This succeeds in:
- removing empty paragraphs without the style-attribute,
but fails to:
- remove empty paragraphs with the style-attribute.
So I insert the removeStyleAttribute foreach-loop before the removeChild foreach-loop. (I do not mind removing the style-attributes of nonempty paragraphs.)
Second Attempt:
$html = new DOMDocument("1.0", "UTF-8");
$html->loadHTMLFile("HTML File with Empty Paragraphs.html");
$pars = $html->getElementsByTagName("p");
/* removeStyleAttribute foreach-loop */
foreach ($pars as $par) {
    if ($par->hasAttribute("style")) {
        $par->removeAttribute("style");
    }
}
/* removeChild foreach-loop */
foreach ($pars as $par) {
    if ($par->textContent == "") {
        $par->parentNode->removeChild($par);
    }
}
$html->saveHTMLFile("HTML File WithOut Empty Paragraphs.html");
This succeeds in:
- removing the style-attributes from empty paragraphs that have one,
- removing empty paragraphs that do not have a style-attribute,
but fails to:
- remove those empty paragraphs from which the style-attributes were removed.
So I need two removeChild foreach-loops, one after the other.
Third Attempt:
$html = new DOMDocument("1.0", "UTF-8");
$html->loadHTMLFile("HTML File with Empty Paragraphs.html");
$pars = $html->getElementsByTagName("p");
/* removeStyleAttribute foreach-loop */
foreach ($pars as $par) {
    if ($par->hasAttribute("style")) {
        $par->removeAttribute("style");
    }
}
/* First removeChild foreach-loop */
foreach ($pars as $par) {
    if ($par->textContent == "") {
        $par->parentNode->removeChild($par);
    }
}
/* Second removeChild foreach-loop, identical to the first removeChild foreach-loop */
foreach ($pars as $par) {
    if ($par->textContent == "") {
        $par->parentNode->removeChild($par);
    }
}
$html->saveHTMLFile("HTML File WithOut Empty Paragraphs.html");
This works perfectly, but it is weird to have two identical loops, one right after the other.
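A possible reason the second identical pass is needed, offered as an assumption rather than a confirmed diagnosis: getElementsByTagName returns a live DOMNodeList, so each removeChild shifts the remaining items down by one and the foreach skips whichever paragraph followed the removed one. If that is the cause, the style-attribute is incidental, and a single pass that walks the list backwards should be enough. A minimal sketch, with inline HTML standing in for the input file:

```php
<?php
// Sketch only: a shortened inline document replaces the question's HTML file.
$html = new DOMDocument("1.0", "UTF-8");
$html->loadHTML(
    '<html><body>'
    . '<p>This is a paragraph.</p>'
    . '<p><span></span></p>'
    . '<p class="MsoNormal"><b></b></p>'
    . '<p style=""></p>'
    . '</body></html>'
);

$pars = $html->getElementsByTagName("p");

// Walk the live list from the end: removing item $i only shifts the
// items after it, and those have already been visited.
for ($i = $pars->length - 1; $i >= 0; $i--) {
    $par = $pars->item($i);
    if ($par->textContent == "") {
        $par->parentNode->removeChild($par);
    }
}

// Only the one non-empty paragraph is left after a single pass.
$remainingAfterOnePass = $pars->length; // 1
```

Here the three consecutive empty paragraphs are all removed in one pass, including the one that carries a style attribute.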
I also tried to use only one loop for everything.
Fourth Attempt:
$html = new DOMDocument("1.0", "UTF-8");
$html->loadHTMLFile("HTML File with Empty Paragraphs.html");
$pars = $html->getElementsByTagName("p");
foreach ($pars as $par) {
    if ($par->textContent == "") {
        if ($par->hasAttribute("style")) {
            $par->removeAttribute("style");
        }
        $par->parentNode->removeChild($par);
    }
}
$html->saveHTMLFile("HTML File WithOut Empty Paragraphs.html");
This succeeds in:
- removing empty paragraphs without the style-attribute,
but fails to:
- remove the style-attribute from empty paragraphs that have it,
- remove empty paragraphs with the style-attribute.
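For the single-loop version, one thing worth trying is to iterate over a snapshot of the node list instead of the list itself. This rests on the assumption that the skipped paragraphs are caused by the live DOMNodeList being modified during the foreach, not by the style-attribute. The sketch below uses inline HTML in place of the input file, and the trim() call is my own addition so that whitespace-only paragraphs also count as empty:

```php
<?php
// Sketch only: inline HTML stands in for the question's input file.
$source = <<<'HTML'
<html><body>
<p>This is a paragraph.</p>
<p><span></span></p>
<p>This is another paragraph.</p>
<p class="MsoNormal"><b></b></p>
<p style=""></p>
<p><span lang="EN-US" style="font-size:5.0pt"></span></p>
</body></html>
HTML;

$html = new DOMDocument("1.0", "UTF-8");
$html->loadHTML($source);

// iterator_to_array() copies the live DOMNodeList into a plain array,
// so removals no longer shift the items still waiting to be visited.
$pars = iterator_to_array($html->getElementsByTagName("p"));

foreach ($pars as $par) {
    // Assumption: whitespace-only paragraphs should be treated as empty.
    // Drop trim() to match the original == "" check exactly.
    if (trim($par->textContent) === "") {
        $par->parentNode->removeChild($par);
    }
}

$remaining = $html->getElementsByTagName("p")->length;
echo $remaining, "\n"; // prints 2
```

Removing the style-attribute first is not needed in this version, since the whole element is discarded anyway.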