Hi

I am new to regex, but it seemed like the easiest route to what I need to do. I have a string (in PHP) that contains a whole load of HTML, and I want to remove any tags that have style="display:none".

so for example

<img src="" style="display:none" />

<img src="" style="width:11px;display: none" >

etc...

So far my Regex is:

<img.*style=.*display.*:.*none;.* >

But that seems to leave bits of HTML behind, and it also takes the next element away when used in PHP with preg_replace.

A: 

Because <img> doesn't allow any other elements inside it, this is possible; but in general, regexp is a thoroughly bad tool for parsing a recursively defined language like HTML.

Anyway, the problem you're probably hitting is that the closing > is being matched by one of the .* expressions, and there happens to be a later > on the line to match your explicit > .

If you replace all your .* by [^>]* that will prevent that. (They probably don't all need to be replaced, but you might as well).
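To make the difference concrete, here is a small sketch comparing the pattern from the question with the same pattern after swapping every .* for [^>]*. The sample string and the variable names are made up for illustration; only the two patterns come from the discussion above.

```php
<?php
// Hypothetical input: a hidden image, some text, then a visible image on the same line.
$html = '<img src="a.png" style="display:none;" ><b>kept</b><img src="b.png" >';

// Pattern from the question: the last greedy .* runs on to the final " >" in the string,
// so the match swallows everything up to the end of the second tag.
$greedy = preg_replace('/<img.*style=.*display.*:.*none;.* >/', '', $html);

// Same pattern with every .* replaced by [^>]*, so the match cannot cross a ">".
$bounded = preg_replace('/<img[^>]*style=[^>]*display[^>]*:[^>]*none;[^>]* >/', '', $html);

var_dump($greedy);   // string(0) ""
var_dump($bounded);  // string(29) "<b>kept</b><img src="b.png" >"
```

With the bounded version only the hidden image goes away; the greedy version deletes the whole line, which matches the "takes the next element away" symptom.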

Colin Fine
+1  A: 
$html = preg_replace("/<img[^>]+style[^>]+none[^>]+>/", '', $html);
Anatoly Orlov
thanks works great... no idea how you came up with it but works!
Mark Milford
this will match any IMG elements with any css attribute in style containing the word "none", including `border-style:none;`
Gordon
Gordon: Yes, you're right. It's easy to modify: $html = preg_replace("/<img[^>]+style[^>]*display:\s*none[^>]+>/", '', $html);
Anatoly Orlov
`<img style="width:11px;" title="Use display:none to hide stuff" src="Dont-Parse-Html-With-Regex.jpg"/>`
Amarghosh
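The counter-example in the last comment can be checked directly. The sketch below assumes a pattern of the kind discussed in this thread (with [^>]* between "style" and "display"); the second, stricter pattern is my own illustration, not something proposed by the posters, and it only handles double-quoted style attributes.

```php
<?php
// Visible image from the comment above: "display:none" only appears inside the title text.
$visible = '<img style="width:11px;" title="Use display:none to hide stuff" src="Dont-Parse-Html-With-Regex.jpg"/>';
$hidden  = '<img src="" style="width:11px;display: none" >';

// Pattern of the kind discussed above: anything inside the tag may sit between the parts.
$naive = '/<img[^>]+style[^>]*display:\s*none[^>]*>/';

// Hypothetical stricter variant: "display:none" must sit inside the quoted style value.
$stricter = '/<img[^>]*\sstyle\s*=\s*"[^"]*display\s*:\s*none[^"]*"[^>]*>/';

var_dump(preg_replace($naive, '', $visible) === '');          // bool(true): visible image removed by mistake
var_dump(preg_replace($stricter, '', $visible) === $visible); // bool(true): visible image kept
var_dump(preg_replace($stricter, '', $hidden) === '');        // bool(true): hidden image removed
```

The stricter pattern still breaks on single-quoted or unquoted attributes and on upper-case markup, which is the usual argument for handing the job to a real parser.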
A: 

Your regular expression is way too broad; .* means "match anything", so this would match:

<img src="foo.png" style="something">Some random displayed text : foo none; bar<br>

At the very least, you probably want to exclude closing brackets from your matches, so use [^>]* instead of .*. You also might want to read this, though, and look into using something that actually understands HTML, like DOMDocument.

Michael Mrozek
+3  A: 

Like Michael pointed out, you don't want to use Regex for this purpose. A Regex does not know what an element tag is. <foo> is as meaningful as >foo< unless you teach it the difference. Teaching the difference is incredibly tedious though.

DOM is so much more convenient:

$html = <<< HTML
<img src="" style="display:none" />
<IMG src="" style="width:11px;display: none" >
<img src="" style="width:11px" >
HTML;

The above is our (invalid) markup. We feed it to DOM like this:

$dom = new DOMDocument();
$dom->loadHTML($html);
$dom->normalizeDocument();

Now we query the DOM for all "IMG" elements with a "style" attribute that contains the text "display". We could query for "display: none" in the XPath directly, but our input markup also has occurrences with no space in between:

$xpath = new DOMXPath($dom);
foreach($xpath->query('//img[contains(@style, "display")]') as $node) {
    $style = str_replace(' ', '', $node->getAttribute('style'));
    if(strpos($style, 'display:none') !== FALSE) {
        $node->parentNode->removeChild($node);
    }
}

We iterate over the IMG nodes and remove all whitespace from their style attribute content. Then we check if it contains "display:none" and if so, remove the element from the DOM.

Now we only need to save our HTML:

echo $dom->saveHTML();

gives us:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><img src="" style="width:11px"></body></html>

Screw Regex!


Addendum: you might also be interested in Parsing XML documents with CSS selectors

Gordon
thanks, didn't realize there was a DOM parser built into PHP (although I should have guessed, there is a function for everything else)... your suggestion has worked, even with unusual images...
Mark Milford
Something to note with the above: after testing for some time, it doesn't work if 'display' is written in capitals... use: [contains(translate(@style, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), "display")] for the XPath instead
Mark Milford
@Mark you could also use http://de.php.net/manual/en/domxpath.registerphpfunctions.php and use `strtolower` or `stripos`
Gordon
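For anyone landing here later, this is a rough sketch that folds the case-insensitivity fix from the comments into the DOM approach. The function name removeHiddenImages is made up, the translate() expression is the one from Mark's comment, and the final check uses a case-insensitive preg_match on the style value instead of str_replace plus strpos; treat it as one possible arrangement rather than the definitive version.

```php
<?php
// Hypothetical helper: strips <img> elements whose style attribute hides them.
function removeHiddenImages(string $html): string
{
    $dom = new DOMDocument();

    // Collect libxml warnings about sloppy markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);

    // translate() lower-cases the style value so "DISPLAY" is found as well.
    $query = '//img[contains(translate(@style, "ABCDEFGHIJKLMNOPQRSTUVWXYZ", '
           . '"abcdefghijklmnopqrstuvwxyz"), "display")]';

    foreach ($xpath->query($query) as $node) {
        // Case-insensitive check that tolerates whitespace around the colon.
        if (preg_match('/display\s*:\s*none/i', $node->getAttribute('style'))) {
            $node->parentNode->removeChild($node);
        }
    }

    return $dom->saveHTML();
}

$out = removeHiddenImages(
    '<img src="a.png" STYLE="DISPLAY: NONE"><img src="b.png" style="width:11px">'
);
echo $out;
```

As with the answer above, saveHTML() returns a full document (doctype, html and body wrappers included), so the fragment has to be cut back out if only the original snippet is wanted.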