I need a regex to do the following (unfortunately it has to be a regex, I can't code this because it's working within a purchased product):

I'd like to select all image tags in a chunk of html where either the image tag does not contain a class attribute, or, if it does contain a class attribute, that attribute does not contain a specific string at the beginning. Basically, I want to strip (by matching) all image tags from a chunk of html EXCEPT for images with a particular class applied to them.

This could be two separate regular expressions - I just want to match them - not extract any data.

So, for example, let's say the class I want to keep is called Pretty.

I'd like the regex to match:

<img src="xx"/>
<img border="x" src="xx"/>
<img whatever other attributes src="xx"/>
<img class="ugly" src="xx"/>
<img whatever other attributes class="fugly" src="xx"/>

but not match

<img class="Pretty" src="xx"/>
<img whatever other attributes class="Pretty" src="xx"/>
<img class="Pretty subpretty" src="xx"/>

If it's easier to do in two regexes (one to match all image tags without a class attribute, and one to match those whose class attribute isn't 'Pretty'), that's totally fine too.

+7  A: 

Use XPath instead, as that's what it's for:

//img[not(contains(@class,'Pretty'))]

This XPath expression selects every img element whose class attribute does not contain the string 'Pretty'. I think it also works for elements that are missing the class attribute.

Parsing XML and HTML with regular expressions is usually a very bad idea. Of course, XPath only works if the HTML in question is well-formed XML (for example, XHTML). If it isn't, you might need to fall back to something else, but even so regex isn't the right tool for the job.

Addendum: I was wrong about getting back to this in 30 minutes. Something came up and I don't have the time to sort it out. If it doesn't work for elements lacking the class attribute, use the following expression:

//img[(not(@class)) or (not(contains(@class,'Pretty')))]
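
To make the selection rule concrete, here is a minimal sketch in Python. It does not run the XPath itself (the standard library's ElementTree only supports a small XPath subset without not() or contains()); it applies the same test by hand. The fragment, the src values and the function name are made up for illustration, and the input is assumed to be well-formed XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed fragment based on the question's examples.
FRAGMENT = """<div>
  <img src="a"/>
  <img border="x" src="b"/>
  <img class="ugly" src="c"/>
  <img class="Pretty" src="d"/>
  <img class="Pretty subpretty" src="e"/>
</div>"""


def images_without_class(root, needle="Pretty"):
    """Return img elements whose class is missing or does not contain needle."""
    # get("class", "") yields "" when the attribute is absent, which mirrors
    # how XPath 1.0 turns an empty node-set into an empty string.
    return [img for img in root.iter("img") if needle not in img.get("class", "")]


root = ET.fromstring(FRAGMENT)
unwanted = [img.get("src") for img in images_without_class(root)]
print(unwanted)  # ['a', 'b', 'c']
```

Like contains() in the expression above, this is a substring test, so a class such as "NotPretty" would also be kept.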
Welbog
+1, except that XPath can also be used on non-valid XML with HTML Agility Pack and similar packages
Dror
In XPath, `not()` is a function, not an operator. You need to add/fix your parentheses. http://www.w3.org/TR/xpath.html#function-not
Ben Blank
@Ben: Thanks for the heads up.
Welbog
It's obviously the right tool for the job, but I'm not sure the OP's random API lets him do this necessarily...
annakata
+1  A: 

A bit quick and dirty, but it works:

/(?!<img\b[^>]+\bclass="?[^>"]*\bPretty\b)<img\b[^>]*>/

How it works:

<img\b[^>]+\bclass="?[^>"]*\bPretty\b matches all "Pretty" images.

<img\b[^>]*> matches all images. So, put the "Pretty" image subpattern in a negative lookahead in front of the subpattern to match all images. This will then match all images, minus those that match the pretty subpattern.
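
For anyone who wants to check the pattern before pasting it into a tool, here is a minimal sketch using Python's re module against the sample tags from the question. Python and the names PATTERN, SHOULD_MATCH, SHOULD_NOT_MATCH and matched_tags are assumptions for illustration; the purchased product's regex flavor may behave differently.

```python
import re

# The pattern from the answer, without the surrounding / delimiters.
PATTERN = re.compile(r'(?!<img\b[^>]+\bclass="?[^>"]*\bPretty\b)<img\b[^>]*>')

# Sample tags taken from the question.
SHOULD_MATCH = [
    '<img src="xx"/>',
    '<img border="x" src="xx"/>',
    '<img whatever other attributes src="xx"/>',
    '<img class="ugly" src="xx"/>',
    '<img whatever other attributes class="fugly" src="xx"/>',
]
SHOULD_NOT_MATCH = [
    '<img class="Pretty" src="xx"/>',
    '<img whatever other attributes class="Pretty" src="xx"/>',
    '<img class="Pretty subpretty" src="xx"/>',
]


def matched_tags(html):
    """Return every img tag in html that the pattern matches."""
    # The pattern has no capturing groups, so findall returns whole matches.
    return PATTERN.findall(html)


html = "\n".join(SHOULD_MATCH + SHOULD_NOT_MATCH)
for tag in matched_tags(html):
    print(tag)
```

As written, the lookahead rejects a tag when Pretty appears as a whole word anywhere in the class value, not only at the beginning, so class="ugly Pretty" is left alone too.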

ʞɔıu
A: 
<img(?:\s+(?:(?!class\b)\w+="[^"]*"|class="(?!Pretty)[^"]*"))*/>

That seems to answer your question, but there are many details you didn't address, like:

  • Are the tag- and attribute names consistently lowercase?

  • What if the class name starts with "pretty" (i.e., is it case sensitive)?

  • Are attribute values always quoted, and always with double-quotes?

  • Will there ever be extra whitespace, like around the "=" or before the final "/>"?

  • Does your "purchased tool" support regexes with negative lookaheads?
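
Several of these questions can be answered by experiment. Below is a minimal sketch, assuming Python's re module stands in for the purchased tool's engine (which may differ); the names PATTERN, CASES and is_stripped are hypothetical. It runs the pattern above against a few well-formed tags and a few that hit the caveats in the list.

```python
import re

# The pattern from this answer, unchanged.
PATTERN = re.compile(
    r'<img(?:\s+(?:(?!class\b)\w+="[^"]*"|class="(?!Pretty)[^"]*"))*/>'
)


def is_stripped(tag):
    """Return True if the whole tag matches, i.e. it would be stripped."""
    return PATTERN.fullmatch(tag) is not None


CASES = [
    '<img src="xx"/>',                           # no class attribute
    '<img class="ugly" src="xx"/>',              # class does not start with Pretty
    '<img class="Pretty subpretty" src="xx"/>',  # class starts with Pretty
    '<img class="pretty" src="xx"/>',            # lowercase: treated as a different class
    '<img src="xx" />',                          # whitespace before the final "/>"
    "<img src='xx'/>",                           # single-quoted attribute value
    '<IMG src="xx"/>',                           # uppercase tag name
]

for tag in CASES:
    print(is_stripped(tag), tag)
```

The last three cases are not matched at all, so such tags would silently survive the stripping; that is the practical cost of the open questions above.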

Alan Moore
A: 

Yes, to all those who suggested I would be better off using something other than regex: you are of course right, but I guess you missed the first sentence in the question.

I ended up finding the solution; nick's and Alan M's answers come closest to it. Thanks, guys! Fortunately I can use negative lookaheads, so it works perfectly :)