ansaurus

Question

Regex to select all image html tags conditionally on the src value

Answer 1

+7 A:

Use XPath instead, as that's what it's for:

//img[not(contains(@class,'Pretty'))]

This XPath expression selects every img element whose class attribute does not contain the string 'Pretty'. I think it also works for elements that are missing the class attribute.

Parsing XML and HTML with regular expressions is usually a very bad idea. Of course, XPath only works if the HTML in question is well-formed XML. If it isn't, you might have to fall back on something else, but even so regex isn't the right tool for the job.

Addendum: I was wrong about getting back to this in 30 minutes. Something came up and I don't have the time to sort it out. If it doesn't work for elements lacking the class attribute, use the following expression:

//img[(not(@class)) or (not(contains(@class,'Pretty')))]
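
For readers without an XPath engine at hand, here is a minimal sketch of the same predicate using Python's standard html.parser. This is a swapped-in technique, not XPath itself: as far as I know, the XPath subset in Python's built-in ElementTree has no not() or contains() functions. The names NonPrettyImages and non_pretty_images are made up for this illustration.

```python
from html.parser import HTMLParser


class NonPrettyImages(HTMLParser):
    """Collect <img> tags whose class attribute is absent or lacks 'Pretty'."""

    def __init__(self):
        super().__init__()
        self.matches = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "img":
            return
        attr_map = dict(attrs)
        # A missing class, or a valueless one (None), is treated as "".
        css_class = attr_map.get("class") or ""
        # Case-sensitive substring test, mirroring XPath contains().
        if "Pretty" not in css_class:
            self.matches.append(attr_map)


def non_pretty_images(html):
    """Return the attribute dicts of all selected <img> tags (hypothetical helper)."""
    parser = NonPrettyImages()
    parser.feed(html)
    parser.close()
    return parser.matches


sample = (
    '<p><img src="a.png" class="Pretty big">'
    '<img src="b.png" class="plain">'
    '<img src="c.png"></p>'
)
print([img["src"] for img in non_pretty_images(sample)])
```

Unlike XPath over XML, this tolerates unclosed img tags, which is roughly the point Dror makes below about HTML-aware parsers.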

Welbog 2009-06-03 10:56:43

+1, except that XPath can also be used on invalid XML via HTML Agility Pack and similar packages

Dror 2009-06-03 12:36:34

In XPath, `not()` is a function, not an operator. You need to add/fix your parentheses. http://www.w3.org/TR/xpath.html#function-not

Ben Blank 2009-06-03 13:50:46

@Ben: Thanks for the heads up.

Welbog 2009-06-03 13:55:22

It's obviously the right tool for the job but I'm not sure the OP's random API lets him do this necessarily...

annakata 2009-06-03 14:06:19

Answer 2

+1 A:

A bit quick and dirty, but it works:

/(?!<img\b[^>]+\bclass="?[^>"]*\bPretty\b)<img\b[^>]*>/

How it works:

<img\b[^>]+\bclass="?[^>"]*\bPretty\b matches all "Pretty" images.

<img\b[^>]*> matches all images. So, put the "Pretty" image subpattern in a negative lookahead in front of the subpattern to match all images. This will then match all images, minus those that match the pretty subpattern.
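
As a quick check of the explanation above, here is a small sketch that runs the pattern through Python's re module. The sample markup is invented for the illustration, and the OP's tool may use a different regex flavour, so treat this as a sanity test rather than a guarantee.

```python
import re

# Pattern from the answer above, without the surrounding / delimiters.
NON_PRETTY_IMG = re.compile(
    r'(?!<img\b[^>]+\bclass="?[^>"]*\bPretty\b)'  # reject "Pretty" images
    r'<img\b[^>]*>'                                # then match any image tag
)

# Made-up sample: one "Pretty" image, one with another class, one without class.
sample = (
    '<img src="a.png" class="Pretty">'
    '<img src="b.png" class="plain">'
    '<img src="c.png">'
)

matches = NON_PRETTY_IMG.findall(sample)
print(matches)
# ['<img src="b.png" class="plain">', '<img src="c.png">']
```

The lookahead is tried at the same position where the img tag starts, and [^>] keeps it from reading past the end of that tag, so a "Pretty" class on a later image does not suppress an earlier match.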

ʞɔıu 2009-06-03 13:56:27

Answer 3

A:

<img(?:\s+(?:(?!class\b)\w+="[^"]*"|class="(?!Pretty)[^"]*"))*/>

That seems to answer your question, but there are many details you didn't address, like:

Are the tag- and attribute names consistently lowercase?
What if the class name starts with "pretty" (i.e., is it case sensitive)?
Are attribute values always quoted, and always with double-quotes?
Will there ever be extra whitespace, like around the "=" or before the final "/>"?
Does your "purchased tool" support regexes with negative lookaheads?
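
Several of the questions above can be checked directly. The sketch below runs the pattern against a handful of invented tags in Python's re module (the helper name is_selected is hypothetical, and other regex flavours may differ in details).

```python
import re

# Pattern from the answer above; it expects self-closing tags ending in "/>".
STRICT_IMG = re.compile(
    r'<img(?:\s+(?:(?!class\b)\w+="[^"]*"|class="(?!Pretty)[^"]*"))*/>'
)


def is_selected(tag):
    """True if the whole tag matches the pattern (hypothetical helper)."""
    return STRICT_IMG.fullmatch(tag) is not None


cases = [
    '<img src="b.png" class="plain"/>',   # selected
    '<img src="a.png" class="Pretty"/>',  # rejected: class is "Pretty"
    '<img class="PrettyBig"/>',           # rejected: class merely starts with "Pretty"
    '<img class="big Pretty"/>',          # selected: lookahead only checks the start of the value
    '<img src="c.png" />',                # rejected: whitespace before "/>"
    "<img src='d.png'/>",                 # rejected: single-quoted value
    '<IMG SRC="e.png"/>',                 # rejected: uppercase names
    '<img src="f.png">',                  # rejected: not self-closing
]

for tag in cases:
    print(is_selected(tag), tag)
```

Whether each of these outcomes is acceptable depends on what the real markup looks like, which is exactly what the questions are asking.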

Alan Moore 2009-06-04 03:47:32

Answer 4

A:

Yes, all of you who suggested I would be better off using something other than regex are of course right, but I guess you missed the first sentence in the question.

I ended up finding the solution; nick's and Alan M's answers come closest to it. Thanks, guys! Fortunately I can use negative lookaheads, so it works perfectly :)

2009-06-05 10:08:45
