tags:

views:

213

answers:

5

This regular expression:

<IMG\s([^"'>]+|'[^']*'|"[^"]*")+>

seems to process endlessly when given this text

<img src=http://www.blahblahblah.com/houses/Images/
    single_and_multi/roof/feb09/01_img_trrnjks_vol2009.jpg' />

I would expect it to - not find a match (quickly) - because there is only one single quote in the text. I have had this happen in C# and also using the Expresso regex tool. If the text is a lot shorter it seems to work.

John.

+6  A: 
<IMG\s([^"'>]+|'[^']*'|"[^"]*")+>

Taking out a couple of branches, the start and the end:

([^"'>]+)+

How many ways can this match "hello"?

(hell)(o)
(hel)(lo)
(hel)(l)(o)
(he)(llo)
(he)(l)(lo)
(he)(l)(l)(o)
... and so on
Greg
I had thoughts along this line as well but could not think of how to avoid them and still have the same logic.
woaksie
Ouch. I didn't notice the first `+`.
Konrad Rudolph
+1  A: 

Sounds like one of the situations where the regex engine is backtracking a lot. Mastering Regular Expressions by Friedl has some good material on the topic.

Conor McDermottroe
A: 

Other commenters have mentioned the complexity being the likely cause for the perfo problem. I'd add that if you're trying to match something resembling an IMG tag, I think you want a regex more like this:

<IMG(\s+[a-z]+=('[^']*'|"[^"]*"|[^\s'">]+))+>

Of course, there are still valid HTML variations that this regex won't catch. Like a closing / (required in xhtml), or whitespace before the closing bracket. And it will pass some invalid cases, like unsupported attribute names.

Kip
This does not catch the unquoted value which I need to do in this case.
woaksie
ok i just edited and added the case for unquoted attribute.
Kip
Thanks, I got this to work. I needed to allow for more white space and the /> but it catches the IMG tags that are well formed and fails on those that are not.It is the = before trying to catch the unquoted values that stops Regex going nuts with branching/backtracking.
woaksie
A: 

I think this is what you were trying for, I think the cause of your long running is as mentioned elsewhere, due to extreme repetition caused by the greedy grab for non-quote or > being or-ed with the string processors (also using greedy ["'>] matching.

This seems to run swiftly with either correctly formatted or incorrectly formatted tags.

<img(\s+((\w+)=(('[^']*?')|("[^"]*?"))))+? />
Lazarus
A: 

Could you post what you are exactly trying to find or extract? Do you want to figure out what the img tag points to? That would greatly increase the chances of being able to provide a better answer.

geoffrobinson
I just want to extract all IMG tags. I am not actually changing them or parsing the contents of the IMG tag its self.
woaksie