ansaurus

Question

What is wrong with this regular expression?

Answer 1

+6 A:

<IMG\s([^"'>]+|'[^']*'|"[^"]*")+>

Taking out a couple of branches, the start and the end:

([^"'>]+)+

How many ways can this match "hello"?

(hell)(o)
(hel)(lo)
(hel)(l)(o)
(he)(llo)
(he)(l)(lo)
(he)(l)(l)(o)
... and so on

Greg 2009-03-10 18:53:39

I had thoughts along this line as well but could not think of how to avoid them and still have the same logic.

woaksie 2009-03-10 19:00:45

Ouch. I didn't notice the first `+`.

Konrad Rudolph 2009-03-10 19:23:56

Answer 2

+1 A:

Sounds like one of the situations where the regex engine is backtracking a lot. Mastering Regular Expressions by Friedl has some good material on the topic.

Conor McDermottroe 2009-03-10 18:54:38

Answer 3

A:

Other commenters have mentioned the complexity being the likely cause for the perfo problem. I'd add that if you're trying to match something resembling an IMG tag, I think you want a regex more like this:

<IMG(\s+[a-z]+=('[^']*'|"[^"]*"|[^\s'">]+))+>

Of course, there are still valid HTML variations that this regex won't catch. Like a closing / (required in xhtml), or whitespace before the closing bracket. And it will pass some invalid cases, like unsupported attribute names.

Kip 2009-03-10 18:58:48

This does not catch the unquoted value which I need to do in this case.

woaksie 2009-03-10 19:07:20

ok i just edited and added the case for unquoted attribute.

Kip 2009-03-10 19:12:00

Thanks, I got this to work. I needed to allow for more white space and the /> but it catches the IMG tags that are well formed and fails on those that are not.It is the = before trying to catch the unquoted values that stops Regex going nuts with branching/backtracking.

woaksie 2009-03-10 20:16:13

Answer 4

A:

I think this is what you were trying for, I think the cause of your long running is as mentioned elsewhere, due to extreme repetition caused by the greedy grab for non-quote or > being or-ed with the string processors (also using greedy ["'>] matching.

This seems to run swiftly with either correctly formatted or incorrectly formatted tags.

<img(\s+((\w+)=(('[^']*?')|("[^"]*?"))))+? />

Lazarus 2009-03-10 19:21:05

Answer 5

A:

Could you post what you are exactly trying to find or extract? Do you want to figure out what the img tag points to? That would greatly increase the chances of being able to provide a better answer.

geoffrobinson 2009-03-10 20:21:15

I just want to extract all IMG tags. I am not actually changing them or parsing the contents of the IMG tag its self.

woaksie 2009-03-10 20:44:06

ansaurus

tags:

views:

answers:

What is wrong with this regular expression?

related questions