Hi

I am having some trouble with this regex:

<img(.+)src="_image/([0-9]*)/(.+)/>

The global and case-insensitive flags are on.

The problem is that it also grabs the "Image n" text (see the string below), but I want it to match only the image tags in the string.

<p>Image 1:<img width="199" src="_image/12/label" alt=""/> Image 2: <img width="199" src="_image/12/label" alt=""/><img width="199" src="_image/12/label" alt=""/></p>

It works if I put a newline before Image n :)

Can anyone point out for me what I am doing wrong?

Thanks in advance bob

A: 

Have you tried lazy (non-greedy) quantifiers? That worked some time back when I tried something similar.

dirkgently
Fundamentally right, but the regex presented has other errors as well.
BenAlabaster
A: 

Use a non-greedy regexp:

<img .*? src="image/(\d+)/(.+?)/.?>

This won't work: the second group will keep reading until the closure of the tag, so it will not hold what he's after.
BenAlabaster
A: 

Regexes are fundamentally bad at parsing HTML (see "Can you provide some examples of why it is hard to parse XML and HTML with a regex?" for why). What you need is an HTML parser. See "Can you provide an example of parsing HTML with your favorite parser?" for examples using a variety of parsers.
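As a rough illustration of the parser route, here is a minimal sketch using Python's standard-library html.parser module. The class name ImgSrcCollector is made up for this example, and the input is the sample string from the question; it is not taken from either of the linked threads.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names before calling this.
        # Self-closing tags such as <img .../> also end up here by default.
        if tag == "img":
            src = dict(attrs).get("src")
            if src is not None:
                self.sources.append(src)


html = ('<p>Image 1:<img width="199" src="_image/12/label" alt=""/> '
        'Image 2: <img width="199" src="_image/12/label" alt=""/>'
        '<img width="199" src="_image/12/label" alt=""/></p>')

collector = ImgSrcCollector()
collector.feed(html)
collector.close()
print(collector.sources)
```

Each collected src value can then be split on "/" to get the directory and file parts, without the pattern having to care about attribute order or spacing.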

Chas. Owens
They're not bad at parsing anything if you use them properly. Sure it *can* be hard if you don't know what you're doing - but hard and bad are two different things.
BenAlabaster
Regular Expressions cannot parse HTML (unless they include some of the extensions Perl has added that make them no longer regular). It isn't a matter of hard, it is flat out impossible. To parse HTML you need recursion and regexes cannot recurse (again unless you are using some of Perl's extensions). Even if you have a dialect that can handle recursion, the regex you need to write to correctly handle HTML is way more complicated than using a parser, so why would you do that to yourself? Of course, you did say "use them properly" and since using them on HTML is wrong, you are still right.
Chas. Owens
By the way, balabaster, the first link is aimed directly at people like you who cling to the thought that a regex can handle HTML or XML, so read it carefully and try to come up with a regex that can handle the cases brought up in it. Then go and read the second link and see how easy it is with a parser.
Chas. Owens
A: 

You're using a greedy quantifier (+) without much restriction. A greedy quantifier tells the regex engine: "Grab every character that qualifies and only back off enough to complete the regex." That means a single match will run from the first "<img" through to the last "/>" on the line, taking the "Image n" text in between along with it.
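The difference can be seen in a small sketch. It uses Python's re module as a stand-in for whatever engine the question uses (an assumption) and runs the question's pattern against the question's sample string, once as written and once with lazy quantifiers.

```python
import re

html = ('<p>Image 1:<img width="199" src="_image/12/label" alt=""/> '
        'Image 2: <img width="199" src="_image/12/label" alt=""/>'
        '<img width="199" src="_image/12/label" alt=""/></p>')

# The pattern from the question: both (.+) groups are greedy.
greedy = re.compile(r'<img(.+)src="_image/([0-9]*)/(.+)/>', re.IGNORECASE)
greedy_matches = [m.group(0) for m in greedy.finditer(html)]

# Same pattern with lazy quantifiers: each group stops as early as it can.
lazy = re.compile(r'<img(.+?)src="_image/([0-9]*)/(.+?)/>', re.IGNORECASE)
lazy_matches = [m.group(0) for m in lazy.finditer(html)]

print(len(greedy_matches))  # 1: one match spanning all three tags
print(len(lazy_matches))    # 3: one match per tag

# Even in the lazy version the last group runs up to "/>",
# so it still contains the trailing attributes.
print(lazy.search(html).group(3))  # label" alt=""
```

The lazy version isolates the tags, but its last group still needs tightening before it holds only the path.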

Axeman
+1  A: 

If I interpret your regex correctly, it looks like you're after the directory name in the first group and the file path in the second group?

<IMG.*?SRC="_image/(\d+?)/([^"]*?)".*?/>

Don't forget to enable the CaseInsensitive regex option, which has the same effect as wrapping the pattern in (?i:[regex]).

In the second group, match everything that is not the closing ". Right now you're matching any character at all, which you don't need to do: you want everything up to, but not including, the closing quote of the string.

Also, don't forget the closing quote of your SRC string, which your regex is missing, and remember that SRC may not be the last attribute in the tag: border, width, height and so on can follow it. There may also be any number of spaces between the last attribute and the closing />.
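A quick way to check these points is to run the pattern over the question's sample string. This is a sketch in Python's re module, which is an assumption about the engine; the pattern here has no slash before _image so that it matches the src values in the sample, and re.IGNORECASE plays the role of the CaseInsensitive option.

```python
import re

html = ('<p>Image 1:<img width="199" src="_image/12/label" alt=""/> '
        'Image 2: <img width="199" src="_image/12/label" alt=""/>'
        '<img width="199" src="_image/12/label" alt=""/></p>')

# [^"]*? can never run past the closing quote of the src value,
# and the trailing .*? allows further attributes before "/>".
pattern = re.compile(r'<IMG.*?SRC="_image/(\d+?)/([^"]*?)".*?/>', re.IGNORECASE)

pairs = pattern.findall(html)
print(pairs)  # [('12', 'label'), ('12', 'label'), ('12', 'label')]
```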

From this regex, your first match group will hold the subdirectory name and the second match group will hold everything after the / of the subdirectory - including nested subdirectories. If you've got nested subdirectories, you may need to expand this slightly:

<IMG.*?SRC="_image/((\d+?)/)+?([^"]*?)".*?/>

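In this case, the repeated group matches the nested (numeric) directory names and the last group holds the rest of the path. Two caveats apply. Most regex engines keep only the last repetition of a repeated capture group (.NET exposes every repetition through the group's Captures collection). And because the repetition is lazy (+?), it stops after the first directory whenever the rest of the pattern can already match, so any deeper directories end up in the last group.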
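How the repeated group behaves depends on the engine, so it is worth testing. The sketch below uses Python's re module (an assumption about the engine) and a hypothetical tag with a nested path, _image/12/34/label, which does not appear in the question. It compares the lazy repetition with a greedy variant, then shows a simpler fallback that captures the whole path and splits it.

```python
import re

# Hypothetical tag with a nested directory, for illustration only.
tag = '<img width="199" src="_image/12/34/label" alt=""/>'

# Lazy repetition: stops after one directory, the rest lands in the last group.
lazy = re.search(r'<IMG.*?SRC="_image/((\d+?)/)+?([^"]*?)".*?/>', tag, re.IGNORECASE)
print(lazy.groups())    # ('12/', '12', '34/label')

# Greedy repetition: consumes both directories, but the group keeps only the last.
greedy = re.search(r'<IMG.*?SRC="_image/((\d+)/)+([^"]*)".*?/>', tag, re.IGNORECASE)
print(greedy.groups())  # ('34/', '34', 'label')

# Fallback: capture everything after "_image/" and split it afterwards.
whole = re.search(r'<IMG.*?SRC="_image/([^"]*)"', tag, re.IGNORECASE)
parts = whole.group(1).split("/")
print(parts)            # ['12', '34', 'label']
```

Splitting after the match keeps every path segment regardless of how the engine treats repeated groups.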

BenAlabaster