Hi all

I'm trying to write a regular expression to match the src, width and height attributes of an img tag. The width and height attributes are optional.

I have come up with the following:

(?:<img.*)(?<=src=")(?<src>([\w\s://?=&.]*)?)?(?:.*)(?<height>(?<=height=")\d*)?(?:.*)(?<width>(?<=width=")(\d*)?)?

Expresso shows this matching only the src part for the following HTML snippet:

<img src="myimage.jpg" height="20" />
<img src="anotherImage.gif" width="30"/>

I'm hoping I'm really close and someone here can point out what I'm doing wrong. I have a feeling it's the optional in-between-characters part, (?:.*); I've tried making it non-greedy, with no success. Any pointers?

+1  A: 

In most regex dialects, .* is "greedy" and will overmatch; use .*? to match "as little as possible" instead.
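
As a minimal sketch of the difference (written in Python rather than the .NET dialect the question uses, and with a made-up two-tag input string), compare what .* and .*? capture:

```python
import re

# Hypothetical input: two img tags on a single line.
html = '<img src="a.jpg" height="20" /><img src="b.gif" width="30"/>'

# Greedy: .* runs to the end of the line, then backs up to the LAST '>'.
greedy = re.search(r'<img.*>', html).group(0)

# Lazy: .*? stops at the FIRST '>' that lets the pattern succeed.
lazy = re.search(r'<img.*?>', html).group(0)

print(greedy)  # both tags, as one match
print(lazy)    # only the first tag
```

Note that laziness alone may not rescue a pattern whose remaining groups are all optional: once .*? has matched nothing, the engine can satisfy every optional group by matching nothing as well.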

Alex Martelli
+9  A: 

Using regular expressions to pull values from HTML is always a mistake. HTML syntax is a lot more complex than it may first appear, and it's very easy for a page to catch out even a very complex regular expression.

Use an HTML Parser instead.

This question has been asked before and will be asked again. Regular Expressions do seem like a good choice for this problem, but they're not.
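
To illustrate the parser approach, here is a rough sketch using Python's standard-library html.parser on the snippet from the question. The class name ImgCollector and the helper collect_images are made up for this example, and a .NET project would use a .NET parser instead.

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collects src, height and width from every img tag it sees."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <img ... />, because the
        # default handle_startendtag delegates to handle_starttag.
        if tag == "img":
            found = dict(attrs)
            self.images.append({
                "src": found.get("src"),
                "height": found.get("height"),  # None when absent
                "width": found.get("width"),    # None when absent
            })


def collect_images(html):
    parser = ImgCollector()
    parser.feed(html)
    parser.close()
    return parser.images


snippet = '''<img src="myimage.jpg" height="20" />
<img src="anotherImage.gif" width="30"/>'''

images = collect_images(snippet)
for image in images:
    print(image)
```

The optional attributes simply come back as None, so there is no need to encode "optional, in any order" in a pattern.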

Dave Webb
It was far easier to use an HTML parser. I used HTMLAgilityPack: so much faster, and it gives you more control. Many thanks.
MJJames
+3  A: 

Regexes are fundamentally bad at parsing HTML (see "Can you provide some examples of why it is hard to parse XML and HTML with a regex?" for why). What you need is an HTML parser. See "Can you provide an example of parsing HTML with your favorite parser?" for examples using a variety of parsers.

Chas. Owens
+1  A: 

I didn't have a chance to test it, but maybe this will work for you (note that I didn't use named matches):

<img(?:(\s*(src|height|width)\s*=\s*"([^"]+)"\s*)+|[^>]+?)*>
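
If a regex really has to be used, one alternative is to split the job in two: first isolate each img tag, then pull the attributes out of that tag. The sketch below is in Python, where a repeated capture group keeps only its last match (unlike .NET, which records every capture), so a single pattern like the one above would not return all three values there. The names TAG, ATTR and img_attributes are invented for this example, and it assumes double-quoted attribute values with no '>' inside them.

```python
import re

# Step 1: one img tag, up to the first '>'.
TAG = re.compile(r'<img\b[^>]*>', re.IGNORECASE)

# Step 2: a wanted attribute with a double-quoted value. The lookbehind keeps
# names such as data-src from being mistaken for src.
ATTR = re.compile(r'(?<![\w-])(src|height|width)\s*=\s*"([^"]*)"', re.IGNORECASE)


def img_attributes(html):
    """Return one dict per img tag, holding only the attributes present."""
    return [
        {name.lower(): value for name, value in ATTR.findall(tag)}
        for tag in TAG.findall(html)
    ]


snippet = '''<img src="myimage.jpg" height="20" />
<img src="anotherImage.gif" width="30"/>'''

result = img_attributes(snippet)
print(result)
```

This still breaks on single-quoted or unquoted values, comments and other real-world HTML, which is the point the other answers make.
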
jakemcgraw