I want to get a URL from a string. Here's my code to extract an img URL.

        var imgReg = new Regex("img\\s*src\\s*=\\s*\"(.*)\"");
        string imgLink = imgReg.Match(page, l, r - l).Groups[1].Value;

The result was

http://url.com/file.png" border="0" alt="

How do I fix this so it ends at the first "? I tried something like

        var imgReg = new Regex("img\\s*src\\s*=\\s*\"(.*[^\\\"])\"");

But I got the same results as the original.

+4  A: 

Try this:

var imgReg = new Regex(@"img\s+src\s*=\s*""([^""']*)""");

Also, note the "\s+" instead of "\s*" after "img". You need at least one space there.

You can also use the non-greedy (or "lazy") version of the star operator, which, instead of matching as much as possible, matches as little as possible and stops, as you would like, at the first ending quote:

var imgReg = new Regex(@"img\s+src\s*=\s*""(.*?)""");

(note the "?" after ".*")
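
To make the difference concrete, here is a minimal sketch that runs the greedy, lazy, and negated-class patterns against one sample tag. The class name `ImgSrcDemo`, the helper `FirstSrc`, and the sample markup are assumptions for illustration; only the URL and the patterns come from this thread.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ImgSrcDemo
{
    // Sample input (assumed): a tag that would produce the output shown in the question.
    public const string Page =
        @"<img src=""http://url.com/file.png"" border=""0"" alt="""" />";

    // Hypothetical helper: returns capture group 1 of the first match,
    // or an empty string when the pattern does not match.
    public static string FirstSrc(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Groups[1].Value : string.Empty;
    }

    public static void Main()
    {
        // Greedy: .* runs to the end of the line, then backs up to the LAST quote.
        Console.WriteLine(FirstSrc(@"img\s+src\s*=\s*""(.*)""", Page));

        // Lazy: .*? grows one character at a time and stops at the FIRST closing quote.
        Console.WriteLine(FirstSrc(@"img\s+src\s*=\s*""(.*?)""", Page));

        // Negated class: [^""]* can never step over a double quote.
        Console.WriteLine(FirstSrc(@"img\s+src\s*=\s*""([^""]*)""", Page));
    }
}
```

The first call should reproduce the over-long result from the question, while the other two should yield only the URL.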

Ludovic
Can I suggest using an @ quoted string, as this will **vastly** improve readability and maintainability wrt slashes in regular expressions: `@"img\s+src\s*=\s*""([^""]*)"""`
sixlettervariables
Hmmm, also it should be noted that it is valid HTML to use a single quote to quote attributes, although if the OP knows this isn't the case they can safely ignore that possibility.
sixlettervariables
Yes, @ strings are better. Good suggestion.
Ludovic
*? is excellent, and I had no idea @"text""" was legal
acidzombie24
A: 

What it looks like to me is that your (.*) is catching the double quotes you don't want to match.

You can write "" inside an @-quoted string to match a double quote, or do something like this for your link matching:

Regex.Match(input, @"http://[\w./-]+\.png");

f0ster
+3  A: 

Please consider using a DOM (such as the Html Agility Pack) to parse HTML rather than using regular expressions. A DOM should handle all edge cases; regular expressions won't.
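
As a rough sketch of what that could look like, the snippet below collects every `src` value with the Html Agility Pack. It assumes the HtmlAgilityPack NuGet package is referenced; the class name `ImgSrcDom` and the method `GetImageSources` are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party NuGet package, assumed to be installed

public static class ImgSrcDom
{
    // Hypothetical helper: collects the src attribute of every <img> element.
    public static List<string> GetImageSources(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var sources = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//img[@src]");
        if (nodes == null)
        {
            return sources;
        }

        foreach (HtmlNode node in nodes)
        {
            // The second argument is the fallback used if the attribute is missing.
            sources.Add(node.GetAttributeValue("src", string.Empty));
        }
        return sources;
    }
}
```

Because the parser reads attributes itself, the quoting style and the attribute order should not matter here.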

TrueWill
+1  A: 

Your .* is too greedy. Change it to the following and it will select everything up to the next double-quote.

Source Text:  <img src="http://url.com/file.png" border="0" alt="" />
              <img src='http://url.com/file.png' border='0' alt='' />

RegEx:        <img\s*src\s*=\s*[\"\']([^\"\']+)[\"\']

I just changed the (.*) to ([^"']+). This means that you'll grab every character that is not a quote, up to the next part of the regex. It also supports single or double quotes.
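
One possible refinement, sketched below in C#, is to capture the opening quote and require the same character to close the value, so a URL containing the other kind of quote is not cut short. This is a variation on the pattern above, not the pattern itself; the names `QuotedSrc`, `Extract`, `q`, and `url` are assumptions for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuotedSrc
{
    // (?<q>["']) remembers the opening quote; \k<q> requires the same one to close.
    private static readonly Regex ImgSrc = new Regex(
        @"<img\s+src\s*=\s*(?<q>[""'])(?<url>.*?)\k<q>",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: returns the src value, or an empty string if there is no match.
    public static string Extract(string tag)
    {
        Match m = ImgSrc.Match(tag);
        return m.Success ? m.Groups["url"].Value : string.Empty;
    }

    public static void Main()
    {
        // Both source lines from above, double-quoted and single-quoted.
        Console.WriteLine(Extract(@"<img src=""http://url.com/file.png"" border=""0"" alt="""" />"));
        Console.WriteLine(Extract("<img src='http://url.com/file.png' border='0' alt='' />"));
    }
}
```

Like the pattern above, this sketch only matches when `src` is the first attribute after `img`.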

EndangeredMassa