Hi

I want to extract the image URLs from any website. I am reading the page source through webRequest, and I need a regular expression that fetches the image URL from that content, i.e. the value of the src attribute in each img tag.

Thanks

+1  A: 
/(?:\"|')[^\\x22*<>|\\\\]+?\.(?:jpg|bmp|gif|png)(?:\"|')/i

is a decent one I have used before. It matches any quoted reference to an image file (jpg, bmp, gif or png) within an HTML document. I didn't strip the " or ' around the match, so you will need to do that yourself.
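For .NET, a minimal sketch of using this pattern and stripping the quotes might look like the following. The pattern is my reading of the expression above with the string-level escaping removed, which is an assumption; `FindImageReferences` and the sample HTML are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: returns every quoted image reference, quotes removed.
static List<string> FindImageReferences(string html)
{
    // Assumed unescaped form of the pattern above, written as a verbatim string.
    var pattern = new Regex(
        @"(?:""|')[^\x22*<>|\\]+?\.(?:jpg|bmp|gif|png)(?:""|')",
        RegexOptions.IgnoreCase);

    var results = new List<string>();
    foreach (Match m in pattern.Matches(html))
    {
        // The match includes the surrounding " or ', so trim them off.
        results.Add(m.Value.Trim('"', '\''));
    }
    return results;
}

// Illustrative input, not taken from a real page.
var sample = "<img src=\"/a/logo.png\"> <a href='pic.JPG'>x</a>";
foreach (var reference in FindImageReferences(sample))
{
    Console.WriteLine(reference);   // prints /a/logo.png, then pic.JPG
}
```

Note that this picks up any quoted string ending in one of those extensions, not only src attributes, so links and script strings pointing at images will be returned as well.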

Antony Carthy
Hi, I have used Dim reg1 As New Regex("(?<=<img[^<]+?src="")[^""]+", RegexOptions.IgnoreCase), but it's not fetching the correct info. Can you suggest something like this, or a correction to the above?
+1  A: 

Try this*:

<img .*?src=["']?([^'">]+)["']?.*?>

Tested here with:

<img class="test" src="/content/img/so/logo.png" alt="logo homepage">

Gives

$1 = /content/img/so/logo.png

The $1 (you have to mouseover the match to see it) corresponds to the part of the regex between (). How you access that value will depend on what implementation of regex you are using.
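Since the question mentions .NET, here is a rough sketch of how the captured value could be read there, where $1 corresponds to `Groups[1]` on the match. `ExtractFirstSrc` is a hypothetical helper name; the pattern and the sample tag are the ones from this answer.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: returns the first captured src value, or "" if no img tag matches.
static string ExtractFirstSrc(string html)
{
    // Same pattern as above; "" inside a verbatim string is one literal quote.
    var pattern = new Regex(@"<img .*?src=[""']?([^'"">]+)[""']?.*?>");

    Match match = pattern.Match(html);
    // Groups[0] is the whole matched tag; Groups[1] is the part between ().
    return match.Success ? match.Groups[1].Value : string.Empty;
}

var tag = "<img class=\"test\" src=\"/content/img/so/logo.png\" alt=\"logo homepage\">";
Console.WriteLine(ExtractFirstSrc(tag));   // prints /content/img/so/logo.png
```

To collect every image on a page rather than the first one, `Regex.Matches` can be looped over in the same way, reading `Groups[1]` from each match.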

*If you want to know how this works, leave a comment

EDIT: As is nearly always the case with regexes, there are edge cases:

<img title="src=hack" src="/content/img/so/logo.png" alt="logo homepage">

Here $1 would be 'hack', because the first src= the regex finds is the one inside the title attribute.
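One possible way to reduce this particular false match, sketched here as an assumption rather than a complete fix, is to require whitespace before src and to stop the search at the end of the tag. The `tightened` pattern and the `FirstGroup` helper are illustrative names, not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

// Pattern from the answer above.
var original = new Regex(@"<img .*?src=[""']?([^'"">]+)[""']?.*?>");

// Assumed variant: src must follow whitespace, and the search cannot run past '>'.
var tightened = new Regex(
    @"<img\b[^>]*?\ssrc\s*=\s*[""']?([^'"">\s]+)",
    RegexOptions.IgnoreCase);

// Hypothetical helper: first capture group of the first match, or "" if none.
static string FirstGroup(Regex regex, string input)
{
    Match m = regex.Match(input);
    return m.Success ? m.Groups[1].Value : string.Empty;
}

var edgeCase = "<img title=\"src=hack\" src=\"/content/img/so/logo.png\" alt=\"logo homepage\">";
Console.WriteLine(FirstGroup(original, edgeCase));    // prints hack
Console.WriteLine(FirstGroup(tightened, edgeCase));   // prints /content/img/so/logo.png
```

This still breaks on inputs such as title="a src=hack" or a > inside an attribute value, so it only narrows the problem.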

Benjol
+4  A: 

I'd recommend using an HTML parser to read the HTML and pull the image tags out of it, as regexes don't mesh well with nested structures like XML and HTML.

In C#: (from this SO question)

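    // Requires the HtmlAgilityPack package and "using HtmlAgilityPack;".
    var web = new HtmlWeb();
    var doc = web.Load("http://www.stackoverflow.com");

    // SelectNodes returns null when nothing matches the XPath.
    var nodes = doc.DocumentNode.SelectNodes("//img[@src]");

    if (nodes != null)
    {
        foreach (var node in nodes)
        {
            // Read the src attribute; fall back to "" if it is missing.
            Console.WriteLine(node.GetAttributeValue("src", ""));
        }
    }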
patjbs
+1 Though Antony Carthy specifically asked for a regex, I agree that a parser is a more suitable approach.
Onots
ooh, +1 for HTML parsers. I've fallen prey to the old "use regexes to parse html" trap many times - that always goes sideways.
Electrons_Ahoy