ansaurus

Question

Regular Expression to find src from IMG tag.

Answer 1

+7 A:

You don't want a regular expression, you want a parser. From this question:

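using System;
using HtmlAgilityPack;

class Program
{
    static void Main(string[] args)
    {
        // HtmlWeb comes from the Html Agility Pack library.
        var web = new HtmlWeb();
        var doc = web.Load("http://www.stackoverflow.com");

        // SelectNodes returns null when nothing matches the XPath.
        var nodes = doc.DocumentNode.SelectNodes("//img[@src]");
        if (nodes == null) return;

        foreach (var node in nodes)
        {
            // HtmlNode has no src property; read the attribute instead.
            Console.WriteLine(node.GetAttributeValue("src", ""));
        }
    }
}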

Paolo Bergantino 2009-06-11 06:43:18

That depends on the requirements of the person. What if he wants to extract it from user input?

Paulo Santos 2009-06-11 06:45:24

He could still load it into a parser, and even more so if it's from a user. It's been discussed ad nauseam why regular expressions are a bad idea for parsing HTML.

Paolo Bergantino 2009-06-11 06:47:06

Wow, it's a battle of the 'Pa[ou]lo's :-)

paxdiablo 2009-06-11 06:56:35

Answer 2

A:

I have to agree with the parser-crowd on this one. In order of increasing input complexity, the hierarchy I choose from is:

substrings;
regexes; and
parsers.

While regexes can handle much more complicated inputs than simple substring operations, they tend to barf pretty easily when faced with the really hairy input possibilities of free-form markup languages.

XML DOM parsers will be the easiest solution for this problem.

You can use regexes (and they'll work reasonably well if you restrict the input format, such as ensuring img tags don't cross line boundaries and so on), but the simplicity of a parser-based solution will blow regexes out of the water for multi-line, attributes-in-any-order DOM tags.
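To make the "restricted input format" point concrete, here is a minimal Ruby sketch using only the core regex support. The pattern and the names STRICT_IMG_SRC and strict_srcs are made up for illustration; the pattern assumes src is the first attribute, is double-quoted, and that the tag sits on one line.

```ruby
# Hypothetical strict pattern: src must be the first attribute,
# double-quoted, with exactly one space after "<img".
STRICT_IMG_SRC = /<img src="([^"]*)"/

# Returns every src value the strict pattern manages to find.
def strict_srcs(html)
  html.scan(STRICT_IMG_SRC).flatten
end

tidy  = '<p><img src="a.png" alt="A"></p>'
messy = "<p><img alt='B'\n     class=\"pic\"\n     src='b.png'></p>"

p strict_srcs(tidy)   # the restricted format is handled
p strict_srcs(messy)  # reordered, single-quoted, multi-line tag is missed
```

The second call finds nothing even though the markup is perfectly valid, which is the kind of input where a parser stops being optional.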

paxdiablo 2009-06-11 06:53:24

Answer 3

A:

As pointed out, regular expressions are not the perfect solution, but you can usually build one that is good enough for the job. This is what I would use:

string newHtml = Regex.Replace(html,
      @"(?<=<img\s+[^>]*?src=(?<q>['""]))(?<url>.+?)(?=\k<q>)",
      m => "http://www.stackoverflow.com" + m.Value);

It will match src attributes delimited by single or double quotes.

Of course, you would have to change the lambda/delegate to do your own replacing logic, but you get the idea :)
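For comparison, here is a rough Ruby sketch of the same rewrite. Ruby's regex engine does not accept an unbounded lookbehind like the one above, so this version swaps it for a named capture group and puts the captured prefix back in the replacement. BASE, IMG_SRC and absolutize_srcs are illustrative names, not part of the snippet above.

```ruby
# Illustrative base URL, mirroring the prefix used in the C# snippet.
BASE = "http://www.stackoverflow.com"

# head: everything from "<img" up to and including the opening quote.
# q:    the quote character (single or double).
# url:  the attribute value, up to the matching closing quote.
IMG_SRC = /(?<head><img\s+[^>]*?src=(?<q>['"]))(?<url>.+?)(?=\k<q>)/

# Prefixes every quoted img src value with BASE.
def absolutize_srcs(html)
  html.gsub(IMG_SRC) do
    m = Regexp.last_match
    "#{m[:head]}#{BASE}#{m[:url]}"
  end
end

puts absolutize_srcs(%q{<img alt="logo" src='/img/logo.png'>})
# prints: <img alt="logo" src='http://www.stackoverflow.com/img/logo.png'>
```

As with the original, the block passed to gsub is where your own replacement logic would go.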

Philippe Leybaert 2009-06-11 07:14:01

Answer 4

A:

Remember that the source could be generated through JavaScript, so you may not be able to "just" do a regex replacement for img src.

Using Mechanize/Hpricot/Nokogiri in Ruby:

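require 'mechanize'
agent = Mechanize.new  # older Mechanize releases used WWW::Mechanize.new
page  = agent.get('http://www.google.com')
# Select only img tags that have a src, so img['src'] is never nil.
(page/"img[src]").each { |img| puts img['src'] = "http://www.yahoo.com" + img['src'] }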

And you are done!

Ryan Oberoi 2009-06-11 07:35:59

Answer 5

A:

/// <summary>
/// Gets the src from an IMG tag.
/// Assigns proper values to link and name if htmlTd matches the pattern.
/// </summary>
/// <param name="htmlTd">Html containing IMG tag</param>
/// <param name="link">Contains the src contents</param>
/// <param name="name">Contains img element content</param>
/// <returns>true if success, false otherwise</returns>
public static bool TryGetImgDetails(string htmlTd, out string link, out string name)
{
    link = null;
    name = null;

    string pattern = "<img\\s*src\\s*=\\s*(?:\"(?<link>[^\"]*)\"|(?<link>\\S+))\\s*>(?<name>.*)\\s*</img>";

    // Match once, with the same options used for extraction, so the
    // success check and the captured groups cannot disagree.
    Match m = Regex.Match(htmlTd, pattern, RegexOptions.IgnoreCase);
    if (!m.Success)
        return false;

    link = m.Groups["link"].Value;
    name = m.Groups["name"].Value;
    return true;
}

Rashmi Pandit 2009-06-15 08:25:45
