Hi,

I have a web page. From it I want to find all the IMG tags and get the SRC of each one.

What regular expression would do this?

Some explanation:

I am scraping a web page. All the data is displayed correctly except the images. To solve this, my idea is to find each SRC and replace it, e.g. take

/images/header.jpg

and replace this with

www.stackoverflow.com/images/header.jpg
+7  A: 

You don't want a regular expression, you want a parser. From this question:

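using System;
using HtmlAgilityPack;

class Program
{
    static void Main(string[] args)
    {
        var web = new HtmlWeb();
        var doc = web.Load("http://www.stackoverflow.com");

        // SelectNodes returns null when nothing matches the XPath.
        var nodes = doc.DocumentNode.SelectNodes("//img[@src]");
        if (nodes == null) return;

        foreach (var node in nodes)
        {
            // Read the attribute by name; HtmlNode has no src property.
            Console.WriteLine(node.GetAttributeValue("src", ""));
        }
    }
}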
Paolo Bergantino
That depends on the requirements of the person. What if he wants to extract it from user input?
Paulo Santos
He could still load it into a parser, and even more so if it's from a user. It's been discussed ad nauseam why regular expressions are a bad idea for parsing HTML.
Paolo Bergantino
Wow, it's a battle of the 'Pa[ou]lo's :-)
paxdiablo
A: 

I have to agree with the parser-crowd on this one. In order of increasing input complexity, the hierarchy I choose from is:

  • substrings;
  • regexes; and
  • parsers.

While regexes can handle much more complicated inputs than simple substring operations, they tend to barf pretty easily when faced with the really hairy input possibilities of free-form markup languages.

XML DOM parsers will be the easiest solution for this problem.

You can use regexes (and they'll work reasonably well if you restrict the input format, such as ensuring img tags don't cross line boundaries and so on), but the simplicity of a parser-based solution will blow regexes out of the water for multi-line, attributes-in-any-order DOM tags.
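
To make the point about restricted input concrete, here is a minimal Ruby sketch. The pattern and the two sample strings are made up for illustration (they are not from the question): the regex assumes src comes first and is double-quoted, so it works on the tidy tag and silently misses the one with reordered attributes and single quotes.

```ruby
# A deliberately naive pattern: expects src to be the first attribute
# and to be wrapped in double quotes.
NAIVE_IMG_SRC = /<img\s+src="([^"]*)"/i

# Returns the captured src, or nil when the pattern does not match.
def naive_src(html)
  match = NAIVE_IMG_SRC.match(html)
  match && match[1]
end

# Hypothetical inputs: same image, two equally valid ways of writing the tag.
tidy  = '<img src="/images/header.jpg" alt="Header">'
messy = "<img alt=\"Header\"\n     src='/images/header.jpg'>"

p naive_src(tidy)   # the restricted format is handled
p naive_src(messy)  # reordered attributes and single quotes are missed
```

You can keep widening the pattern to cover each new case, but that is exactly the work a parser already does for you.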

paxdiablo
A: 

As pointed out, regular expressions are not the perfect solution, but you can usually build one that is good enough for the job. This is what I would use:

string newHtml = Regex.Replace(html,
      @"(?<=<img\s+[^>]*?src=(?<q>['""]))(?<url>.+?)(?=\k<q>)",
      m => "http://www.stackoverflow.com" + m.Value);

It will match src attributes delimited by single or double quotes.

Of course, you would have to change the lambda/delegate to do your own replacing logic, but you get the idea :)

Philippe Leybaert
A: 

Remember that the source could be generated through JavaScript, so you may not be able to "just" do a regex replacement for img src.

Using Mechanize/Hpricot/Nokogiri in Ruby:

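require 'mechanize'
agent = Mechanize.new  # was WWW::Mechanize in releases before 1.0
page  = agent.get('http://www.google.com')
# Prefix every img src with a host and print the result
page.search('img').each { |img| puts img['src'] = "http://www.yahoo.com" + img['src'] }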

And you are done!
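
One caveat with plain string concatenation: it goes wrong when a src is already absolute or is relative without a leading slash. A possible alternative, sketched below, is to resolve each src against the page URL with URI.join from Ruby's standard library. The helper name absolutize and the example.com URLs are placeholders for illustration only.

```ruby
require 'uri'

# Resolve an img src against the URL of the page it came from.
# Already-absolute srcs come back unchanged.
def absolutize(page_url, src)
  URI.join(page_url, src).to_s
end

page_url = 'http://www.example.com/articles/index.html'  # hypothetical page

puts absolutize(page_url, '/images/header.jpg')
puts absolutize(page_url, 'thumb.png')
puts absolutize(page_url, 'http://cdn.example.com/a.gif')
```

In the Mechanize loop above, that would mean assigning absolutize(page.uri.to_s, img['src']) instead of prepending a fixed host.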

Ryan Oberoi
A: 
/// <summary>
/// Gets the src from an IMG tag
/// Assigns proper values to link and name, if htmlTd matches the pattern
/// </summary>
/// <param name="htmlTd">Html containing IMG tag</param>
/// <param name="link">Contains the src contents</param>
/// <param name="name">Contains img element content</param>
/// <returns>true if success, false otherwise</returns>
public static bool TryGetImgDetails(string htmlTd, out string link, out string name)
{
    link = null;
    name = null;

    // Only matches an IMG whose sole attribute is src and that has an explicit </img> close.
    string pattern = "<img\\s+src\\s*=\\s*(?:\"(?<link>[^\"]*)\"|(?<link>\\S+))\\s*>(?<name>.*)\\s*</img>";

    // Match once, case-insensitively, and reuse the result for both groups.
    Match m = Regex.Match(htmlTd, pattern, RegexOptions.IgnoreCase);
    if (!m.Success)
        return false;

    link = m.Groups["link"].Value;
    name = m.Groups["name"].Value;
    return true;
}
Rashmi Pandit