Possibly a lame question, but I've yet to find an answer to it.

Currently I use .Net WebBrowser.Document.Images() to do this.

It requires the WebBrowser to load the document. It's messy and takes up resources.

According to this question, XPath is better than a regex at this.

Anyone know how to do this in C#?

Thanks

A: 

If it's valid XHTML, you could do this:

XmlDocument doc = new XmlDocument();
doc.LoadXml(html);
XmlNodeList results = doc.SelectNodes("//img/@src");
Khoth
Good luck loading 90% of the html pages out there into an XmlDocument :)
axel_c
Already tried this. HTML is not valid XML, and thus LoadXml throws an exception.
Roberto Bonini
A: 

I'd use a regular expression like: "<[iI][mM][gG].?[sS][rR][cC]=\"(.?)\".*?>"

Then group 1 will be the URL. This also works for non-xhtml pages.

GavinCattell
There's a small error in your regex: you forgot the * before the first and second ?
rslite
+1  A: 

If all you need is images, I would just use a regular expression. Something like this should do the trick:

Regex rg = new Regex(@"<img.*?src=""(.*?)""", RegexOptions.IgnoreCase);
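
For illustration, here's a minimal, self-contained sketch of how that regex could be used to collect the src values (the sample markup and class name are made up for the example):

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ImgSrcExtractor
{
    // Returns the src value of every <img> tag found in the HTML.
    // Caveat: like any regex approach, this misses single-quoted or
    // unquoted attribute values.
    public static List<string> ExtractImgSrcs(string html)
    {
        List<string> srcs = new List<string>();
        Regex rg = new Regex(@"<img.*?src=""(.*?)""", RegexOptions.IgnoreCase);
        foreach (Match m in rg.Matches(html))
            srcs.Add(m.Groups[1].Value);
        return srcs;
    }

    public static void Main()
    {
        // Made-up sample markup, just to exercise the pattern.
        string html = @"<p><img src=""/a.png"" alt=""a""><IMG alt=""b"" SRC=""/b.jpg""></p>";
        foreach (string src in ExtractImgSrcs(html))
            Console.WriteLine(src); // prints /a.png then /b.jpg
    }
}
```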
rslite
+11  A: 

If your input string is valid XHTML you can treat it as XML, load it into an XmlDocument, and do XPath magic :) But that's not always the case.

Otherwise you can try this function, which will return all image links from the HTML source:

public List<Uri> FetchLinksFromSource(string htmlSource)
{
    List<Uri> links = new List<Uri>();
    string regexImgSrc = @"<img[^>]*?src\s*=\s*[""']?([^'"" >]+?)[ '""][^>]*?>";
    MatchCollection matchesImgSrc = Regex.Matches(htmlSource, regexImgSrc, RegexOptions.IgnoreCase | RegexOptions.Singleline);
    foreach (Match m in matchesImgSrc)
    {
        string href = m.Groups[1].Value;
        links.Add(new Uri(href));
    }
    return links;
}

And you can use it like this :

HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.example.com");
request.Credentials = System.Net.CredentialCache.DefaultCredentials;
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
if (response.StatusCode == HttpStatusCode.OK)
{
    using (StreamReader sr = new StreamReader(response.GetResponseStream()))
    {
        List<Uri> links = FetchLinksFromSource( sr.ReadToEnd() );
    }
}
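
One thing to watch: if the page contains a relative src such as /images/logo.png, the new Uri(href) call in the function above throws a UriFormatException. A small sketch of one way around that (the base address here is just the example URL from the snippet, nothing mandated):

```csharp
using System;

public static class UriHelper
{
    // Resolve a possibly-relative href against the page it was scraped from.
    // Absolute hrefs pass through unchanged.
    public static Uri Resolve(Uri pageUri, string href)
    {
        return new Uri(pageUri, href);
    }

    public static void Main()
    {
        Uri page = new Uri("http://www.example.com"); // same URL as the request above
        Console.WriteLine(Resolve(page, "/images/logo.png"));
        // http://www.example.com/images/logo.png
        Console.WriteLine(Resolve(page, "http://cdn.example.org/a.png"));
        // http://cdn.example.org/a.png
    }
}
```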
mathieu
+3  A: 

The big issue with any HTML parsing is the "well formed" part. You've seen the crap HTML out there - how much of it is really well formed? I needed to do something similar - parse out all links in a document and (in my case) update them with a rewritten link. I found the Html Agility Pack over on CodePlex. It rocks (and handles malformed HTML).

Here's a snippet for iterating over links in a document:

HtmlDocument doc = new HtmlDocument();
doc.Load(@"C:\Sample.HTM");
HtmlNodeCollection linkNodes = doc.DocumentNode.SelectNodes("//a[@href]");

// Run only if there are links in the document.
if (linkNodes != null)
{
    foreach (HtmlNode linkNode in linkNodes)
    {
        HtmlAttribute attrib = linkNode.Attributes["href"];
        // Do whatever else you need here
    }
}
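
Since the question asks about images rather than anchors, the same approach can be adapted. A sketch, assuming the HtmlAgilityPack package is referenced (the file path is the same placeholder as above):

```csharp
using System;
using HtmlAgilityPack;

public static class ImageLister
{
    public static void Main()
    {
        HtmlDocument doc = new HtmlDocument();
        doc.Load(@"C:\Sample.HTM"); // placeholder path

        // Select only <img> elements that actually carry a src attribute.
        HtmlNodeCollection imgNodes = doc.DocumentNode.SelectNodes("//img[@src]");
        if (imgNodes != null)
        {
            foreach (HtmlNode imgNode in imgNodes)
                Console.WriteLine(imgNode.GetAttributeValue("src", string.Empty));
        }
    }
}
```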

Original Blog Post

Paul Mrozowski