I need to verify that a specific hyperlink exists on a given web page. I know how to download the source HTML. What I need help with is determining whether a "target" URL exists as a hyperlink in the "source" web page.

Here is a little console program to demonstrate the problem:

using System;
using System.Net;

public static class Program
{
    public static void Main()
    {
        var sourceUrl = "http://developer.yahoo.com/search/web/V1/webSearch.html";
        var targetUrl = "http://developer.yahoo.com/ypatterns/";
        Console.WriteLine("Source contains link to target? Answer = {0}",
                          SourceContainsLinkToTarget(sourceUrl, targetUrl));
        Console.ReadKey();
    }

    private static bool SourceContainsLinkToTarget(string sourceUrl, string targetUrl)
    {
        string content;
        // Download the raw HTML of the source page.
        using (var wc = new WebClient())
            content = wc.DownloadString(sourceUrl);
        // Plain substring match: also true when the URL appears only as text.
        return content.Contains(targetUrl); // Need to ensure this is in the href of an <a> tag!
    }
}

Notice the comment on the last line. I can see whether the target URL exists in the HTML of the source URL, but I need to verify that the URL is inside the href attribute of an <a> tag. This way I can validate that it's actually a hyperlink, instead of just text.

I'm hoping someone will have a kick-ass regular expression or something I can use.

Thanks!


Here is the solution using the HtmlAgilityPack:

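    // Requires: using HtmlAgilityPack;
    private static bool SourceContainsLinkToTarget(string sourceUrl, string targetUrl)
    {
        var doc = (new HtmlWeb()).Load(sourceUrl);
        // SelectNodes returns null (not an empty collection) when nothing matches.
        var links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links == null)
            return false;
        foreach (var link in links)
            if (link.GetAttributeValue("href", string.Empty).Equals(targetUrl))
                return true;
        return false;
    }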
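
One caveat with comparing the href to the target as an exact string: a page may link to the target with a relative href (for example "../../../ypatterns/"), which would not match. Below is a minimal sketch of one way to handle that, assuming the href values have already been pulled out of the page (e.g. by the HtmlAgilityPack loop above). The names LinkMatcher and ContainsLinkTo are hypothetical, made up for illustration; only System.Uri from the framework is used.

```csharp
using System;
using System.Collections.Generic;

public static class LinkMatcher
{
    // Hypothetical helper (not part of HtmlAgilityPack): returns true if any
    // href, once resolved against the source page's URL, equals the target URL.
    public static bool ContainsLinkTo(string sourceUrl, IEnumerable<string> hrefs, string targetUrl)
    {
        var baseUri = new Uri(sourceUrl);
        var target = new Uri(targetUrl);
        foreach (var href in hrefs)
        {
            Uri resolved;
            // TryCreate resolves a relative href against baseUri;
            // an href that is already absolute is used as-is.
            if (Uri.TryCreate(baseUri, href, out resolved) && resolved == target)
                return true;
        }
        return false;
    }
}
```

Inside the HtmlAgilityPack version, the same idea would mean resolving each href against sourceUrl before comparing, rather than calling Equals on the raw attribute value. Whether that looser matching is wanted depends on how strict the check needs to be.
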
+2  A: 

The best way is to use a web scraping library with a built-in DOM parser, which builds an object tree out of the HTML and lets you search it programmatically for the link element you are looking for. There are many available, for example Beautiful Soup (Python), scrapi (Ruby), or Mechanize (Perl). For .NET, try the HTML Agility Pack: http://htmlagilitypack.codeplex.com/

Joshua
@Joshua The HTML Agility Pack looks like it was designed to do exactly what I need. The example usage specifically mentions getting hyperlink URLs. Thanks!
Paul Fryer
FYI the HtmlAgilityPack is an awesome framework for HTML handling.
Paul Fryer