I have a need to verify a specific hyperlink exists on a given web page. I know how to download the source HTML. What I need help with is figuring out if a "target" url exists as a hyperlink in the "source" web page.
Here is a little console program to demonstrate the problem:
public static void Main()
{
var sourceUrl = "http://developer.yahoo.com/search/web/V1/webSearch.html";
var targetUrl = "http://developer.yahoo.com/ypatterns/";
Console.WriteLine("Source contains link to target? Answer = {0}",
SourceContainsLinkToTarget(
sourceUrl,
targetUrl));
Console.ReadKey();
}
private static bool SourceContainsLinkToTarget(string sourceUrl, string targetUrl)
{
string content;
using (var wc = new WebClient())
content = wc.DownloadString(sourceUrl);
return content.Contains(targetUrl); // Need to ensure this is in a <href> tag!
}
Notice the comment on the last line. I can see if the target URL exists in the HTML of the source URL, but I need to verify that URL is inside of a <href/>
tag. This way I can validate it's actually a hyperlink, instead of just text.
I'm hoping someone will have a kick-ass regular expression or something I can use.
Thanks!
Here is the solution using the HtmlAgilityPack:
private static bool SourceContainsLinkToTarget(string sourceUrl, string targetUrl)
{
var doc = (new HtmlWeb()).Load(sourceUrl);
foreach (var link in doc.DocumentNode.SelectNodes("//a[@href]"))
if (link.GetAttributeValue("href",
string.Empty).Equals(targetUrl))
return true;
return false;
}