I've looked for tutorials on HTML Agility Pack, since it seems to do everything I need, but for such a powerful tool there is surprisingly little written about it on the Internet.

I am writing a simple method that will retrieve any given tag based on name:

public string[] GetTagsByName(string TagName, string Source) {
    ...
}

This could easily be done with a regular expression, but we all know that using regex to parse HTML isn't right. So far I have the following code:

...
// TODO: Clear Comments (can this be done or should I use RegEx?)
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(Source);
ArrayList tags = new ArrayList();
string xpath = "//" + TagName;
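// SelectNodes yields element nodes (HtmlNode) and returns null when nothing matches
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
if (nodes != null) {
    foreach (HtmlNode node in nodes) {
        tags.Add(node.OuterHtml);
    }
}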
return (string[])tags.ToArray(typeof(String));

I would like to first strip all comments from the HTML, then return the matching tags by name. If possible, I'd also like to return certain meta tags based on an attribute, such as robots. I'm not that great with XPath, so any help with that would be good.

Any help would be much appreciated.

+1  A: 

HtmlAgilityPack's HtmlDocument implements IXPathNavigable, so it uses the standard .NET XPath engine. Any XPath 1.0 documentation will be applicable, especially if it covers System.Xml.XPath.

"//comment()" finds all comments
"//meta" finds all "meta" elements
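
As a rough sketch of how these expressions might be used for the comment-stripping and tag-lookup parts of the question: the class name TagHelper and the method StripComments below are illustrative names, not part of the library, and returning OuterHtml for each match is an assumption about what "the tag" should mean here.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class TagHelper
{
    // Hypothetical helper: removes every comment node from the parsed document.
    public static void StripComments(HtmlDocument doc)
    {
        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection comments = doc.DocumentNode.SelectNodes("//comment()");
        if (comments == null) return;

        foreach (HtmlNode comment in comments)
        {
            // Detach the comment from its parent; the selected collection itself is not modified.
            comment.ParentNode.RemoveChild(comment);
        }
    }

    // Returns the outer HTML of every element with the given tag name.
    public static string[] GetTagsByName(string tagName, string source)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(source);
        StripComments(doc);

        List<string> tags = new List<string>();
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//" + tagName);
        if (nodes != null)
        {
            foreach (HtmlNode node in nodes)
            {
                tags.Add(node.OuterHtml);
            }
        }
        return tags.ToArray();
    }
}
```

One thing to watch for: as far as I know, the parser may also represent the doctype declaration as a comment node, so check the node's text before removing it if the doctype needs to survive.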

HtmlDocument was designed to look very much like XmlDocument, so XmlDocument examples and tutorials will be somewhat applicable.
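
For instance, the familiar SelectSingleNode pattern from XmlDocument should carry over to the meta-tag-by-attribute case from the question. This is only a sketch: MetaHelper and GetMetaContent are made-up names, and it assumes the meta name is matched exactly (case-sensitive on the attribute value) and contains no single quotes.

```csharp
using HtmlAgilityPack;

public static class MetaHelper
{
    // Hypothetical helper: returns the content attribute of the first
    // <meta name="..."> element, or null when no such element exists.
    public static string GetMetaContent(string source, string metaName)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(source);

        // XPath 1.0 attribute predicate; metaName is assumed to contain no single quotes.
        HtmlNode meta = doc.DocumentNode.SelectSingleNode("//meta[@name='" + metaName + "']");
        if (meta == null) return null;

        // Falls back to an empty string when the content attribute is missing.
        return meta.GetAttributeValue("content", string.Empty);
    }
}
```

Calling MetaHelper.GetMetaContent(html, "robots") would then give the robots directives, if the page declares any.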


Lachlan Roche