How would I use the HTML Agility Pack to get the First Paragraph of text from the body of an HTML file. I'm building a DIGG style link submission tool, and want to get the title and the first paragraph of text. Title is easy, any suggestions for how I might get the first paragraph of text from the body? I guess it could be within P or DIV depending on the page.
+2
A:
Is this html that you control? If so, you could give the p an id or a class and find it via
//p[@id=\"YOUR ID\"] or //p[@class=\"YOUR CLASS\"]
EDIT: Since you don't control the html, maybe the below will work. It takes all the HtmlTextNodes and tries to find a grouping of text greater than the threshold specified. It's far from perfect but might get you going in the right direction.
String summary = FindSummary(page.DocumentNode);
private const int THRESHOLD = 50;
private String FindSummary(HtmlAgilityPack.HtmlNode node) {
foreach (HtmlAgilityPack.HtmlNode childNode in node.ChildNodes) {
if (childNode.GetType() == typeof(HtmlAgilityPack.HtmlTextNode)) {
if (childNode.InnerText.Length >= THRESHOLD) {
return childNode.InnerText;
}
}
String summary = FindSummary(childNode);
if (summary.Length >= THRESHOLD) {
return summary;
}
}
return String.Empty;
}
BStruthers
2009-11-23 15:37:26
I don't control the HTML, users can submit any page they like, so I don't know what the ID or class of the container will be
reach4thelasers
2009-11-23 17:24:01
Thanks! That is what I was looking for!
reach4thelasers
2009-11-25 21:21:48
A:
The agility pack uses xpath for querying the html load you just use a simple xpath statement. Something like...
HtmlDocument htmldoc = new HtmlDocument();
htmldoc.LoadHtml(content);
HtmlNodeCollection firstParagraph = htmldoc.DocumentNode.SelectNodes("//p[1]");
Sheff
2009-11-23 15:38:29