ansaurus

Question

Answer 1

+2 A:

Is this HTML that you control? If so, you could give the <p> element an id or a class and find it via

//p[@id="YOUR ID"] or //p[@class="YOUR CLASS"]

EDIT: Since you don't control the HTML, maybe the code below will work. It walks the document depth-first and returns the first text node (HtmlTextNode) whose text is at least as long as the specified threshold. It's far from perfect, but it might get you going in the right direction.

String summary = FindSummary(page.DocumentNode);

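// Minimum number of characters a text node needs to count as a summary.
private const int THRESHOLD = 50;

// Depth-first search: returns the first text node whose text is at least
// THRESHOLD characters long, or String.Empty if there is none.
private String FindSummary(HtmlAgilityPack.HtmlNode node) {
    foreach (HtmlAgilityPack.HtmlNode childNode in node.ChildNodes) {
        // A long enough text node directly under this node wins immediately.
        if (childNode is HtmlAgilityPack.HtmlTextNode
                && childNode.InnerText.Length >= THRESHOLD) {
            return childNode.InnerText;
        }

        // Otherwise search this child's subtree before moving to the next sibling.
        String summary = FindSummary(childNode);
        if (summary.Length >= THRESHOLD) {
            return summary;
        }
    }

    return String.Empty;
}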
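
One limitation of a plain text-node search is that text inside <script> and <style> elements also counts as text nodes, and leading or trailing whitespace counts toward the length. The following is a minimal sketch of a variant that filters those out, assuming the HtmlAgilityPack package is referenced. The SummaryFinder class and its Threshold value are hypothetical names used for illustration, not part of the library.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper class; not part of HtmlAgilityPack.
static class SummaryFinder
{
    // Assumed minimum length for a block of text to count as a summary.
    private const int Threshold = 50;

    public static string FindSummary(HtmlDocument page)
    {
        return page.DocumentNode
            .Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Text)
            // Skip text inside <script> and <style>, which is code rather than prose.
            .Where(n => n.ParentNode.Name != "script" && n.ParentNode.Name != "style")
            // Decode entities such as &amp; and drop surrounding whitespace.
            .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim())
            .FirstOrDefault(text => text.Length >= Threshold)
            ?? string.Empty;
    }
}
```

It would be called as SummaryFinder.FindSummary(page). Like the recursive version, it is only a heuristic: navigation text or boilerplate that happens to be long enough can still be picked up before the real content.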

BStruthers 2009-11-23 15:37:26

I don't control the HTML; users can submit any page they like, so I don't know what the ID or class of the container will be.

reach4thelasers 2009-11-23 17:24:01

Thanks! That is what I was looking for!

reach4thelasers 2009-11-25 21:21:48

Answer 2

A:

The Agility Pack uses XPath for querying the HTML, so once the document is loaded you can just use a simple XPath statement. Something like...

HtmlDocument htmldoc = new HtmlDocument();
htmldoc.LoadHtml(content);

// SelectSingleNode returns the first match in document order, or null if there is no <p>.
HtmlNode firstParagraph = htmldoc.DocumentNode.SelectSingleNode("//p[1]");
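
A note on the //p[1] expression: in XPath 1.0 the [1] predicate applies per parent, so //p[1] matches every <p> that is the first <p> child of its parent, and a SelectNodes call with it can return several nodes. Wrapping the path in parentheses, (//p)[1], indexes the whole result set instead. Here is a minimal sketch of the difference, assuming the HtmlAgilityPack package is referenced. The sample markup and the FirstParagraphDemo class name are made up for illustration.

```csharp
using System;
using HtmlAgilityPack;

class FirstParagraphDemo
{
    static void Main()
    {
        // Made-up sample markup: two <div>s, each with its own first <p>.
        const string content =
            "<div><p>First paragraph.</p><p>Second.</p></div>" +
            "<div><p>Third, but first inside its own div.</p></div>";

        var htmldoc = new HtmlDocument();
        htmldoc.LoadHtml(content);

        // Every <p> that is the first <p> child of its parent.
        HtmlNodeCollection perParent = htmldoc.DocumentNode.SelectNodes("//p[1]");

        // Only the first <p> in the whole document.
        HtmlNodeCollection wholeDocument = htmldoc.DocumentNode.SelectNodes("(//p)[1]");

        // SelectNodes may return null rather than an empty collection when nothing matches.
        Console.WriteLine(perParent?.Count ?? 0);
        Console.WriteLine(wholeDocument?.Count ?? 0);
    }
}
```

For a page summary, the first paragraph in document order is often a cookie notice or byline rather than the article text, so this approach probably works best combined with a length check.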

Sheff 2009-11-23 15:38:29


HTML Agility Pack - Get Page Summary
