Hi,

I'm trying to parse an HTML document using some code I found on this site, but I keep getting a parsing error:

    HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();

    // There are various options, set as needed
    htmlDoc.OptionFixNestedTags = true;

    // Load the HTML from a file on disk
    htmlDoc.Load(@"C:\Documents and Settings\Mine\My Documents\Random.html");

    // Use:  htmlDoc.LoadHtml(htmlString);  to load from a string

    // ParseErrors holds any errors from the Load statement
    // (newer releases expose it as IEnumerable<HtmlParseError>, where Count() from System.Linq is needed)
    if (htmlDoc.ParseErrors != null && htmlDoc.ParseErrors.Count > 0)
    {
        // Handle any parse errors as required
        MessageBox.Show("Oh no");
    }
    else
    {
        if (htmlDoc.DocumentNode != null)
        {
            // Look up the <head> element anywhere in the document
            HtmlAgilityPack.HtmlNode headNode = htmlDoc.DocumentNode.SelectSingleNode("//head");

            if (headNode != null)
            {
                MessageBox.Show("Hello");
            }
        }
    }

Any help would be appreciated :)

A: 

In the wild, HTML is likely to be non-conformant, non-compliant, and non-validating. Only XHTML or very simple HTML will load without populating ParseErrors. I've noticed that the HTML Agility Pack is fairly robust and will still build a decent DOM tree from most HTML sources, even if ParseErrors are generated. Drop the else, so that the code currently inside the else block runs whether or not ParseErrors is populated.
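As a rough sketch of that change (not the only way to do it): report the parse errors, then query the tree anyway. The inline HTML string and the console output are stand-ins I've assumed so the example is self-contained; in your code you would keep `Load` with your file path and `MessageBox.Show`.

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Hypothetical sample input with an unclosed <p>; your real input is the file on disk.
        string html = "<html><head><title>Test</title></head><body><p>Unclosed paragraph</body></html>";

        var htmlDoc = new HtmlDocument();
        htmlDoc.OptionFixNestedTags = true;
        htmlDoc.LoadHtml(html);

        // Report any parse errors, but do not stop because of them.
        if (htmlDoc.ParseErrors != null)
        {
            foreach (HtmlParseError error in htmlDoc.ParseErrors)
            {
                Console.WriteLine("Line {0}, position {1}: {2}",
                    error.Line, error.LinePosition, error.Reason);
            }
        }

        // No else: query the tree whether or not errors were reported.
        HtmlNode headNode = htmlDoc.DocumentNode.SelectSingleNode("//head");
        Console.WriteLine(headNode != null ? "Found <head>" : "No <head> node in the tree");
    }
}
```

Iterating with `foreach` avoids depending on whether `ParseErrors` has a `Count` property in the version you have installed.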

If it did not build the DOM tree, then you should investigate the ParseError(s) that were generated. If it only built a partial tree, try recursing over the nodes and printing each one (or showing it in a message box) to see which parts of the DOM tree were built. You might not need the whole tree.
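Here is a minimal sketch of that kind of recursive dump, assuming console output rather than message boxes. `TreeDump` and `Dump` are names I made up for illustration, and the inline HTML is a hypothetical sample; pass your own document's `DocumentNode` instead.

```csharp
using System;
using HtmlAgilityPack;

class TreeDump
{
    // Print the name of every element node, indented by its depth in the tree.
    static void Dump(HtmlNode node, int depth)
    {
        if (node.NodeType == HtmlNodeType.Element)
        {
            Console.WriteLine(new string(' ', depth * 2) + node.Name);
        }

        // Recurse into the children (text and comment nodes are visited but not printed).
        foreach (HtmlNode child in node.ChildNodes)
        {
            Dump(child, depth + 1);
        }
    }

    static void Main()
    {
        var htmlDoc = new HtmlDocument();
        // Hypothetical sample input with an unclosed <p>.
        htmlDoc.LoadHtml("<html><head><title>T</title></head><body><div><p>Hi</div></body></html>");

        Dump(htmlDoc.DocumentNode, 0);
    }
}
```

If the elements you care about show up in the output, the tree is good enough for your purposes even though ParseErrors is not empty.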

Allan