I want to extract the meaningful text from an HTML document, and I am using the Html Agility Pack to do it. Here is my code:

    string convertedContent = HttpUtility.HtmlDecode(ConvertHtml(HtmlAgilityPack.HtmlEntity.DeEntitize(htmlAsString)));

ConvertHtml:

    public string ConvertHtml(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        StringWriter sw = new StringWriter();
        ConvertTo(doc.DocumentNode, sw);
        sw.Flush();
        return sw.ToString();
    }

ConvertTo:

    public void ConvertTo(HtmlAgilityPack.HtmlNode node, TextWriter outText)
    {
        string html;
        switch (node.NodeType)
        {
            case HtmlAgilityPack.HtmlNodeType.Comment:
                // don't output comments
                break;

            case HtmlAgilityPack.HtmlNodeType.Document:
                foreach (HtmlNode subnode in node.ChildNodes)
                {
                    ConvertTo(subnode, outText);
                }
                break;

            case HtmlAgilityPack.HtmlNodeType.Text:
                // script and style must not be output
                string parentName = node.ParentNode.Name;
                if ((parentName == "script") || (parentName == "style"))
                    break;

                // get text
                html = ((HtmlTextNode)node).Text;

                // is it in fact a special closing node output as text?
                if (HtmlNode.IsOverlappedClosingElement(html))
                    break;

                // check the text is meaningful and not a bunch of whitespaces
                if (html.Trim().Length > 0)
                {
                    outText.Write(HtmlEntity.DeEntitize(html) + " ");
                }
                break;

            case HtmlAgilityPack.HtmlNodeType.Element:
                switch (node.Name)
                {
                    case "p":
                        // treat paragraphs as crlf
                        outText.Write("\r\n");
                        break;
                }

                if (node.HasChildNodes)
                {
                    foreach (HtmlNode subnode in node.ChildNodes)
                    {
                        ConvertTo(subnode, outText);
                    }
                }
                break;
        }
    }

In some cases the HTML pages are malformed. For example, http://rareseeds.com/cart/products/Purple_of_Romagna_Artichoke-646-72.html has a meta tag with a misspelled charset, <meta content="text/html; charset=uft-8" http-equiv="Content-Type"> (note "uft" instead of "utf"). For such pages my code fails at the point where I try to load the HTML document.

Can someone suggest how I can handle these malformed HTML pages and still extract the relevant text from the document?

Thanks, Kapil

A:

As the HtmlAgilityPack project page says, "The parser is very tolerant with 'real world' malformed HTML". However, the kind of error you describe may be too serious for it to correct on its own. You can set the default encoding with:

    HtmlDocument doc = new HtmlDocument();
    doc.OptionDefaultStreamEncoding = Encoding.UTF8;

PanJanek
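
If the page is downloaded as raw bytes, another option is to resolve the declared charset defensively before decoding, so that a misspelled name such as "uft-8" falls back to a known encoding instead of causing a failure. The following is only a minimal sketch under that assumption: `CharsetHelper`, `ResolveEncoding`, and `DecodePage` are hypothetical names that are not part of HtmlAgilityPack, and UTF-8 is an assumed fallback that may not suit every page. It relies on `Encoding.GetEncoding` throwing an `ArgumentException` for names it does not recognize.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class CharsetHelper
{
    // Returns the encoding for charsetName, or fallback when the name
    // is missing or not recognized by .NET (for example "uft-8").
    public static Encoding ResolveEncoding(string charsetName, Encoding fallback)
    {
        if (string.IsNullOrWhiteSpace(charsetName))
            return fallback;

        try
        {
            return Encoding.GetEncoding(charsetName.Trim());
        }
        catch (ArgumentException)
        {
            // Encoding.GetEncoding throws ArgumentException for invalid names.
            return fallback;
        }
    }

    // Decodes the raw page bytes, assuming UTF-8 when the declared
    // charset cannot be resolved.
    public static string DecodePage(byte[] rawBytes, string declaredCharset)
    {
        Encoding encoding = ResolveEncoding(declaredCharset, Encoding.UTF8);
        return encoding.GetString(rawBytes);
    }
}
```

The string returned by `DecodePage` could then be passed to `doc.LoadHtml(...)`, as `ConvertHtml` in the question already does, so the parser never has to interpret the broken charset declaration itself.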