ansaurus

Question

How can I strip non-XHTML tags from a string in C#?

Answer 1

+2 A:

I don't know C#, but I'm sure it has some lenient HTML DOM parsers, lenient in that they can deal with self-closing or unclosed tags reasonably well.

I guess there's not much else to do than parse the tree with such a library, throw out any node that does not match the list of valid XHTML tags, and pack it back into a string again.

Pekka 2010-06-06 14:19:47

Answer 2

A:

Right, this is how I've done it, using the HtmlAgilityPack (http://htmlagilitypack.codeplex.com/).

It seems a bit too easy, makes me think I've overlooked possible issues with it, but here is the code:

// Needs "using System.Linq;" at the top of the file for allowedTags.Contains(...)
// Allowed Tags: http://www.w3schools.com/tags/default.asp
string[] allowedTags = { "a", "abbr", "acronym", "address", "applet", "area", "b", "base", 
   "basefont", "bdo", "big", "blockquote", "body", "br", "button", 
   "caption", "center", "cite", "code", "col", "colgroup", "dd", 
   "del", "dfn", "dir", "div", "dl", "dt", "em", "fieldset", "font", 
   "form", "frame", "frameset", "h1", "h2", "h3", "h4", "h5", "h6", 
   "head", "hr", "html", "i", "iframe", "img", "input", "ins", "isindex", 
   "kbd", "label", "legend", "li", "link", "map", "menu", "meta", 
   "noframes", "noscript", "object", "ol", "optgroup", "option", "p", 
   "param", "pre", "q", "s", "samp", "script", "select", "small", 
   "span", "strike", "strong", "style", "sub", "sup", "table", "tbody", 
   "td", "textarea", "tfoot", "th", "thead", "title", "tr", "tt", "u", 
   "ul", "var", "xmp" };


HtmlAgilityPack.HtmlDocument fullHtml = new HtmlAgilityPack.HtmlDocument();

fullHtml.LoadHtml(myStringOfHtml);

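// "//*" selects every element in the document (text nodes and comments are not matched)
HtmlAgilityPack.HtmlNodeCollection allNodes = fullHtml.DocumentNode.SelectNodes("//*");

// SelectNodes returns null, not an empty collection, when nothing matches
if (allNodes != null)
{
    foreach (var item in allNodes)
    {
        // Remove() detaches the whole element, children included
        if (!allowedTags.Contains(item.Name))
            item.Remove();
    }
}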

string output1 = fullHtml.DocumentNode.InnerHtml;
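
If the text inside a disallowed tag should survive and only the tag itself should go, a variation along these lines might work. This is only a sketch: it assumes your version of HtmlAgilityPack has the RemoveChild(node, keepGrandChildren) overload, and TagStripper and StripDisallowedTags are names I made up for illustration, not part of the library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper class, not part of HtmlAgilityPack
public static class TagStripper
{
    // Drops every element whose name is not in allowedTags,
    // but moves its children up into the parent instead of discarding them.
    public static string StripDisallowedTags(string html, IEnumerable<string> allowedTags)
    {
        // Trim the entries and ignore case so a stray space or capital letter cannot prevent a match
        var allowed = new HashSet<string>(
            allowedTags.Select(t => t.Trim()),
            StringComparer.OrdinalIgnoreCase);

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//*");
        if (nodes == null)
            return html; // no elements found, nothing to strip

        // Walk a reversed snapshot (deepest elements first) so that
        // changing the tree does not disturb the iteration
        foreach (HtmlNode node in Enumerable.Reverse(nodes))
        {
            if (!allowed.Contains(node.Name) && node.ParentNode != null)
                node.ParentNode.RemoveChild(node, true); // true = keep grandchildren
        }

        return doc.DocumentNode.InnerHtml;
    }
}
```

You would call it as TagStripper.StripDisallowedTags(myStringOfHtml, allowedTags). Whether to unwrap or to delete outright depends on the tag: for something like an unknown formatting tag you probably want the text kept, but you may still prefer the plain Remove() for elements whose content is not meant to be displayed.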

Let me know if you think there are any problems with this. The HTML I'm dealing with always has closing tags and is (relatively) well formed, as it's been through a custom HTML checker written by another company before storing it in a database. So I'm not sure how this works with badly formed HTML.

Thanks to Pekka for the suggestion to take the 'search and destroy' method.

No Average Geek 2010-06-07 14:59:50
