I need to be able to remove non-XHTML tags from a string of XHTML that has been stored in a database. The string also contains references to controls (e.g. <mycontrols:mycontrol>) inside the XHTML, but I need clean XHTML with all standard tag content left unchanged.

These control tags are varied (they could be any ASP.NET control), so there are too many to go looking for each one and remove it. The way they are closed also varies: not all of them have closing tags, and some are self-closing.

How can I go about doing this? I've found some HTML cleaners online to include in my project, but they either remove everything or just HTML-encode the entire string.

Also, I'm dealing with parts of XHTML documents, not entire documents - I don't know if that makes a difference.

Any help would be appreciated.

An example (not fantastic, but gives you the idea of what I'm working with):

<p><mycontrols:mycontrol myproperty="hello world" myproperty2="7"><SPAN><a href="#"><img title="an example image" height="68" width="180" alt="an example image" src="images/example1.gif"></a></span></mycontrols:mycontrol><a href="#"></a></p>

Needs to become:

<p><a href="#"></a></p>
+2  A: 

I don't know C#, but I'm sure it has some lenient HTML DOM parsers - lenient in that they can deal with self-closing or unclosed tags halfway properly.

I guess there's not much else to do but parse the tree with such a library, throw out any node that doesn't match the list of valid XHTML tags, and pack it back into a string again.

Pekka
A: 

Right, this is how I've done it, using the HtmlAgilityPack (http://htmlagilitypack.codeplex.com/).

It seems a bit too easy, which makes me think I've overlooked possible issues with it, but here is the code:

// Allowed Tags: http://www.w3schools.com/tags/default.asp
string[] allowedTags = { "a", "abbr", "acronym", "address", "applet", "area", "b", "base",
   "basefont", "bdo", "big", "blockquote", "body", "br", "button",
   "caption", "center", "cite", "code", "col", "colgroup", "dd",
   "del", "dfn", "dir", "div", "dl", "dt", "em", "fieldset", "font",
   "form", "frame", "frameset", "h1", "h2", "h3", "h4", "h5", "h6",
   "head", "hr", "html", "i", "iframe", "img", "input", "ins", "isindex",
   "kbd", "label", "legend", "li", "link", "map", "menu", "meta",
   "noframes", "noscript", "object", "ol", "optgroup", "option", "p",
   "param", "pre", "q", "s", "samp", "script", "select", "small",
   "span", "strike", "strong", "style", "sub", "sup", "table", "tbody",
   "td", "textarea", "tfoot", "th", "thead", "title", "tr", "tt", "u",
   "ul", "var", "xmp" };

HtmlAgilityPack.HtmlDocument fullHtml = new HtmlAgilityPack.HtmlDocument();

fullHtml.LoadHtml(myStringOfHtml);

// Select every element in the fragment
// (Contains below is the LINQ extension, so the file needs "using System.Linq;")
HtmlAgilityPack.HtmlNodeCollection allNodes = fullHtml.DocumentNode.SelectNodes("//*");

if (allNodes != null)
{
    foreach (var item in allNodes)
    {
        // Compare case-insensitively so mixed-case tags like <SPAN> are still recognised
        if (!allowedTags.Contains(item.Name, StringComparer.OrdinalIgnoreCase))
            item.Remove();
    }
}

string output1 = fullHtml.DocumentNode.InnerHtml;

Let me know if you think there are any problems with this. The HTML I'm dealing with always has closing tags and is (relatively) well formed, as it's been through a custom HTML checker written by another company before being stored in the database, so I'm not sure how it copes with badly formed HTML.
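
One thing to note: item.Remove() drops the node together with everything nested inside it, which is what I want here (the example output above loses the span and img as well). If you ever needed to strip a disallowed tag but keep its inner markup, something along these lines should work instead - a rough, untested sketch relying on HtmlAgilityPack's RemoveChild overload that keeps grandchildren:

// Variation: remove the disallowed element itself but hoist its children up,
// so the markup nested inside it survives in the output.
foreach (var item in allNodes)
{
    if (!allowedTags.Contains(item.Name, StringComparer.OrdinalIgnoreCase))
        item.ParentNode.RemoveChild(item, true); // true = keep grandchildren
}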

Thanks to Pekka for suggesting the 'search and destroy' approach.

No Average Geek