I need to be able to remove non-XHTML tags from a string containing XHTML that has been stored in a database. The string also contains references for controls (e.g. ) inside the XHTML, but I need clean XHTML with all standard tag contents unchanged.
These control tags are varied (they could be any ASP.NET control), so there are too many to go looking for each one and remove them. The way they are closed is also varied, so not all of them have closing tags, some are self closing.
How can I go about doing this? I've found some HTML cleaners on-line for including in my project, but they either remove everything or just HTML encode the entire string.
Also, I'm dealing with parts of XHTML documents, not entire documents - don't know if that makes a difference.
Any help would be appreciated.
An example (not fantastic, but gives you the idea of what I'm working with):
<p><mycontrols:mycontrol myproperty="hello world" myproperty2="7"><SPAN><a href="#"><img title="an example image" height="68" width="180" alt="an example image" src="images/example1.gif"></a></span></mycontrols:mycontrol><a href="#"></a></p>
Needs to become:
<p><a href="#"></a></p>