I would like a brief and easy way to strip tags from an XHTML document, and I believe there must be something concise enough among the options: XSLT, XPath, XQuery, or custom C# programming using the .NET XML namespace. I'm open to others.

For example, I want to strip all <b> tags from an XHTML document but keep their inner content and child tags (i.e. not simply skip the bold tag and its children).

I need to maintain the structure of the original document minus the stripped tags.

Thoughts:

  • I've seen XSLT's ability to match elements for selection; however, I want to match everything by default with a couple of exceptions, and I'm unsure XSLT is conducive to this. This is what I'm looking at right now.

  • XQuery I haven't started to look into. (Update for XQuery: I took a brief look at this technology, and it is similar enough to SQL in function that I fail to see how it could maintain the nested node structure of the original document. I don't think it is a contender.)

  • A custom C#/.NET XML namespace program might be viable, as I already have an idea for it, but my immediate assumption is that it would be more involved than using one of the XML-specific matching languages, which were created for this kind of task.

  • ... another kind of enabling technology I haven't yet considered...

+3  A: 

I need to maintain the structure of the original document minus the stripped tags

Have you thought of XSLT? This is the language specifically designed for transforming XML and tree structures in general.

This transformation:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="node()|@*">
  <xsl:copy>
   <xsl:apply-templates select="node()|@*"/>
  </xsl:copy>
 </xsl:template>

 <xsl:template match="b">
  <xsl:apply-templates/>
 </xsl:template>
</xsl:stylesheet>

when applied to an XHTML document such as the one below:

<html>
 <head/>
 <body>
  <p> Hello, <b>World</b>!</p>
 </body>
</html>

produces the wanted result:

<html>
   <head/>
   <body>
      <p> Hello, World!</p>
   </body>
</html>
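
One caveat, offered as a sketch and not as part of the tested answer: the sample document above declares no namespace. A real XHTML document normally puts xmlns="http://www.w3.org/1999/xhtml" on <html>, and in XSLT 1.0 an unprefixed name in a match pattern only matches elements in no namespace, so match="b" would not fire there. Assuming that namespace, binding it to a prefix in the stylesheet should restore the match (the prefix x is an arbitrary choice, not something from the question):

```xml
<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:x="http://www.w3.org/1999/xhtml">
 <xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <!-- Identity rule: copy every node and attribute unchanged -->
 <xsl:template match="node()|@*">
  <xsl:copy>
   <xsl:apply-templates select="node()|@*"/>
  </xsl:copy>
 </xsl:template>

 <!-- Assumption: <b> is in the XHTML namespace; emit only its children -->
 <xsl:template match="x:b">
  <xsl:apply-templates/>
 </xsl:template>
</xsl:stylesheet>
```

Because the stylesheet contains no literal result elements, the x prefix should not show up in the output; the copied elements keep whatever namespace declarations they had in the source.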
Dimitre Novatchev
I had thought of XSLT, in fact just updated the question to reflect that because I mistakenly called it XPath. However I couldn't think of a good XSLT for the problem. Apparently you have the solution. I will try it ....
John K
@John-K: You are welcome. Please, don't hesitate to ask if there is something that needs to be explained. :)
Dimitre Novatchev
... works like a charm. Thanks. I'm using it via the XslCompiledTransform Class http://msdn.microsoft.com/en-us/library/system.xml.xsl.xslcompiledtransform(v=VS.90).aspx
John K
@Dimitre: `<xsl:strip-space elements="*"/>` is a bit too much for HTML/XHTML: whitespace-only text inside `<pre>` elements will be stripped. So I suggest removing that line.
dolmen
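
A sketch of what the last comment suggests: the same stylesheet with the xsl:strip-space line removed. Dropping indent="yes" as well is my own assumption, not something the comment asked for; the XSLT 1.0 spec warns that indenting is usually unsafe for documents with mixed content, which is what XHTML is.

```xml
<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <!-- No xsl:strip-space and no indent="yes":
      the stylesheet itself neither strips nor adds whitespace -->
 <xsl:output method="xml" omit-xml-declaration="yes"/>

 <!-- Identity rule: copy every node and attribute unchanged -->
 <xsl:template match="node()|@*">
  <xsl:copy>
   <xsl:apply-templates select="node()|@*"/>
  </xsl:copy>
 </xsl:template>

 <!-- Drop the <b> element itself but keep processing its children -->
 <xsl:template match="b">
  <xsl:apply-templates/>
 </xsl:template>
</xsl:stylesheet>
```

If stripping is still wanted elsewhere, keeping the strip-space line and adding `<xsl:preserve-space elements="pre"/>` is another option, since the more specific name takes priority over `*`. Note that stripping everywhere also removes a lone space between adjacent inline elements, which can run words together.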