I'm using VB.Net in an ASP.Net 2.0 app to run some regular expressions that remove some unnecessary markup. One of the things that I'd like to do is remove span elements that don't have any attributes in them:

output = Regex.Replace(output, "<span\s*>(?<Text>.*?)</span>" & styleRegex, "${Text}", RegexOptions.Compiled Or RegexOptions.CultureInvariant Or RegexOptions.IgnoreCase Or RegexOptions.Singleline)

So for this content:

<span>Lorem <span class="special">ipsum</span> dolor sit amet.</span>

I'd like to remove the outer span elements. Unfortunately, my regex above gives me this as a result, since the closing span matches the first one it comes across:

Lorem <span class="special">ipsum dolor sit amet.</span>

Is this possible with a RegEx or will I have to implement something a bit more advanced?

A: 

I would use XSLT rather than regex.

It seems .NET has good support for XSLT (google: xslt vb.net) but I don't know whether it will parse non-XHTML. The standard xsltproc command will, with the --html flag.

Eric Drechsel
+2  A: 

Unfortunately, regular expressions in the formal sense do not have this power. Matching nested tags requires at least a context-free language. (Sorry for the theoretical stuff.)

I'd also propose to use XSLT instead.
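A footnote to the theory: the .NET engine goes beyond formal regular expressions with balancing groups, which can track nesting depth. The following is only a minimal sketch, assuming the span tags in the input are properly nested. The names BalancedSpanSketch, StripOuterSpans and Depth are made up for illustration, and the pattern has not been tried against real-world markup.

```vb
Imports System
Imports System.Text.RegularExpressions

Module BalancedSpanSketch
    ' Pattern pieces, in order:
    '   <span\s*>                  an opening span with no attributes
    '   (?:(?!</?span\b).)+        text that does not start a span tag
    '   <span\b[^>]*>(?<Depth>)    a nested opening tag: push onto Depth
    '   </span>(?<-Depth>)         a nested closing tag: pop from Depth
    '   (?(Depth)(?!))             fail if any nested span is still open
    Private Const Pattern As String = _
        "<span\s*>(?<Text>(?>(?:(?!</?span\b).)+" & _
        "|<span\b[^>]*>(?<Depth>)" & _
        "|</span>(?<-Depth>))*(?(Depth)(?!)))</span>"

    ' Hypothetical helper, not from the original post.
    Function StripOuterSpans(ByVal html As String) As String
        Return Regex.Replace(html, Pattern, "${Text}", _
            RegexOptions.IgnoreCase Or RegexOptions.Singleline)
    End Function

    Sub Main()
        Console.WriteLine(StripOuterSpans( _
            "<span>Lorem <span class=""special"">ipsum</span> dolor sit amet.</span>"))
    End Sub
End Module
```

Regex.Replace does not rescan the text it has just replaced, so an attribute-less span nested inside another attribute-less span survives a single pass. Calling the helper repeatedly until the output stops changing would be one way around that.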

Thomas Danecker
A: 

The HTML Agility Pack should help with this.

HTML Agility Pack on Codeplex

Jeff Meatball Yang
A: 

XSLT isn't an option, since the input may not always be valid XML. The HTML Agility Pack on Codeplex looks pretty sweet, but it is really overkill in this case. Here's the final RegEx I ended up using:

<span\s*>(?<Text>.*?(?:<span[^>]*>.*?</span>.*?)*)</span>

Replacing that with ${Text} effectively stripped the useless outer span tags in all cases I've tested.
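For anyone who wants to try it, here is a minimal sketch that runs the pattern against the sample from the question. The module and variable names are just for illustration, and the options (IgnoreCase and Singleline, borrowed from the original call) are an assumption.

```vb
Imports System
Imports System.Text.RegularExpressions

Module StripSpanDemo
    Sub Main()
        ' The pattern quoted above, unchanged.
        Dim pattern As String = _
            "<span\s*>(?<Text>.*?(?:<span[^>]*>.*?</span>.*?)*)</span>"

        ' Sample markup from the question.
        Dim input As String = _
            "<span>Lorem <span class=""special"">ipsum</span> dolor sit amet.</span>"

        ' Options are assumed; Singleline lets "." match across line breaks.
        Dim output As String = Regex.Replace(input, pattern, "${Text}", _
            RegexOptions.IgnoreCase Or RegexOptions.Singleline)

        ' Prints: Lorem <span class="special">ipsum</span> dolor sit amet.
        Console.WriteLine(output)
    End Sub
End Module
```

As far as I can tell, the pattern only accounts for spans nested one level deep inside the outer span. With deeper nesting the outer match can end at an inner closing tag, so it is worth testing against your own content.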

travis