Possible Duplicate:
RegEx match open tags except XHTML self-contained tags

Hi all,

I know how everyone loves a regex question, so here is mine. I have an XML tree within which some nodes contain CDATA. How do I return just a string containing the data?

Let's see an example:

<xml>
  <node>I'm plain text.</node>
  <node><![CDATA[I'm text in cdata... and may contain html, <strong>yikes!</strong>]]></node>
</xml>

Would return:

I'm plain text. I'm text in cdata... and may contain html, yikes!

I've read about not parsing an irregular language with a regular one, but I'm sure this is doable. What do you reckon, guys?

Thanks, Kevin

EDIT: This was a problem that needed a quick and dirty solution to deal with a few lines of XML. I was surprised at the initial flat refusal, but from further reading (in particular from links provided later on) I see that experienced programmers know it's something that should be avoided wherever possible. Live and learn. Thanks.

+5  A: 

Don't use regex, use an XML/HTML parser.

This issue has been beaten to death.
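
For the example in the question, a parser-based version might look like the sketch below. It assumes Python and only the standard library (xml.etree.ElementTree plus html.parser); the question does not name a language, and the names XML_DOC, TextCollector, strip_html and extract_text are made up for illustration. The XML parser unwraps the CDATA section on its own, and a second pass through an HTML parser drops any markup that was inside it.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Sample document copied from the question.
XML_DOC = """<xml>
  <node>I'm plain text.</node>
  <node><![CDATA[I'm text in cdata... and may contain html, <strong>yikes!</strong>]]></node>
</xml>"""


class TextCollector(HTMLParser):
    """Hypothetical helper: keeps text content and ignores all tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.chunks.append(data)


def strip_html(fragment):
    """Return the text of an HTML fragment with the markup removed."""
    collector = TextCollector()
    collector.feed(fragment)
    collector.close()  # flush any text still buffered by the parser
    return "".join(collector.chunks)


def extract_text(xml_string):
    """Join the text of every <node>, whether plain or wrapped in CDATA."""
    root = ET.fromstring(xml_string)
    # The XML parser has already unwrapped CDATA, so node.text is a plain
    # string that may still contain HTML markup.
    parts = (strip_html(node.text or "").strip() for node in root.iter("node"))
    return " ".join(parts)


if __name__ == "__main__":
    print(extract_text(XML_DOC))
    # prints: I'm plain text. I'm text in cdata... and may contain html, yikes!
```

Note that this only reads the direct text of each node; nested elements or tail text would need extra handling.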

John Weldon
"That's right, if you attempt to parse HTML with regular expressions, you're succumbing to the temptations of the dark god Cthulhu's … er … code" - Jeff Atwood. http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html
TandemAdam
A: 

Look at boilerpipe for an example of how hard it is to solve this problem.

bmargulies