ansaurus

Question

Removing anything between XML tags and their content

Answer 1

+5 A:

Please don't use regular expressions to parse XML. It's horribly error prone.

Use a proper XML library, which will make this trivial. There are XML libraries available for just about every programming platform you could ask for - there's really no excuse to use a regular expression for XML.

Jon Skeet 2009-07-20 19:04:58

53 seconds!

tj111 2009-07-20 19:06:32

Yikes, 1min 12sec here!!!

Janie 2009-07-20 19:07:34

Answer 2

+1 A:

You shouldn't use regex to parse XML or HTML: it's not reliable, and there are far too many edge cases. Use an XML/HTML parser for this kind of thing instead.
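One such edge case, as a rough Ruby sketch: a literal ">" is legal inside an XML attribute value, which is enough to confuse a simple tag-matching pattern. The naive_tags helper and its pattern are hypothetical, made up here purely for illustration.

```ruby
# Hypothetical naive tag scanner: "<", then anything that is not ">", then ">".
def naive_tags(xml)
  xml.scan(/<[^>]+>/)
end

# Well-formed XML: the ">" inside the attribute value is allowed.
input = '<a title="x > y">text</a>'

# The pattern stops at the ">" inside the attribute, so the "opening tag"
# it reports is cut off in the middle of the attribute value.
p naive_tags(input)
# prints: ["<a title=\"x >", "</a>"]
```

A real parser tracks quoting, comments and CDATA sections, so it reports the whole opening tag here.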

tj111 2009-07-20 19:05:52

Answer 3

+1 A:

Don't use regex. Try parsing the XML into a DOM and manipulating it from there. (What language/framework are you using?)

Janie 2009-07-20 19:06:03

Answer 4

+2 A:

It is generally not a good idea to parse XML using regular expressions. One of the major benefits of XML is that there are dozens of well-tested parsers out there for any language/framework that you might ever want. There are some tricky rules within XML that prevent any regular expression from being able to properly parse XML.

That said, something like:

s/>.*?</></gs

(that is Perl syntax) might do what you want. It takes everything from a greater-than sign up to the next less-than sign and strips it away, so all text between tags is removed, not just whitespace. The "g" at the end performs the substitution as many times as needed, and the "s" makes the "." match all characters INCLUDING newlines (otherwise the match could never cross a line break, so content that spans multiple lines would be left in place).
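For readers working in Ruby, a rough equivalent of that substitution might look like the sketch below. The strip_content name and the sample input are assumptions for illustration only; gsub plays the role of Perl's "g" flag, and Ruby's /m flag plays the role of Perl's "s" flag.

```ruby
# Ruby counterpart of the Perl substitution s/>.*?</></gs.
# gsub replaces every match (Perl's "g"); the /m flag makes "." match
# newlines as well (Perl's "s").
def strip_content(xml)
  xml.gsub(/>.*?</m, '><')
end

xml = "<root>\n  <name>Alice</name>\n</root>"
puts strip_content(xml)
# prints: <root><name></name></root>
```

Note that the element text "Alice" disappears along with the whitespace, which is what the pattern asks for.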

Adam Batkin 2009-07-20 19:08:11

Answer 5

A:

Thanks for the comments, but I don't want to parse XML by hand. I'm already using Hpricot (Ruby), but I'm stuck on version 0.6.164 since we're running on JRuby. Unfortunately, Hpricot often returns weird nodes that contain only whitespace and line breaks, so I thought about cleaning up the XML string before converting it into an Hpricot document. Alternative solutions appreciated ;)

An example from a test: NoMethodError: undefined method `children' for "\n ":Hpricot::Text
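If pre-cleaning the string really is the only option, a narrower pattern that touches only whitespace-only runs between tags could be sketched as follows. The strip_blank_text helper is hypothetical. It also knows nothing about CDATA sections, comments, or mixed content where whitespace matters, so treat it as a stopgap under those assumptions rather than a general solution.

```ruby
# Hypothetical helper: drop runs of whitespace that sit directly between
# a ">" and the next "<", leaving all other text untouched.
def strip_blank_text(xml)
  xml.gsub(/>\s+</, '><')
end

xml = "<root>\n  <name>Alice Smith</name>\n</root>"
puts strip_blank_text(xml)
# prints: <root><name>Alice Smith</name></root>
```

Because \s+ has to cover everything between the two brackets, text such as "Alice Smith" is left alone.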

2009-07-21 11:20:33

Edit your question; do not post comments as answers.

bortzmeyer 2009-07-21 13:55:17

I did. Thanks for the hint!

rubiii 2009-08-30 18:47:57

Answer 6

A:

A solution is to select all "blank" text nodes and remove them.

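require 'nokogiri'
doc = Nokogiri::XML(xml_source)  # parse explicitly as XML
# Select every text node that contains only whitespace and remove it
doc.xpath('//text()[not(normalize-space())]').remove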

mislav 2009-10-06 21:38:27
