I need to strip all xml tags from an xml document, but keep the space the tags occupy, so that the textual content stays at the same offsets as in the xml. This needs to be done in Java, and I thought RegExp would be the way to go, but I have found no simple way to get the length of the tags that match my regular expression.

Basically what I want is this:

Pattern p = Pattern.compile("<[^>]+>[^<]*]+>"); 
Matcher m = p.matcher(stringWithXMLContent); 
String strippedContent = m.replaceAll("THIS IS A STRING OF WHITESPACES IN THE LENGTH OF THE MATCHED TAG");

Hope somebody can help me to do this in a simple way!

+4  A: 

Since < and > characters always surround starting and ending tags in XML, this may be simpler with a straightforward state machine. Simply loop over all characters (in some writable form, not stored in a string), and if you encounter a <, switch on a "replacement mode" and replace every character with a space until you encounter a >. (Be sure to replace both the initial < and the closing >.)

If you care about layout, you may wish to avoid replacing tab characters and/or newline characters. If all you care about is overall string length, that obviously won't matter.
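The loop described above could be sketched roughly as follows. This is only a minimal illustration, not a complete solution: the class and method names (TagBlanker, blankTags) are made up for this example, and it treats every < ... > run as a tag, so comments, CDATA sections and > inside attribute values are not handled. Tabs and line breaks inside tags are kept so the layout survives.

```java
final class TagBlanker {

    /** Replaces every character from '<' through the next '>' with a space. */
    static String blankTags(String xml) {
        char[] chars = xml.toCharArray(); // writable copy; Strings are immutable
        boolean inTag = false;            // the "replacement mode" flag
        for (int i = 0; i < chars.length; i++) {
            char c = chars[i];
            if (c == '<') {
                inTag = true;             // a tag starts here
            }
            if (inTag) {
                // keep tabs and line breaks, blank everything else
                if (c != '\n' && c != '\r' && c != '\t') {
                    chars[i] = ' ';
                }
                if (c == '>') {
                    inTag = false;        // the closing '>' was blanked too
                }
            }
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        String xml = "<a>hi</a>";
        // brackets make the padding visible
        System.out.println("[" + blankTags(xml) + "]");
    }
}
```

Because each character is either kept or replaced by exactly one space, the result always has the same length as the input, which is what keeps the text at its original offsets.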

Edit: If you want to support comments, processing instructions and/or CDATA sections, you'll need to recognize these explicitly too. Attribute values can unfortunately also include >. All this means a full-fledged implementation will be more complex than you'd like.

A regular transducer would be perfect for this task; but unfortunately those aren't exactly commonly found in class libraries...

Eamon Nerbonne
This basically works, but if you encounter a comment around some tags, e.g. <!-- <tag>Hello</tag> -->, the closing "-->" will stay there (and also the "Hello" inside the tag). But it should be good enough for my needs at the moment.
darklight
If string offsets weren't an issue, you could simply use the XPath expression "//text()" - padded with whitespace to fill the same string-length...
Eamon Nerbonne
+1  A: 
Pattern p = Pattern.compile("<[^>]+>[^<]*]+>");

In the spirit of You Can't Parse XML With Regexp, you do know that's not an adequate pattern for arbitrary XML, right? (It's perfectly valid to have a > character in an attribute value, for example, not to mention other non-tag constructs.)

I have found no simple way to get the length of the tags that match my regular expression.

Instead of using replaceAll, repeatedly call find on the Matcher. You can then read start/end to get the indexes to replace, or use the appendReplacement method on a buffer. For example:

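// m is the Matcher created in the question
StringBuffer b = new StringBuffer();
while (m.find()) {
    // one space per character of the match, so offsets stay unchanged
    String spaces = StringUtils.repeat(" ", m.end() - m.start());
    m.appendReplacement(b, spaces);
}
m.appendTail(b); // copy whatever follows the last match
stringWithXMLContent = b.toString();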

(StringUtils comes from Apache Commons. For more background and library-free alternatives see this question.)
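For readers who would rather avoid the Commons dependency, here is one possible library-free variant of the same find loop. It is a sketch under assumptions: the class name RegexTagBlanker and the simplified pattern <[^>]+> are chosen for illustration only (it is not the pattern from the question), and the caveat about regular expressions and arbitrary XML still applies. It uses start() and end() directly to overwrite the matched range in a char array.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexTagBlanker {

    // Simplified tag pattern, assumed for this example; not adequate for arbitrary XML.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    /** Overwrites every match of TAG with spaces, keeping the string length unchanged. */
    static String blankTags(String xml) {
        char[] chars = xml.toCharArray();
        Matcher m = TAG.matcher(xml);
        while (m.find()) {
            // start() is inclusive, end() is exclusive, matching Arrays.fill's range
            Arrays.fill(chars, m.start(), m.end(), ' ');
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        String xml = "<p class=\"x\">text</p>";
        String stripped = blankTags(xml);
        // the text content sits at the same index before and after
        System.out.println(xml.indexOf("text") + " " + stripped.indexOf("text"));
    }
}
```

Since the matches are overwritten in place, no replacement string has to be built at all, so the question of how to repeat a space n times does not arise.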

bobince
Thanks, this is exactly the functionality of the Matcher I have been looking for!
darklight
+1  A: 

Why not use an XML pull parser and simply echo everything you want to keep (e.g. character content) as you encounter it? Whenever you reach a start or end tag, work out its length from the name of the element plus any attributes it has, and write the appropriate number of spaces.

The SAX API also has callbacks for ignorable whitespace, so you can echo all whitespace that occurs in your document as well.

DaveJohnston
A: 

Hi

Maybe m.start() and m.end() can help.

m.start() => "The index of the first character matched"
m.end() => "The offset after the last character matched"

(m.end() - m.start())-2 and you know how many /s you need.

Sorry, I overlooked the post from bobince.