ansaurus

Question

Use RegExp to replace XML tags with whitespaces (in the length of the tags)

Answer 1

+4 A:

Since < and > characters always surround start and end tags in XML, this may be simpler with a straightforward state machine. Simply loop over all characters (in some writable form such as a char array, not an immutable string), and when you encounter a <, switch on "replacement mode" and replace every character with a space until you encounter a >. (Be sure to replace both the initial < and the closing >.)

If you care about layout, you may wish to avoid replacing tab characters and/or newline characters. If all you care about is overall string length, that obviously won't matter.
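A minimal sketch of that loop in Java, assuming the input fits in memory and contains only plain start and end tags (no comments, CDATA sections or > inside attribute values). The class and method names are made up for illustration; tabs and line breaks inside tags are left alone so the layout survives.

```java
final class TagBlanker {
    private TagBlanker() {}

    /** Overwrites everything from '<' through the next '>' with spaces. */
    static String blankTags(String xml) {
        char[] chars = xml.toCharArray();   // writable copy; Strings are immutable
        boolean inTag = false;              // the "replacement mode" flag
        for (int i = 0; i < chars.length; i++) {
            char c = chars[i];
            if (c == '<') {
                inTag = true;
            }
            if (inTag) {
                // keep tabs and line breaks so line/column positions survive
                if (c != '\t' && c != '\n' && c != '\r') {
                    chars[i] = ' ';
                }
                if (c == '>') {
                    inTag = false;          // the closing '>' was blanked above
                }
            }
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        // brackets make the leading and trailing padding visible
        System.out.println("[" + blankTags("<p class=\"x\">Hello</p>") + "]");
    }
}
```

The result always has the same length as the input, so any offset into the original string still points at the same character.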

Edit: If you want to support comments, processing instructions and/or CDATA sections, you'll need to recognize these explicitly too. Unfortunately, attribute values can also contain >. All this means a full-fledged implementation will be more complex than you'd like.
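To give a feel for the extra complexity, here is a hedged sketch that recognizes those constructs by hand. It is not a conforming XML scanner: DOCTYPE declarations with an internal subset are not handled, and the choice to blank comments and processing instructions entirely while keeping CDATA content is an assumption about what you want. All names are hypothetical.

```java
final class XmlMarkupBlanker {
    private XmlMarkupBlanker() {}

    static String blankMarkup(String xml) {
        char[] out = xml.toCharArray();
        int i = 0;
        while (i < xml.length()) {
            if (xml.charAt(i) != '<') {
                i++;                                        // ordinary text: keep
            } else if (xml.startsWith("<!--", i)) {
                i = blank(out, i, endOf(xml, "-->", i + 4)); // whole comment
            } else if (xml.startsWith("<![CDATA[", i)) {
                int close = xml.indexOf("]]>", i + 9);
                blank(out, i, i + 9);                       // opening delimiter only
                // content is text, so it stays; blank the closing delimiter
                i = close < 0 ? xml.length() : blank(out, close, close + 3);
            } else if (xml.startsWith("<?", i)) {
                i = blank(out, i, endOf(xml, "?>", i + 2));  // processing instruction
            } else {
                i = blank(out, i, endOfTag(xml, i));        // start or end tag
            }
        }
        return new String(out);
    }

    /** Index just past the terminator, or end of input if it is missing. */
    private static int endOf(String xml, String terminator, int from) {
        int idx = xml.indexOf(terminator, from);
        return idx < 0 ? xml.length() : idx + terminator.length();
    }

    /** Index just past the '>' that closes the tag, skipping quoted values. */
    private static int endOfTag(String xml, int start) {
        char quote = 0;                                     // 0 means "not in quotes"
        for (int i = start + 1; i < xml.length(); i++) {
            char c = xml.charAt(i);
            if (quote != 0) {
                if (c == quote) quote = 0;
            } else if (c == '"' || c == '\'') {
                quote = c;
            } else if (c == '>') {
                return i + 1;
            }
        }
        return xml.length();
    }

    /** Blanks out[from..to) except tabs and line breaks; returns to. */
    private static int blank(char[] out, int from, int to) {
        for (int i = from; i < to; i++) {
            if (out[i] != '\t' && out[i] != '\n' && out[i] != '\r') out[i] = ' ';
        }
        return to;
    }
}
```

With this version, a tag such as `<a t="1>2">` is blanked as a whole instead of ending at the > inside the attribute value.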

A regular transducer would be perfect for this task; but unfortunately those aren't exactly commonly found in class libraries...

Eamon Nerbonne 2009-08-26 13:30:11

This basically works, but if you encounter a comment around some tags (e.g. a comment wrapped around a tag that contains the text "Hello"), the closing "-->" will stay there (and also the "Hello" inside the tag). But it should be good enough for my needs at the moment.

darklight 2009-08-26 13:47:20

If string offsets weren't an issue, you could simply use the XPath expression "//text()" - padded with whitespace to fill the same string-length...
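For reference, a small sketch of the "//text()" idea using the JDK's built-in javax.xml.xpath API. It only collects the text nodes; the original offsets are lost, which is exactly why it does not fit when offsets matter. The class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class TextNodeLister {
    private TextNodeLister() {}

    /** Returns the value of every text node, in document order. */
    static List<String> textNodes(String xml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            NodeList nodes = (NodeList) xpath.evaluate(
                    "//text()",
                    new InputSource(new StringReader(xml)),
                    XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getNodeValue());
            }
            return result;
        } catch (XPathExpressionException e) {
            // malformed XML or a bad expression ends up here
            throw new IllegalArgumentException("could not evaluate //text()", e);
        }
    }
}
```

Joining the returned strings gives the document's character content with all markup removed.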

Eamon Nerbonne 2009-08-26 14:18:58

Answer 2

+1 A:

Pattern p = Pattern.compile("<[^>]+>[^<]*]+>");

In the spirit of You Can't Parse XML With Regexp, you do know that's not an adequate pattern for arbitrary XML, right? (It's perfectly valid to have a > character in an attribute value, for example, not to mention other non-tag constructs.)

I have found no simple way to get the length of the tags that match my regular expression.

Instead of using replaceAll, repeatedly call find on the Matcher. You can then read start/end to get the indexes to replace, or use the appendReplacement method on a buffer. For example:

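// p is the compiled Pattern; stringWithXMLContent is the input string
Matcher m = p.matcher(stringWithXMLContent);
StringBuffer b = new StringBuffer();
while (m.find()) {
    // one space for every character of the matched tag
    String spaces = StringUtils.repeat(" ", m.end() - m.start());
    m.appendReplacement(b, spaces);
}
m.appendTail(b);  // copy the text after the last match
stringWithXMLContent = b.toString();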

(StringUtils comes from Apache Commons. For more background and library-free alternatives see this question.)
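A self-contained variant of the same loop without Apache Commons might look like the sketch below. The pattern `<[^>]+>` is a deliberately naive stand-in chosen for illustration (it is not the pattern from the question and has the same limitations discussed above), and the helper `spaces` is a hypothetical replacement for StringUtils.repeat.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexTagBlanker {
    // Assumption: a naive "anything between < and >" pattern, for illustration only
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    private RegexTagBlanker() {}

    static String blankTags(String input) {
        Matcher m = TAG.matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // replace the match with as many spaces as it has characters
            m.appendReplacement(sb, spaces(m.end() - m.start()));
        }
        m.appendTail(sb);                    // text after the last match
        return sb.toString();
    }

    /** Library-free stand-in for StringUtils.repeat(" ", n). */
    static String spaces(int n) {
        char[] pad = new char[n];
        Arrays.fill(pad, ' ');
        return new String(pad);
    }

    public static void main(String[] args) {
        System.out.println("[" + blankTags("<p id=\"1\">Hello</p>") + "]");
    }
}
```

Because the replacement consists only of spaces, the special meaning of $ and \ in appendReplacement replacement strings is not a concern here.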

bobince 2009-08-26 13:55:15

Thanks, this is exactly the functionality of the Matcher I have been looking for!

darklight 2009-08-26 15:02:24

Answer 3

+1 A:

Why not use an XML pull parser and simply echo everything you want to keep (e.g. character content) as you encounter it? Whenever you reach a start or end tag, work out its length from the element name plus any attributes it has, and write the appropriate number of spaces.

The SAX API also has callbacks for ignorable whitespace, so you can also echo all whitespace that occurs in your document.
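A rough sketch of this idea with StAX (javax.xml.stream), chosen here as one pull parser that ships with the JDK; the answer does not name a specific API. The length calculation only matches the source if tags are written in a canonical form: single spaces between attributes, quoted values without entity references, no namespace prefixes, no self-closing tags, and no XML declaration or comments. Those are strong assumptions, and the names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class PullParserBlanker {
    private PullParserBlanker() {}

    static String blankTags(String xml) {
        StringBuilder out = new StringBuilder();
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (r.hasNext()) {
                switch (r.next()) {
                    case XMLStreamConstants.START_ELEMENT: {
                        // '<' + name + '>' ...
                        int len = 2 + r.getLocalName().length();
                        for (int i = 0; i < r.getAttributeCount(); i++) {
                            // ... plus ' ' + name + '=' + two quotes + value
                            len += 4 + r.getAttributeLocalName(i).length()
                                     + r.getAttributeValue(i).length();
                        }
                        pad(out, len);
                        break;
                    }
                    case XMLStreamConstants.END_ELEMENT:
                        pad(out, 3 + r.getLocalName().length()); // "</" + name + '>'
                        break;
                    case XMLStreamConstants.CHARACTERS:
                    case XMLStreamConstants.SPACE:
                        out.append(r.getText());                 // echo content
                        break;
                    default:
                        break;                                   // ignored in this sketch
                }
            }
            r.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
        return out.toString();
    }

    private static void pad(StringBuilder out, int n) {
        for (int i = 0; i < n; i++) out.append(' ');
    }
}
```

If the source deviates from that canonical form, the padded output drifts away from the original offsets, which is the main weakness of reconstructing tag lengths from parser events.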

DaveJohnston 2009-08-26 14:32:24

Answer 4

A:

Hi

Maybe m.start() and m.end() can help.

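m.start() => "The index of the first character matched"
m.end() => "The offset after the last character matched"

Compute (m.end() - m.start()) - 2 and you know how many spaces you need between the < and the >. (Drop the - 2 if the angle brackets themselves should be replaced as well.)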

2009-08-27 13:17:26

sorry, overlooked the post from bobince

2009-08-27 13:20:41
