The XML spec defines the subset of Unicode characters that are allowed in XML documents: http://www.w3.org/TR/REC-xml/#charsets.

How do I filter characters outside this subset out of a String in Java?

Simple test case:

  Assert.assertEquals("", filterIllegalXML("" + Character.valueOf((char) 2)));
+1  A: 

Use StringEscapeUtils.escapeXml(xml) from commons-lang. Note that it escapes the characters rather than filtering them.

Bozho
I am already using this method to escape entities (e.g. `<` to `&lt;`), but that's something different. The method doesn't seem to filter any illegal characters. It fails for my 'test case'.
Grzegorz Oledzki
Show the test case.
Bozho
As stated in the question: `assertEquals("", StringEscapeUtils.escapeXml(""+Character.valueOf((char) 2)));`
Grzegorz Oledzki
Ah, sorry. Well, I'm not sure there is a way for this character to get into the XML :) Perhaps commons-lang misses it. Actually, what is your version of commons-lang?
Bozho
@Bozho: My project is currently using 2.4, but I've just checked that in 2.5 too. There is no difference.
Grzegorz Oledzki
A: 

You can use regex (Regular Expression) to do the work.
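
The answer gives no pattern, so the following is only a sketch of one way it could look. It assumes XML 1.0 and Java 7 or later (for the `\x{...}` code point syntax). The class name `XmlRegexFilter` is made up for illustration; `filterIllegalXML` is the name used in the question's test case. The character class is a negation of the `Char` production from the spec linked in the question.

```java
import java.util.regex.Pattern;

final class XmlRegexFilter {
    // Negation of the XML 1.0 Char production:
    //   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    // Java regexes match by code point, so a valid surrogate pair is seen
    // as one supplementary character, while a lone surrogate is removed.
    private static final Pattern ILLEGAL_XML_10 = Pattern.compile(
        "[^\\x{9}\\x{A}\\x{D}\\x{20}-\\x{D7FF}\\x{E000}-\\x{FFFD}\\x{10000}-\\x{10FFFF}]");

    /** Returns s with every character that XML 1.0 forbids removed. */
    static String filterIllegalXML(String s) {
        return ILLEGAL_XML_10.matcher(s).replaceAll("");
    }
}
```

With this sketch, `XmlRegexFilter.filterIllegalXML("" + Character.valueOf((char) 2))` should return the empty string, which is what the test case in the question expects.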

Tom Brito
+1  A: 

This page includes a Java method for stripping out invalid XML characters.

Incidentally, escaping the characters is not a solution, since XML 1.0 does not allow the invalid characters in escaped form either. XML 1.1 does permit most control characters as character references, but still not U+0000.

Stephen C
+3  A: 

It's not trivial to find out all the invalid chars for XML. You need to call or reimplement XMLChar.isInvalid() from Xerces:

http://kickjava.com/src/org/apache/xerces/util/XMLChar.java.htm
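
If pulling in Xerces is not an option, a reimplementation could look roughly like the sketch below. It is not the Xerces code: it simply hard-codes the XML 1.0 `Char` ranges from the spec, and the names `XmlCharFilter` and `isLegalXml10` are invented for this example. It walks the string by code point so that supplementary characters survive and unpaired surrogates are dropped.

```java
final class XmlCharFilter {
    /** True if the code point matches the XML 1.0 Char production. */
    static boolean isLegalXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns s without the code points that XML 1.0 forbids. */
    static String filterIllegalXML(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            // codePointAt returns the bare surrogate value for an unpaired
            // surrogate, which falls outside every legal range above.
            int cp = s.codePointAt(i);
            if (isLegalXml10(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

For XML 1.1 the ranges differ (most control characters become legal as character references), so the predicate would need adjusting for that version.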

ZZ Coder
+1, nice find..
Bozho