Very similar to this question, except for Java.

What is the recommended way of encoding strings for XML output in Java? The strings might contain characters like "&", "<", etc.

+9  A: 

Very simply: use an XML library. That way it will actually be right instead of requiring detailed knowledge of bits of the XML spec.

Jon Skeet
Can you recommend such a library? (I find it surprising that this is not a standard part of Java edition 5...such a common task).
Tim Cooper
XML *is* part of the standard Java framework - look in org.xml.sax and org.w3c.dom. However, there are some easier-to-use frameworks around as well, such as JDOM. Note that there may not be an "encoding strings for XML output" method - I was more recommending that the whole XML task should be done with a library rather than just doing bits at a time with string manipulation.
Jon Skeet
This is not such useful advice when outputting XHTML - FlyingSaucer requires XML, but there ain't no way I'm templating through an XML lib :). Thankfully StringTemplate allows me to quickly escape all String objects.
Stephen
@Stephen: I would expect an XHTML library to use an XML library to keep everything sane, but expose an XHTML-centric API. Having to do escaping manually (and make sure you get it right *everywhere*) is not a great idea IMO.
Jon Skeet
To convert a DOM tree to an XML-string, use a transformer without a style sheet.
Thorbjørn Ravn Andersen
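The transformer approach from the last comment can be sketched roughly as follows. This is a minimal illustration using only the standard javax.xml APIs; the class name DomToString, the method itemXml and the element name "item" are made up for the example. The text is stored unescaped in the DOM tree, and the serializer takes care of escaping when the tree is written out.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper; class, method and element names are illustrative only.
final class DomToString {

    /** Builds an "item" element containing the given text and serializes it to a string. */
    static String itemXml(String text) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element item = doc.createElement("item");
        // The text node holds the raw string; no manual escaping is done here.
        item.appendChild(doc.createTextNode(text));
        doc.appendChild(item);

        // newTransformer() without arguments gives an identity transform (no style sheet).
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }
}
```

Calling DomToString.itemXml("a < b & c") should yield an item element whose content has the "<" and "&" replaced by entity references.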
+1  A: 

Use JAXP and forget about text handling; it will be done for you automatically.

Fernando Miguélez
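One way this could look, assuming Java 6 or later where the StAX streaming writer ships with the JDK as part of JAXP, is sketched below. The class name StaxExample, the method itemXml and the element and attribute names are hypothetical. Both the attribute value and the element text are passed in raw, and the writer escapes them.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical helper; names are illustrative only.
final class StaxExample {

    /** Writes an "item" element with a "title" attribute and the given text content. */
    static String itemXml(String title, String text) throws Exception {
        StringWriter out = new StringWriter();
        XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        writer.writeStartElement("item");
        // Attribute values are escaped by the writer, including quotes.
        writer.writeAttribute("title", title);
        // Character data is escaped by the writer as well.
        writer.writeCharacters(text);
        writer.writeEndElement();
        writer.close();
        return out.toString();
    }
}
```

With this approach there is no separate "escape this string" step to forget, which is the point of leaving the text handling to the library.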
+4  A: 

Just use a CDATA section:

<![CDATA[ your text here ]]>

This will allow any characters except the closing delimiter:

]]>

So you can include characters that would otherwise need escaping, such as & and >. For example:

<element><![CDATA[ characters such as & and > are allowed ]]></element>

However, attribute values will still need to be escaped, as CDATA sections cannot be used in them.

ng
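If the CDATA route is taken, the one sequence that cannot appear inside the section has to be handled. A minimal sketch (the class name CdataExample and the method toCdata are made up for the example) splits any "]]>" in the text across two adjacent CDATA sections, which a parser will read back as the original text:

```java
// Hypothetical helper; names are illustrative only.
final class CdataExample {

    /** Wraps text in a CDATA section, splitting any "]]>" so it cannot end the section early. */
    static String toCdata(String text) {
        // "]]>" is rewritten as "]]", end of section, start of a new section, then ">".
        return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>";
    }
}
```

For example, toCdata("a & b") returns <![CDATA[a & b]]>. Note that this does nothing about characters that are not allowed in XML at all, such as most control characters.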
In most cases, that is not what you should do. Too many people abuse the CDATA tags. The intent of the CDATA is to tell the processor not to process it as XML and just pass it through. If you are trying to create an XML file, then you should be creating XML, not just passing bytes through some wrapping element.
Mads Hansen
@Mads, using CDATA results in a valid XML file so it is just as fine as doing it the "right way". If you dislike it, then parse it afterwards, identity transform it, and print it.
Thorbjørn Ravn Andersen
+9  A: 

As others have mentioned, using an XML library is the easiest way. If you do want to escape yourself, you could look into StringEscapeUtils from the Apache Commons Lang library.

Fabian Steeg
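For readers who want to see what escaping by hand amounts to, here is a minimal sketch that replaces the five predefined XML entities. It is not the Commons Lang implementation; the class name XmlEscaper and the method escapeXml are hypothetical, and the sketch makes no attempt to reject characters that are illegal in XML.

```java
// Hypothetical helper; this is an illustration, not the StringEscapeUtils code.
final class XmlEscaper {

    /** Replaces &, <, >, " and ' with the corresponding predefined entity references. */
    static String escapeXml(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

Escaping the quote characters as well means the result can be used in attribute values, not only in element text.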
This could be the way to go if you don't care about absolute correctness, for example if you are putting together a prototype.
Chase Seibert
+1 Handy suggestion, thanks
Jon
Thanks for pointing this out!
simon
+3  A: 
Aaron Digulla
+1  A: 

This has worked well for me to provide an escaped version of a text string:

public class XMLHelper {

/**
 * Returns the string where all non-ASCII characters, control characters and <, &, > are encoded as numeric
 * character references. For example, "<A & B >" becomes "&#60;A &#38; B &#62;". The result is safe to include
 * in element text in an XML string. If there were no characters to protect, the original string is returned.
 * Note: XML 1.0 does not allow control characters other than tab, LF and CR, even as references.
 *
 * @param originalUnprotectedString
 *            original string which may contain characters either reserved in XML or with different representation
 *            in different encodings (like ISO-8859-1 and UTF-8)
 * @return the protected string, or null if the input was null
 */
public static String protectSpecialCharacters(String originalUnprotectedString) {
    if (originalUnprotectedString == null) {
        return null;
    }
    boolean anyCharactersProtected = false;

    // StringBuilder is enough here because the buffer never leaves this method
    StringBuilder stringBuilder = new StringBuilder();
    // Iterate over code points, not chars, so a surrogate pair becomes one reference
    for (int i = 0; i < originalUnprotectedString.length(); ) {
        int codePoint = originalUnprotectedString.codePointAt(i);
        i += Character.charCount(codePoint);

        boolean controlCharacter = codePoint < 32;
        boolean unicodeButNotAscii = codePoint > 126;
        boolean characterWithSpecialMeaningInXML = codePoint == '<' || codePoint == '&' || codePoint == '>';

        if (characterWithSpecialMeaningInXML || unicodeButNotAscii || controlCharacter) {
            stringBuilder.append("&#").append(codePoint).append(';');
            anyCharactersProtected = true;
        } else {
            stringBuilder.appendCodePoint(codePoint);
        }
    }
    if (!anyCharactersProtected) {
        return originalUnprotectedString;
    }

    return stringBuilder.toString();
}

}
Thorbjørn Ravn Andersen
+3  A: 

While idealism says use an XML library, IMHO if you have a basic idea of XML then common sense and performance say template it all the way. It's arguably more readable too. Using the escaping routines of a library is probably a good idea, though.

Consider this: XML was meant to be written by humans.

Use libraries for generating XML when having your XML as an "object" better models your problem. For example, if pluggable modules participate in the process of building this XML.

Edit: as for how to actually escape XML in templates, CDATA or escapeXml(string) from JSTL are two good solutions. escapeXml(string) can be used like this:

<%@taglib prefix="fn" uri="http://java.sun.com/jsp/jstl/functions"%>

<item>${fn:escapeXml(value)}</item>
Amr Mostafa