




Using Java, I want to go through the lines of a text and replace all ampersand symbols (&) with the XML entity reference &amp;.

I scan the lines of the text, and then each word in each line, with the Scanner class. Then I use a CharacterIterator to iterate over each character of the word. However, how can I replace the character? First, Strings are immutable objects. Second, I want to replace a single character (&) with several characters (&amp;). How should I approach this?

CharacterIterator it = new StringCharacterIterator(token);
for(char ch = it.first(); ch != CharacterIterator.DONE; ch = it.next()) {
       if(ch == '&') {

Try using String.replaceAll() instead.

String my_new_str = my_str.replaceAll("&", "&amp;");
Be careful with replaceAll, because it treats its first argument as a regular expression. For example, "h.e.l.l.o".replaceAll(".", ",") will give you ",,,,,,,,,"! Java 1.5 added a String.replace(CharSequence, CharSequence) method, which does something similar but doesn't interpret its first argument as a regular expression.
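To make the difference concrete, here is a small sketch comparing the two methods. The class and method names (ReplaceDemo, regexReplace, and so on) are made up for illustration; only the String and Pattern calls are standard JDK API.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class ReplaceDemo {

    // Regex-based: "." matches any character, so every character is replaced.
    static String regexReplace(String s) {
        return s.replaceAll(".", ",");
    }

    // Literal: only actual '.' characters are replaced.
    static String literalReplace(String s) {
        return s.replace(".", ",");
    }

    // Regex-based, but with the pattern quoted so it is matched literally.
    static String quotedRegexReplace(String s) {
        return s.replaceAll(Pattern.quote("."), ",");
    }

    public static void main(String[] args) {
        String input = "h.e.l.l.o";
        System.out.println(regexReplace(input));           // ,,,,,,,,,
        System.out.println(literalReplace(input));         // h,e,l,l,o
        System.out.println(quotedRegexReplace(input));     // h,e,l,l,o
        System.out.println("a & b".replace("&", "&amp;")); // a &amp; b
    }
}
```

For the ampersand case both methods happen to give the same result, since & has no special meaning in a regular expression; the literal replace simply avoids the question.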
StringBuffer s = new StringBuffer(token.length());

CharacterIterator it = new StringCharacterIterator(token);
for (char ch = it.first(); ch != CharacterIterator.DONE; ch = it.next()) {
    switch (ch) {
        case '&':
        case '<':
        case '>':

token = s.toString();
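A self-contained sketch of this character-by-character approach might look like the following. It assumes the three cases map to the predefined XML entities &amp;, &lt; and &gt;, and that every other character is copied unchanged; it also swaps in StringBuilder for StringBuffer. The class and method names (XmlTextEscaper, escape) are invented for the example.

```java
import java.text.CharacterIterator;
import java.text.StringCharacterIterator;

// Hypothetical helper; the class and method names are illustrative only.
class XmlTextEscaper {

    // Copies token into a new string, replacing &, < and > with XML entities.
    static String escape(String token) {
        StringBuilder s = new StringBuilder(token.length());
        CharacterIterator it = new StringCharacterIterator(token);
        // Note: CharacterIterator.DONE is '\uFFFF', so a token that contains
        // that character would end the loop early.
        for (char ch = it.first(); ch != CharacterIterator.DONE; ch = it.next()) {
            switch (ch) {
                case '&':
                    s.append("&amp;");
                    break;
                case '<':
                    s.append("&lt;");
                    break;
                case '>':
                    s.append("&gt;");
                    break;
                default:
                    // Assumption: all other characters pass through unchanged.
                    s.append(ch);
                    break;
            }
        }
        return s.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("R&D <dept>")); // R&amp;D &lt;dept&gt;
    }
}
```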
Just create a string that contains all of the data in question and then use String.replaceAll() as shown below.

String result = yourString.replaceAll("&", "&amp;");
Have a look at this method.

Notice the parameter types of replace(char, char): it does single-character substitution.
If you're using Spring, you can simply call HtmlUtils.htmlEscape(String input), which will handle the '&' to '&amp;' translation.

That is risky because HTML has many more entities defined than pure XML.
Escaping strings can be tricky, especially if you want to take Unicode into account. XML is probably one of the simpler formats to escape, but it still has pitfalls. I would recommend taking a look at the StringEscapeUtils class in Apache Commons Lang, and its handy escapeXml method.

You may also want to check that you're not replacing an occurrence that has already been replaced. You can use a regular expression with negative lookahead to do this.

For example:

String str = "sdasdasa&amp;adas&dasdasa";
str = str.replaceAll("&(?!amp;)", "&amp;");

This would result in the string "sdasdasa&amp;adas&amp;dasdasa".

The regex pattern "&(?!amp;)" basically says: Match any occurrence of '&' that is not followed by 'amp;'.
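The same lookahead idea can be stretched to skip other references that are already escaped. The sketch below is one possible variation, not part of the answer above: the class name, method name, and the exact list of entities (the five predefined XML entities plus numeric character references) are assumptions chosen for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper; names and the entity list are illustrative assumptions.
class BareAmpersandEscaper {

    // Matches '&' unless it already starts one of the five predefined XML
    // entities or a decimal/hexadecimal character reference.
    private static final Pattern BARE_AMP =
            Pattern.compile("&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9a-fA-F]+);)");

    static String escapeBareAmpersands(String s) {
        // The replacement contains no '$' or '\', so it is inserted literally.
        return BARE_AMP.matcher(s).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        // prints sdasdasa&amp;adas&amp;dasdasa
        System.out.println(escapeBareAmpersands("sdasdasa&amp;adas&dasdasa"));
    }
}
```

Keep in mind that this is a heuristic: if the input is raw text that legitimately contains the characters "&amp;", skipping them leaves that text under-escaped.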

The simple answer is:

token = token.replace("&", "&amp;");

Despite the name, replace replaces every occurrence, just as replaceAll does; it simply doesn't use a regular expression. That is what you want here, both for performance and as good practice: don't use regular expressions by accident, because they give special meaning to characters you won't be paying attention to.

If you already know this code is a performance hot spot, and that is where your question is coming from, Sean Bright's answer is probably as good as is worth thinking about, absent more specific performance requirements and performance testing. It certainly doesn't deserve the downvotes. Just use StringBuilder instead of StringBuffer unless you need the synchronization.

That being said, there is a somewhat deeper potential problem here. Escaping characters is a known problem which lots of libraries out there address. You may want to consider wrapping the data in a CDATA section in the XML, or you may prefer to use an XML library (including the one that comes with the JDK now) to actually generate the XML properly (so that it will handle the encoding).
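As a rough illustration of letting an XML library do the escaping, the sketch below uses the StAX writer (javax.xml.stream) that ships with the JDK. The class name, method name, and element name are made up for the example, and checked exceptions are wrapped only to keep it short.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical demo class; the names are illustrative only.
class XmlWriterDemo {

    // Builds <name>text</name>, leaving the escaping of text to the writer.
    static String element(String name, String text) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter writer =
                    XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            writer.writeStartElement(name);
            writer.writeCharacters(text); // special characters are escaped here
            writer.writeEndElement();
            writer.flush();
            writer.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not write XML", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The ampersand in the text is written as &amp; in the output.
        System.out.println(element("note", "Tom & Jerry"));
    }
}
```

With this approach the calling code never builds entity references by hand, so there is nothing to double-escape later.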

Apache also has an escaping library as part of Commons Lang.
