Using Java, I want to go through the lines of a text and replace all ampersand symbols (&) with the XML entity reference &amp;.

I scan the lines of the text, and then each word in each line, with the Scanner class. Then I use a CharacterIterator to iterate over each character of the word. However, how can I replace the character? First, Strings are immutable objects. Second, I want to replace a single character (&) with several characters (&amp;). How should I approach this?

CharacterIterator it = new StringCharacterIterator(token);
for(char ch = it.first(); ch != CharacterIterator.DONE; ch = it.next()) {
       if(ch == '&') {

       }
}
+15  A: 

Try using String.replaceAll() instead.

String my_new_str = my_str.replaceAll("&", "&amp;");
Amber
Be careful with replaceAll, because it treats its first argument as a regular expression. I.e. "h.e.l.l.o".replaceAll(".", ",") will give you ",,,,,,,,,"! In Java 1.5 there is a new String.replace(CharSequence, CharSequence) method, which does something similar but doesn't interpret its first argument as a regular expression.
Peter Štibraný
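A minimal sketch of the difference described in the comment above. The class and method names are made up for illustration; only String.replaceAll, String.replace and Pattern.quote are real JDK calls.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class ReplaceVersusReplaceAll {

    // Regex-based: "." matches any character, so every character is replaced.
    static String withRegex(String s) {
        return s.replaceAll(".", ",");
    }

    // Literal: replace(CharSequence, CharSequence) replaces every occurrence
    // of the exact text "." and nothing else.
    static String literal(String s) {
        return s.replace(".", ",");
    }

    // Regex-based, but with the search text quoted so it is matched literally.
    static String quotedRegex(String s) {
        return s.replaceAll(Pattern.quote("."), ",");
    }

    public static void main(String[] args) {
        System.out.println(withRegex("h.e.l.l.o"));      // ,,,,,,,,,
        System.out.println(literal("h.e.l.l.o"));        // h,e,l,l,o
        System.out.println(quotedRegex("h.e.l.l.o"));    // h,e,l,l,o
        System.out.println("a & b".replace("&", "&amp;")); // a &amp; b
    }
}
```

For the ampersand case both methods give the same result, because "&" has no special meaning in a regular expression and "&amp;" contains neither "$" nor "\", the two characters that are special in a replaceAll replacement string.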
+1 for a good reminder.
Amber
@Dav, we don't need your reason for upvoting
Steve Kuo
It's just another way of saying thanks to Peter. We don't need your commentary either, but you're still welcome to give it.
Amber
Dav you have really lucked out on the points with this answer!
Adamski
It just seems like clutter (and not relevant to the topic) to have everyone say (+1 because blah blah blah) when they've already up-voted. That's what the voting is for. SO is starting to degrade into reddit.
Steve Kuo
+1  A: 
StringBuffer s = new StringBuffer(token.length());

CharacterIterator it = new StringCharacterIterator(token);
for (char ch = it.first(); ch != CharacterIterator.DONE; ch = it.next()) {
    switch (ch) {
        case '&':
            s.append("&amp;");
            break;
        case '<':
            s.append("&lt;");
            break;
        case '>':
            s.append("&gt;");
            break;
        default:
            s.append(ch);
            break;
    }
}

token = s.toString();
Sean Bright
You shouldn't need a StringBuffer in this scenario.
Taylor Leese
Using a String instead would result in the creation of a temporary String object per iteration. I'm not sure what alternative you would suggest.
Sean Bright
string.replaceAll?
IRBMe
Are we really assuming that the OP knows about `CharacterIterator` and not `String.replaceAll()`?
Sean Bright
+1: Not sure why this received 2 downvotes - It's likely to be far more efficient than replaceAll() - After all why use regular expressions when simply matching on a single character?
Adamski
Your example solution would need a StringBuffer but the solution to the general problem does not require one.
Taylor Leese
@Taylor L - I guess we just disagree that the question, as asked, is a "general problem."
Sean Bright
Adamski
@Adamski - I was just going to do that performance test myself. Thanks for doing the leg work for me!
Sean Bright
Why complicate the code significantly by prematurely optimizing? Especially when the performance increase is so tiny. Make it work right first, make it readable and maintainable and only after you've done that, if you find you have a performance problem and have profiled your code to pinpoint the exact problem, should you worry about doing microoptimizations like this.
IRBMe
It wasn't premature optimization - it was my answer to the question. It just also happens to be faster than `String.replaceAll()`, but that wasn't the reason for suggesting it.
Sean Bright
+2  A: 

Just create a string that contains all of the data in question and then use String.replaceAll() like below.

String result = yourString.replaceAll("&", "&amp;");
Taylor Leese
A: 

Have a look at this method.

IRBMe
Notice the parameters types to replace(char,char) - it does single-character substitution.
Amber
Yeah yeah, fixed immediately after posted.
IRBMe
I think you need to indent the [1] on your link to get it to linkify... maybe?
Mike Cooper
A: 

If you're using Spring you can simply call HtmlUtils.htmlEscape(String input), which will handle the '&' to '&amp;' translation.

Adamski
That is risky because HTML has many more entities defined than pure XML.
Christian Vest Hansen
+1  A: 

Escaping strings can be tricky - especially if you want to take Unicode into account. XML is probably one of the simpler formats to escape, but it still has pitfalls. I would recommend taking a look at the StringEscapeUtils class in Apache Commons Lang, and its handy escapeXml method.

Christian Vest Hansen
+1  A: 

You may also want to check that you're not replacing an occurrence that has already been replaced. You can use a regular expression with negative lookahead to do this.

For example:

String str = "sdasdasa&amp;adas&dasdasa";
str = str.replaceAll("&(?!amp;)", "&amp;");

This would result in the string "sdasdasa&amp;adas&amp;dasdasa".

The regex pattern "&(?!amp;)" basically says: Match any occurrence of '&' that is not followed by 'amp;'.

Robert Durgin
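A hedged sketch building on the lookahead idea above. The pattern "&(?!amp;)" only recognizes &amp;, so text that already contains &lt; or &#38; would still be escaped a second time. The broader pattern below is an assumption, not part of the original answer: it skips any "&" that already starts something shaped like a named or numeric reference. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the broader pattern is an illustrative assumption.
class EscapeOnce {

    // Matches "&" unless it is followed by a name, "#digits" or "#xhex",
    // and then ";" (i.e. unless it already looks like a reference).
    private static final Pattern BARE_AMPERSAND = Pattern.compile(
            "&(?!(?:#[0-9]+|#x[0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);)");

    static String escapeBareAmpersands(String s) {
        return BARE_AMPERSAND.matcher(s).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        String once = escapeBareAmpersands("sdasdasa&amp;adas&dasdasa");
        System.out.println(once);                       // sdasdasa&amp;adas&amp;dasdasa
        System.out.println(escapeBareAmpersands(once)); // unchanged on a second pass
        System.out.println(escapeBareAmpersands("x &lt; y & z")); // x &lt; y &amp; z
    }
}
```

Note the trade-off: any approach of this kind guesses whether the input was already escaped, so literal text such as "&copy;" that was meant to appear verbatim will be left alone rather than escaped.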
+6  A: 

The simple answer is:

token = token.replace("&", "&amp;");

Despite the name, replace does replace every occurrence, just as replaceAll does; it simply doesn't treat its first argument as a regular expression. That is what you want here, both for performance and as good practice: don't use regular expressions by accident, since they give special meaning to characters you won't be paying attention to.

Sean Bright's answer is probably as far as it is worth going from a performance perspective, unless you have a specific performance target and measurements showing that this code is a hot spot, if that is where your question is coming from. It certainly doesn't deserve the downvotes. Just use StringBuilder instead of StringBuffer unless you need the synchronization.
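As a rough sketch of that suggestion, here is the same character-by-character loop written with StringBuilder. The class and method names are made up for illustration, and indexing with charAt instead of CharacterIterator is a substitution on my part: CharacterIterator.DONE is '\uFFFF', so an iterator loop would stop early if the text ever contained that character.

```java
// Hypothetical helper; a StringBuilder variant of the loop-based answer.
class AmpersandEscaper {

    static String escape(String token) {
        // Leave some headroom, since each escaped character grows the output.
        StringBuilder sb = new StringBuilder(token.length() + 16);
        for (int i = 0; i < token.length(); i++) {
            char ch = token.charAt(i);
            switch (ch) {
                case '&':
                    sb.append("&amp;");
                    break;
                case '<':
                    sb.append("&lt;");
                    break;
                case '>':
                    sb.append("&gt;");
                    break;
                default:
                    sb.append(ch);
                    break;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("a & b < c")); // a &amp; b &lt; c
    }
}
```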

That being said, there is a somewhat deeper potential problem here. Escaping characters is a known problem which lots of libraries out there address. You may want to consider wrapping the data in a CDATA section in the XML, or you may prefer to use an XML library (including the one that comes with the JDK now) to actually generate the XML properly (so that it will handle the encoding).
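One way to let the JDK do the escaping, sketched with the StAX XMLStreamWriter from javax.xml.stream. The class name, method name and element name are hypothetical, and this is only one of several XML APIs in the JDK that could be used here.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical helper: writes <name>text</name> and lets the writer escape the text.
class XmlWriterSketch {

    static String element(String name, String text) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter writer =
                    XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            writer.writeStartElement(name);
            writer.writeCharacters(text); // reserved characters are escaped here
            writer.writeEndElement();
            writer.close();
        } catch (XMLStreamException e) {
            // Wrapped so callers of this sketch need no checked exception handling.
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The ampersand in the text is emitted as &amp; in the output.
        System.out.println(element("title", "Tom & Jerry"));
    }
}
```

With this approach the calling code never builds entity references by hand, so there is no risk of escaping the same text twice.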

Apache also has an escaping library as part of Commons Lang.

Yishai