I would like to replace a certain set of characters in a string with corresponding replacement characters in an efficient way.

For example:

String sourceCharacters = "šđćčŠĐĆČžŽ";
String targetCharacters = "sdccSDCCzZ";

String result = replaceChars("Gračišće", sourceCharacters, targetCharacters);

assert result.equals("Gracisce");

Is there a more efficient way than using the replaceAll method of the String class?

My first idea was:

final String s = "Gračišće";
String sourceCharacters = "šđćčŠĐĆČžŽ";
String targetCharacters = "sdccSDCCzZ";

// preparation
final char[] sourceString = s.toCharArray();
final char result[] = new char[sourceString.length];
final char[] targetCharactersArray = targetCharacters.toCharArray();

// main work
for(int i=0,l=sourceString.length;i<l;++i)
{
  final int pos = sourceCharacters.indexOf(sourceString[i]);
  result[i] = pos!=-1 ? targetCharactersArray[pos] : sourceString[i];
}

// result
String resultString = new String(result);

Any ideas?

Btw, the UTF-8 characters are causing the trouble; with US-ASCII it works fine.

+6  A: 

You can make use of java.text.Normalizer and a shot of regex to get rid of the diacritics, of which there are many more than you have collected so far.

Here's an SSCCE, copy'n'paste'n'run it on Java 6:

package com.stackoverflow.q2653739;

import java.text.Normalizer;
import java.text.Normalizer.Form;

public class Test {

    public static void main(String... args) {
        System.out.println(removeDiacriticalMarks("Gračišće"));
    }

    public static String removeDiacriticalMarks(String string) {
        return Normalizer.normalize(string, Form.NFD)
            .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }
}

This should yield

Gracisce

At least, it does here in Eclipse with the console character encoding set to UTF-8 (Window > Preferences > General > Workspace > Text File Encoding). Ensure that the same is set in your environment as well.
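
To illustrate what the normalization step does, here is a minimal sketch (not part of the original answer; the class name NfdDemo is made up for illustration). It writes the characters as Unicode escapes so that the source file encoding cannot interfere, and it shows that NFD splits a precomposed letter into a base letter plus a combining mark, which the regex then removes. It also shows a limitation: a letter such as đ (U+0111) has no canonical decomposition, so this technique leaves it untouched.

```java
import java.text.Normalizer;

final class NfdDemo {

    // Same technique as above: decompose, then strip the combining marks.
    static String removeDiacriticalMarks(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
            .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) {
        String precomposed = "\u0161"; // s with caron as a single code point
        String decomposed = Normalizer.normalize(precomposed, Normalizer.Form.NFD);

        System.out.println(precomposed.length()); // 1
        System.out.println(decomposed.length());  // 2: 's' followed by a combining caron

        // d with stroke (U+0111) has no canonical decomposition, so it is left alone.
        System.out.println(removeDiacriticalMarks("\u0111").equals("\u0111")); // true

        // "Gra\u010Di\u0161\u0107e" is the example word from the question.
        System.out.println(removeDiacriticalMarks("Gra\u010Di\u0161\u0107e")); // Gracisce
    }
}
```

Characters like đ are the reason an explicit mapping may still be needed alongside (or instead of) normalization.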

As an alternative, maintain a Map<Character, Character>:

Map<Character, Character> charReplacementMap = new HashMap<Character, Character>();
charReplacementMap.put('š', 's');
charReplacementMap.put('đ', 'd');
// Put more here.

String originalString = "Gračišće";
StringBuilder builder = new StringBuilder();

for (char currentChar : originalString.toCharArray()) {
    Character replacementChar = charReplacementMap.get(currentChar);
    builder.append(replacementChar != null ? replacementChar : currentChar);
}

String newString = builder.toString();
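
The map approach above could be wrapped into something close to the replaceChars call from the question. The following is only a sketch under a few assumptions: the class name CharReplacer and the buildMap helper are hypothetical, the two strings are assumed to have equal length, and only characters that fit in a single Java char (the Basic Multilingual Plane) are handled. The map is built once and can be reused, so each input string is processed in a single pass.

```java
import java.util.HashMap;
import java.util.Map;

final class CharReplacer {

    // Builds a lookup table: the i-th source character maps to the i-th target character.
    static Map<Character, Character> buildMap(String sourceCharacters, String targetCharacters) {
        if (sourceCharacters.length() != targetCharacters.length()) {
            throw new IllegalArgumentException("source and target must have the same length");
        }
        Map<Character, Character> map = new HashMap<Character, Character>();
        for (int i = 0; i < sourceCharacters.length(); i++) {
            map.put(sourceCharacters.charAt(i), targetCharacters.charAt(i));
        }
        return map;
    }

    // Replaces each mapped character in one pass; unmapped characters are copied unchanged.
    static String replaceChars(String input, Map<Character, Character> map) {
        StringBuilder builder = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            Character replacement = map.get(c);
            builder.append(replacement != null ? replacement.charValue() : c);
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        // The question's source characters, written as Unicode escapes so the
        // result does not depend on the encoding of the source file.
        String sourceCharacters = "\u0161\u0111\u0107\u010D\u0160\u0110\u0106\u010C\u017E\u017D";
        String targetCharacters = "sdccSDCCzZ";

        Map<Character, Character> map = buildMap(sourceCharacters, targetCharacters);
        System.out.println(replaceChars("Gra\u010Di\u0161\u0107e", map)); // Gracisce
    }
}
```

If the input may arrive in decomposed form (base letter plus combining mark), normalizing it to NFC first would be needed for the lookup to match.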
BalusC
with this solution i get: GraA?iA¡Ae. and btw, i'd like to replace not only diacritic characters but some others of other languages too. so i really would like to know a solution that works for an arbitrary mapping.
ManBugra
Exactly. The problem is that the diacritics are sometimes combined, sometimes not, and string character-by-character replace gets confused because there are actually two characters, not one.
Mr. Shiny and New
@Mr. Shiny and New: yes, System.out.println("š".toCharArray().length); outputs '2'
ManBugra
@Mr. Shiny and @ManBugra: The `.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");` should take care of removing the combining diacritical marks. Maybe you removed this line? Or you're running an ancient Java version? The above has worked fine for years here and it works for an arbitrary mapping, except for certain Polish characters such as an l with a stroke through it, since that's not a diacritic.
BalusC
@BalusC: java1.6 on Vista using IntelliJ IDEA, and sorry, i just cant get it working. can you please edit your post and add the imports?
ManBugra
Done. It's by the way the IDE console which needs to be set to UTF-8. I tried to reproduce here with the console set to ISO-8859-1 and I got the same as you.
BalusC
@BalusC: yes, console settings was f*d up. it works now. but still, i need a function for an arbitrary character mapping.
ManBugra
I edited it in.
BalusC
A: 

I'd use the replace method in a simple loop.

String sourceCharacters = "šđćčŠĐĆČžŽ";
String targetCharacters = "sdccSDCCzZ";

String s = "Gračišće";
for (int i=0 ; i<sourceCharacters.length() ; i++)
    s = s.replace(sourceCharacters.charAt(i), targetCharacters.charAt(i));

System.out.println(s);
Donal Fellows
each iteration would create a new string object. would be nice to do it 'in place'
ManBugra
Firstly, each iteration only makes a new object if a change is done; if the character being searched for isn't there, the original object is returned. Secondly, it's *far* more annoying to write this code using `StringBuilder` or `StringBuffer` as you have to do all the work yourself; since Java's memory management is tuned for rapid object turnover anyway, it's easier to do it the way I showed instead of trying to figure out how to be efficient. You can always optimize later if really necessary (i.e., if it is a real bottleneck).
Donal Fellows
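
The first point in the comment above can be checked directly: the Javadoc of String.replace(char, char) states that a reference to the same String object is returned when the character to replace does not occur. A small sketch follows (the class name ReplaceLoopDemo is made up for illustration, and the loop is the one from the answer wrapped in a method):

```java
final class ReplaceLoopDemo {

    // The loop from the answer: one replace call per source character.
    static String replaceChars(String s, String sourceCharacters, String targetCharacters) {
        for (int i = 0; i < sourceCharacters.length(); i++) {
            s = s.replace(sourceCharacters.charAt(i), targetCharacters.charAt(i));
        }
        return s;
    }

    public static void main(String[] args) {
        String original = "Gracisce";

        // No s-with-caron in the input, so no new String is created.
        String same = original.replace('\u0161', 's');
        System.out.println(same == original); // true

        String sourceCharacters = "\u0161\u0111\u0107\u010D\u0160\u0110\u0106\u010C\u017E\u017D";
        String targetCharacters = "sdccSDCCzZ";
        System.out.println(
            replaceChars("Gra\u010Di\u0161\u0107e", sourceCharacters, targetCharacters)); // Gracisce
    }
}
```

For the example word, only three of the ten iterations find a match, so only three intermediate strings are allocated.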
@Donal Fellows: yes, you are right on your first point, but i don't agree with your second. you write efficient code once, even if it's annoying, and then reuse it. anyway, BalusC solved the riddle.
ManBugra