I have a strange problem replacing characters in a string...

I read a .txt file containing Russian text and, starting from a list of Russian-to-English letter mappings (ru=en), I loop over the list and would like to replace the Russian characters with English characters.

The problem is: in the debugger I can see that both the Russian and the English characters are read correctly, but after myWord = myWord.Replace(ruChar, enChar) the string is unchanged.

My .txt file is UTF-8 encoded.

A: 

Don't -1 me if this doesn't work; I'm just guessing that you must convert the English string you want to substitute to UTF-8, for example:

// Note: Encoding.ASCII.GetBytes() turns any non-ASCII character into '?'.
myWord = Encoding.UTF8.GetString(Encoding.ASCII.GetBytes(myWord));
myWord = myWord.Replace("слово", Encoding.UTF8.GetString(Encoding.ASCII.GetBytes("letter")));

I'm assuming that myWord is in ASCII, so the first line of code converts it to a UTF-8 string; leave it out if myWord is already UTF-8.

The second line converts the English word to UTF-8 so it can replace the Russian word.

Cipi
A: 

Very strange

Console.WriteLine("слово".Replace("слово", "word")); // prints 'word'

Works as expected. Maybe that's because I have set Russian as the system language for non-Unicode programs...

abatishchev
Doesn't work for me... I did the same in Serbian... Well system setting I guess. =D And, SLOVO means LETTER, not WORD. =p
Cipi
Slovo means exactly word, Bukva is letter ...
Radoslav Hristov
@Cipi: In Serbian, yes, it does. In Russian: слово (word), буква (letter) :)
abatishchev
REALLY!!!!!!???? =D Damn my Russian sucks... I apologize. =|
Cipi
+2  A: 

String.Replace() is going to be horribly inefficient: you'll have to call it once for each possible Cyrillic letter you'd want to replace. Use a Dictionary instead (no pun intended). For example:

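    // Must be real Cyrillic letters, starting with Cyrillic "Аа" (not the look-alike Latin "Aa").
    private const string Cyrillic = "АаБбВвГг...";
    // One '|'-separated entry per Cyrillic letter above, in the same order.
    private const string Latin = "A|a|B|b|V|v|G|g|...";
    private Dictionary<char, string> mLookup;

    public string Romanize(string russian) {
        // Build the lookup table lazily on first use.
        if (mLookup == null) {
            mLookup = new Dictionary<char, string>();
            var replace = Latin.Split('|');
            for (int ix = 0; ix < Cyrillic.Length; ++ix) {
                mLookup.Add(Cyrillic[ix], replace[ix]);
            }
        }
        // Copy the input, substituting every character that has a mapping.
        var buf = new StringBuilder(russian.Length);
        foreach (char ch in russian) {
            if (mLookup.ContainsKey(ch)) buf.Append(mLookup[ch]);
            else buf.Append(ch);
        }
        return buf.ToString();
    }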

Note that the bars and the Split() call are necessary in the Latin replacement string because some Cyrillic letters need more than one Latin letter for their transliteration. The key idea is to use a dictionary for fast lookup and a string builder for fast string construction.

This United Nations document might be helpful.

Hans Passant
Purely nitpicking, but TryGetValue() would be more suited than ContainsKey() there, I think
ohadsc
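
To illustrate the TryGetValue() suggestion, here is a minimal sketch of the same loop. The class name RomanizerSketch and the five-entry table are hypothetical and only there to keep the example short; a real table would need the whole alphabet. The point is that TryGetValue() looks the key up once, while ContainsKey() followed by the indexer looks it up twice.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical helper class, not part of the answer above.
public static class RomanizerSketch
{
    // Deliberately tiny, illustrative table (assumption: extend it to the full alphabet).
    private static readonly Dictionary<char, string> Lookup = new Dictionary<char, string>
    {
        { 'с', "s" }, { 'л', "l" }, { 'о', "o" }, { 'в', "v" }, { 'ж', "zh" }
    };

    public static string Romanize(string russian)
    {
        var buf = new StringBuilder(russian.Length);
        foreach (char ch in russian)
        {
            string latin;
            // Single dictionary lookup: returns true and sets latin when ch has a mapping.
            if (Lookup.TryGetValue(ch, out latin)) buf.Append(latin);
            // Characters without a mapping are copied through unchanged.
            else buf.Append(ch);
        }
        return buf.ToString();
    }
}
```

With this table, RomanizerSketch.Romanize("слово") should give "slovo", and any character missing from the table is passed through as-is.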