I have a string containing UTF-8 characters that needs to be output to an older system as UTF-7. I have two questions about this:

  1. How can I efficiently convert a string s that contains UTF-8 characters into the same string without those characters?

  2. Is there a simple way of converting extended characters like 'Ō' to their closest non-extended equivalent, 'O'?

+2  A: 

If the older system can actually handle UTF-7 properly, why do you want to remove anything? Just encode the string as UTF-7:

string text = LoadFromWherever(Encoding.UTF8);
byte[] utf7 = Encoding.UTF7.GetBytes(text);

Then send the UTF-7-encoded text down to the older system.

If you've got the original UTF-8-encoded bytes, you can do this in one step:

// utf8 is the byte[] holding the original UTF-8-encoded data
byte[] utf7 = Encoding.Convert(Encoding.UTF8, Encoding.UTF7, utf8);


If you actually need to convert to ASCII, you can do this reasonably easily.

To remove the non-ASCII characters:

var encoding = Encoding.GetEncoding
    ("us-ascii", new EncoderReplacementFallback(""), 
     new DecoderReplacementFallback(""));
byte[] ascii = encoding.GetBytes(text);

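To convert non-ASCII characters to their nearest ASCII equivalent, where one exists:

// FormKD decomposes 'Ō' into 'O' followed by a combining macron;
// the empty replacement fallback then drops the macron during encoding.
string normalized = text.Normalize(NormalizationForm.FormKD);
var encoding = Encoding.GetEncoding
    ("us-ascii", new EncoderReplacementFallback(""), 
     new DecoderReplacementFallback(""));
byte[] ascii = encoding.GetBytes(normalized);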
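
For comparison, here is a minimal self-contained sketch that runs both approaches on the same input. The class name AsciiDemo and the helpers StripNonAscii and FoldToAscii are hypothetical names introduced for illustration; only the Encoding and Normalize calls come from the snippets above.

```csharp
using System;
using System.Text;

public static class AsciiDemo
{
    // Hypothetical helper: an ASCII encoding that silently drops
    // anything it cannot represent instead of emitting '?'.
    private static Encoding LossyAscii()
    {
        return Encoding.GetEncoding(
            "us-ascii",
            new EncoderReplacementFallback(""),
            new DecoderReplacementFallback(""));
    }

    // Removes every non-ASCII character outright.
    public static string StripNonAscii(string text)
    {
        byte[] ascii = LossyAscii().GetBytes(text);
        return Encoding.ASCII.GetString(ascii);
    }

    // Decomposes first, so the base letter survives and only the
    // combining mark is dropped.
    public static string FoldToAscii(string text)
    {
        string normalized = text.Normalize(NormalizationForm.FormKD);
        byte[] ascii = LossyAscii().GetBytes(normalized);
        return Encoding.ASCII.GetString(ascii);
    }

    public static void Main()
    {
        string text = "T\u014Dky\u014D"; // "Tōkyō"

        Console.WriteLine(StripNonAscii(text)); // Tky
        Console.WriteLine(FoldToAscii(text));   // Tokyo
    }
}
```

Note that normalization only helps for characters that have a Unicode decomposition. A letter such as 'ø' (U+00F8) has none, so it is still dropped rather than turned into 'o'; if you need those handled, you would have to add your own mapping table.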
Jon Skeet