I would like to be able to say "Normalize this string by forcing diacritic accents into their combining form".

Details: My code is being developed in C# but I don't believe the issue to be language specific.

There are two problems with my data: (1) the diacritic precedes the base character (it needs to follow the base character in Unicode normalization forms D and KD); (2) the accent in my data is a Greek tonos (U+0384), but it needs to be the combining form (U+0301) in order to normalize.

I would like to do this programmatically. I would think that this type of operation would be well known, but I did not find support for it in the C# Globalization methods (there are normalization methods, but no way to force diacritic accents into their combining form).

Any insight would be appreciated.

A: 

I do not think that the C# Globalization methods can help you here. The issue, as you pointed out, is that U+0384 is not a combining character. It is a spacing character in its own right. This can also be seen from its compatibility decomposition (to U+0020 U+0301). The data set most likely comes from a source that would display the tonos as a diacritic on the next character. This is not "proper" according to the Unicode spec, so you'll have to convert the data yourself. I have run into a similar issue with the apostrophe: some applications use the right quotation mark instead.
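To see this for yourself, you can query the character properties. The sketch below uses Python's standard `unicodedata` module only because it makes the properties easy to inspect; the constant and variable names are mine, not from any API.

```python
import unicodedata

GREEK_TONOS = "\u0384"      # the spacing accent found in the data
COMBINING_ACUTE = "\u0301"  # the combining accent that normalization expects

# Canonical combining class: 0 means "not a combining mark".
tonos_class = unicodedata.combining(GREEK_TONOS)
acute_class = unicodedata.combining(COMBINING_ACUTE)

# Raw decomposition mapping from the Unicode Character Database.
# The "<compat>" tag marks it as a compatibility (not canonical) mapping.
tonos_decomposition = unicodedata.decomposition(GREEK_TONOS)

# Only the compatibility forms (NFKD/NFKC) touch U+0384, and they yield
# SPACE + U+0301, so the accent lands on a space, not on the Greek letter.
nfd = unicodedata.normalize("NFD", GREEK_TONOS)
nfkd = unicodedata.normalize("NFKD", GREEK_TONOS)

print(tonos_class, acute_class, tonos_decomposition, repr(nfd), repr(nfkd))
```

So even a compatibility normalization would not attach the accent to the following letter, which is why the data has to be rewritten first.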

The data conversion is not hard; I'm sure you can code that up. I would use a stateful converter and stream the data through it. When U+0384 is detected, it does not get emitted. Instead, you switch to the "tonos" state and emit U+0301 after the NEXT character. There are error conditions to be handled (runs of U+0384, end of data while in the "tonos" state). The converted data can then be normalized with the usual APIs. Good luck.
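A minimal sketch of such a converter, written in Python for brevity (in C# the final step would be `string.Normalize(NormalizationForm.FormC)` or `FormD`). The function names and both error policies are assumptions of mine: a run of U+0384 is collapsed to a single accent, and a trailing U+0384 with nothing after it is kept unchanged. It also does not check that the following character is actually a letter.

```python
import unicodedata

GREEK_TONOS = "\u0384"      # spacing accent that precedes the base character
COMBINING_ACUTE = "\u0301"  # combining accent that must follow the base character

def fix_leading_tonos(text: str) -> str:
    """Move each leading U+0384 behind the next character as U+0301."""
    out = []
    in_tonos_state = False
    for ch in text:
        if ch == GREEK_TONOS:
            # Do not emit it; remember it. A run of U+0384 collapses
            # into one pending accent (assumed policy).
            in_tonos_state = True
            continue
        out.append(ch)
        if in_tonos_state:
            # Emit the combining accent after the NEXT character.
            out.append(COMBINING_ACUTE)
            in_tonos_state = False
    if in_tonos_state:
        # End of data while in the "tonos" state: keep the original
        # character rather than drop it silently (assumed policy).
        out.append(GREEK_TONOS)
    return "".join(out)

def normalize_greek(text: str, form: str = "NFC") -> str:
    """Repair the accents, then apply a standard normalization form."""
    return unicodedata.normalize(form, fix_leading_tonos(text))
```

For example, `normalize_greek("\u0384\u03b1")` turns tonos + alpha into the single precomposed character U+03AC, while passing `form="NFD"` leaves it as alpha followed by U+0301.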

Dominik Weber