In my VB.NET application I compare words that are recorded in IPA, many of which carry several diacritic marks. In one of the comparisons, I compare the words character by character. But when I iterate over the characters, the diacritic marks come out as separate characters (as I would expect, since this is Unicode):

o`ku`ku`

However, for the purposes of this program, a plain u is a different character from a u plus an accent, and the two need to be distinguished.

Is there a good way to iterate over Unicode strings so that a character and its accents are treated as one character? I'm trying to avoid hardcoding every combination that should count as a single character.

Edit:

The Normalize() method does work for characters with simple diacritic marks that have a single-character Unicode representation, such as most accented vowels. However, this does not work for more obscure symbols, like and .
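The limitation described in this edit can be reproduced outside .NET. The following is a minimal sketch in Python (chosen only for illustration, not the asker's VB.NET code) using the standard `unicodedata` module. The two sample strings are assumptions picked for the demonstration: u followed by a combining grave accent, and ɔ followed by a combining tilde.

```python
import unicodedata

def nfc_length(text):
    """Return the number of code points left after NFC composition."""
    return len(unicodedata.normalize("NFC", text))

# u + COMBINING GRAVE ACCENT has a precomposed form (U+00F9),
# so NFC collapses the pair into a single code point.
simple = "u\u0300"

# LATIN SMALL LETTER OPEN O + COMBINING TILDE has no precomposed form,
# so NFC leaves it as two code points.
obscure = "\u0254\u0303"

print(nfc_length(simple))   # 1
print(nfc_length(obscure))  # 2
```

Normalization therefore helps only where Unicode happens to define a precomposed character; it is not a general way to treat a base letter and its marks as one unit.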

+3  A: 

That's what String.Normalize() takes care of. You can use the Normalize(NormalizationForm) overload to control the normalization form explicitly.

Hans Passant
This works for simple accent marks, but some of the more complicated IPA characters are not combined (because there is no single-character representation for them). For example, the ˤ modifier does not get combined. I will update my question to reflect this.
dvcolgan
How could that be a problem? The words you're trying to compare won't have the combining glyph either.
Hans Passant
The problem is that the combining glyphs are important information for the purposes of this program, and not having them changes the calculations. ɔ̃ is a completely different character than ɔ.
dvcolgan
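One way to keep the marks while still comparing "character by character" is to group each base character with the marks that follow it, rather than composing them. The sketch below is written in Python for illustration only. The function name `text_elements` and its `attach_modifier_letters` flag are hypothetical, and the decision to also attach modifier letters is an assumption made for IPA data: ˤ (U+02E4) is a spacing modifier letter (general category Lm), not a combining mark, so mark-based grouping alone leaves it separate.

```python
import unicodedata

def text_elements(word, attach_modifier_letters=True):
    """Split a word into base characters with their trailing marks attached.

    Combining marks (general categories Mn, Mc, Me) always join the
    preceding element. Modifier letters (category Lm, e.g. U+02E4) join it
    only when attach_modifier_letters is True; that rule is an assumption
    for IPA text, not standard Unicode segmentation.
    """
    elements = []
    for ch in word:
        category = unicodedata.category(ch)
        is_mark = category.startswith("M")
        is_modifier = attach_modifier_letters and category == "Lm"
        if elements and (is_mark or is_modifier):
            elements[-1] += ch   # glue onto the previous base character
        else:
            elements.append(ch)  # start a new element
    return elements

# Open o + combining tilde stays together as one element.
print(text_elements("\u0254\u0303"))
# t + U+02E4 stays together only when modifier letters are attached.
print(text_elements("t\u02e4a"))
print(text_elements("t\u02e4a", attach_modifier_letters=False))
```

Two words can then be compared element by element, so ɔ̃ and ɔ count as different units. In .NET, `System.Globalization.StringInfo.GetTextElementEnumerator` iterates a string by text elements in a similar base-plus-combining-marks fashion and may be the closer fit for VB.NET. Whether it groups modifier letters such as ˤ would need to be checked, and a custom rule like the one above may still be required for them.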