I asked a question about string normalization that has already been answered, but I still cannot correctly normalize Korean characters that require three keystrokes.
With the input "ㅁㅜㄷ" (from keystrokes "ane"), it comes out as "무ㄷ" instead of "묻".
With the input "ㅌㅐㅇ" (from keystrokes "xod"), it comes out as "태ㅇ" instead of "탱".

This is Mr. Dean's answer. It worked on the example I gave at first, but it doesn't work with the ones I cited above.

If you are using .NET, the following will work:

var s = "ㅌㅐㅇ";
s = s.Normalize(NormalizationForm.FormKC);

In native Win32, the corresponding call is NormalizeString:

const wchar_t *input = L"ㅌㅐㅇ";  /* wide string literal needs the L prefix */
wchar_t output[100];
NormalizeString(NormalizationKC, input, -1, output, 100);

NormalizeString is only available in Windows Vista+. You need the "Microsoft Internationalized Domain Name (IDN) Mitigation APIs" installed if you want to use it on XP (why it's in the IDN download, I don't understand...)

Note that neither of these methods actually requires use of the IME - they work regardless of whether you've got the Korean IME installed or not.

This is the code I'm using in Delphi (on XP):

      var
        buf: array [0..20] of char;
        temporary: PWideChar;
      const
        NORMALIZATIONKC = 5;
      ...
      temporary:='ㅌㅐㅇ';
      NormalizeString(NORMALIZATIONKC , temporary, -1, buf, 20);
      showmessage(buf);

Is this a bug? Is there something incorrect in my code? Does the code run correctly on your computer? If so, in what language, and which Windows version are you using?

+2  A: 

The jamo you're using (ㅌㅐㅇ) are in the block called Hangul Compatibility Jamo, which exists for compatibility with legacy code pages. If you take your target character and decompose it (using NFKD), you get jamo from the block Hangul Jamo (ᄐ ᅢ ᆼ, sans the spaces, which are just there to prevent the browser from normalizing it), and these can be re-composed just fine.
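
A quick way to see the two blocks side by side is to print the code points. The sketch below uses Python's standard unicodedata module, which is an assumption made for illustration (the question itself is about .NET, Win32 and Delphi), and codepoints is just a hypothetical helper name.

```python
import unicodedata

def codepoints(s):
    """Return the code points of s as 'U+XXXX' strings."""
    return [f"U+{ord(c):04X}" for c in s]

typed = "\u314c\u3150\u3147"   # ㅌㅐㅇ as Hangul Compatibility Jamo
target = "\ud0f1"              # 탱, the precomposed syllable

# What the IME produced: code points in the compatibility block (U+31xx).
print(codepoints(typed))

# What the target decomposes to: code points in the Hangul Jamo block (U+11xx).
print(codepoints(unicodedata.normalize("NFKD", target)))
```

If the two lists differ in more than the block, that difference is what the composition step trips over.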

Unicode 5.2 states:

When Hangul compatibility jamo are transformed with a compatibility normalization form, NFKD or NFKC, the characters are converted to the corresponding conjoining jamo characters.

(...)

Table 12-11 illustrates how two Hangul compatibility jamo can be separated in display, even after transforming them with NFKD or NFKC.

This suggests that NFKC should combine them correctly by treating them as regular Jamo, but Windows doesn't appear to be doing that. However, using NFKD does appear to convert them to the normal Jamo, and you can then run NFKC on it to get the right character.
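
One thing worth checking before relying on the two-step approach: NFKC is defined as a compatibility decomposition followed by canonical composition, so on a conforming implementation NFKD followed by NFKC should give the same result as NFKC alone. The sketch below uses Python's unicodedata (an assumption, not the Windows API discussed here) to inspect which jamo the third keystroke actually turns into.

```python
import unicodedata

typed = "\u314c\u3150\u3147"   # ㅌㅐㅇ as compatibility jamo

one_step = unicodedata.normalize("NFKC", typed)
two_step = unicodedata.normalize("NFKC", unicodedata.normalize("NFKD", typed))

# Inspect both results code point by code point, with character names.
for label, s in (("NFKC", one_step), ("NFKD then NFKC", two_step)):
    print(label, [f"U+{ord(c):04X} {unicodedata.name(c)}" for c in s])
```

In Python's Unicode data the third character comes out as U+110B HANGUL CHOSEONG IEUNG, a leading consonant rather than the trailing U+11BC, so it does not attach to the preceding syllable. Whether Windows behaves the same way is not something this sketch can confirm.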

Since those characters appear to come from an external program (the IME), I would suggest you either do a manual pass to convert those compatibility Jamo, or start by doing NFKD, then NFKC. Alternatively, you may be able to reconfigure the IME to output "normal" Jamo instead of compatibility Jamo.
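
A manual pass along those lines could look roughly like the following. This is only a sketch in Python, under several assumptions: the function and table names are hypothetical, a consonant is treated as a syllable-final when it follows a vowel and is not itself followed by a vowel, and two-keystroke compound vowels or compound finals (for example ㅗ + ㅏ, or ㄱ + ㅅ) are not merged.

```python
import unicodedata

# Compatibility consonants that can close a syllable: U+3131..U+314E minus
# the three that are never finals. In order, they pair up with the
# conjoining trailing consonants U+11A8..U+11C2.
_FINAL_CAPABLE = [cp for cp in range(0x3131, 0x314F)
                  if cp not in (0x3138, 0x3143, 0x3149)]
COMPAT_TO_TRAILING = {
    chr(src): chr(dst)
    for src, dst in zip(_FINAL_CAPABLE, range(0x11A8, 0x11C3))
}

def is_compat_vowel(ch):
    """True for the compatibility vowels U+314F..U+3163."""
    return "\u314f" <= ch <= "\u3163"

def compose_compat_jamo(text):
    """Map compatibility jamo to positional conjoining jamo, then compose."""
    out = []
    prev_was_vowel = False
    for i, ch in enumerate(text):
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if (prev_was_vowel and ch in COMPAT_TO_TRAILING
                and not is_compat_vowel(nxt)):
            # Consonant after a vowel, not starting a new syllable: a final.
            out.append(COMPAT_TO_TRAILING[ch])
        elif "\u3131" <= ch <= "\u3163":
            # Other compatibility jamo: use their standard decomposition.
            out.append(unicodedata.normalize("NFKD", ch))
        else:
            out.append(ch)
        prev_was_vowel = is_compat_vowel(ch)
    # Canonical composition joins leading + vowel (+ trailing) into syllables.
    return unicodedata.normalize("NFC", "".join(out))

print(compose_compat_jamo("\u314c\u3150\u3147"))   # input ㅌㅐㅇ
print(compose_compat_jamo("\u3141\u315c\u3137"))   # input ㅁㅜㄷ
```

Note that the final NFC call also normalizes any other text in the string, so it is best applied only to the keystroke buffer.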

Michael Madsen
For some strange reason, I cannot get NFKC to work after NFKD (it just refuses to normalize again). I also cannot get my IME (Korean Input System IME 2002) to output regular jamo. I'm thinking of replacing it, but I'm still trying to figure out how to install my downloads (I can't understand Korean!). Thank you for answering!
Dian