How do you sort Chinese, Japanese and Korean (CJK) characters in Perl?

As far as I can tell, sorting CJK characters by stroke count, then by radical, seems to be the usual way these characters are ordered. There are also methods that sort by pronunciation, but these seem less common.

I've tried using:

perl -e 'print join(" ", sort qw(工 然 一 人 三 古 二 )), "\n";'
# Prints: 一 三 二 人 古 工 然 which is incorrect

And I've tried using Unicode::Collate from CPAN, but it says:

By default, CJK Unified Ideographs are ordered in Unicode codepoint order...

If I could get a database of stroke counts per character, I could easily sort all of the characters, but this doesn't seem to come with Perl, nor is it provided by any module I could find.

If you know how to sort CJK in other languages, it would be helpful to mention it in an answer to this question.

+2  A: 

See TR38 for the dirty details and corner cases. It's not as easy as you think, nor as easy as this code sample makes it look.

use 5.010;
use utf8;
use Encode;
use Unicode::Unihan;
my $u = Unicode::Unihan->new;

# RSUnicode returns "radical.residual_strokes"; split it on the dot.
say encode_utf8 sprintf "Character $_ has the radical #%s and %d residual strokes.",
    split /[.]/, $u->RSUnicode($_)
    for qw(工 然 一 人 三 古 二);
__END__
Character 工 has the radical #48 and 0 residual strokes.
Character 然 has the radical #86 and 8 residual strokes.
Character 一 has the radical #1 and 0 residual strokes.
Character 人 has the radical #9 and 0 residual strokes.
Character 三 has the radical #1 and 2 residual strokes.
Character 古 has the radical #30 and 2 residual strokes.
Character 二 has the radical #7 and 0 residual strokes.

See http://en.wikipedia.org/wiki/List_of_Kangxi_radicals for a mapping from radical ordinal number to stroke count.
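To make the "stroke count, then radical" ordering from the question concrete, here is a minimal sketch. It does not query any database: the `%stroke_info` table and the `sort_by_strokes` helper are hypothetical names, and the total stroke counts are entered by hand for these seven characters only. The radical numbers are the ones shown in the output above. A real program would look both values up in Unihan data (for example the kTotalStrokes and kRSUnicode fields) instead of hard-coding them.

```perl
use strict;
use warnings;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';

# Hand-entered for illustration only: [total strokes, Kangxi radical number].
# A real program would fetch these from Unihan data rather than hard-code them.
my %stroke_info = (
    '一' => [ 1,  1 ],
    '二' => [ 2,  7 ],
    '人' => [ 2,  9 ],
    '三' => [ 3,  1 ],
    '工' => [ 3,  48 ],
    '古' => [ 5,  30 ],
    '然' => [ 12, 86 ],
);

# Compare by total stroke count first, then by radical number.
sub by_stroke_then_radical {
    my ($x, $y) = @_;
    return $stroke_info{$x}[0] <=> $stroke_info{$y}[0]
        || $stroke_info{$x}[1] <=> $stroke_info{$y}[1];
}

# Return the given characters sorted with the comparison above.
sub sort_by_strokes {
    return sort { by_stroke_then_radical($a, $b) } @_;
}

print join(' ', sort_by_strokes(qw(工 然 一 人 三 古 二))), "\n";
# Prints: 一 二 人 三 工 古 然
```

Characters missing from the table would trigger warnings here, so the sketch only covers the sample list. As TR38 warns, real data has corner cases (variant stroke counts, multiple radical assignments) that this ignores.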

daxim
Do you know how to use the Unicode::Collate module? Specifically do you know how to pass a sub{} as the overrideCJK parameter, and have it actually run when Unicode::Collate->sort() is run? I could use Unicode::Unihan to get the stroke count and radical info to actually sort characters, but the overrideCJK function doesn't execute.
Neil
No, but you can [open a new question](http://stackoverflow.com/questions/ask) for that topic.
daxim
Considering how silly the question is, an answer as silly as this deserves to be accepted. There is no meaning to the notion of "sorting CJK characters".
Kinopiko
The bigger part of the question is about sorting by stroke count, which is easily achieved. Don't make me call you a fool.
daxim
@daxim: Do you have a specific example of where someone has needed or would ever need to sort Chinese characters without regard to the underlying language? It's a silly question, and a silly answer.
Kinopiko
@Kinopiko: I meant "sorting CJK phrases", which you need to do in the same situations where you sort English phrases, such as in the index of a book, or whenever you want to write a list where people can find things. However, to sort a phrase you first need to be able to sort characters.
Neil
@Neil: If you want to sort Japanese phrases, there is an answer for that. If you want to sort Chinese phrases, that is another question. If you want to sort Korean phrases, that is another question. But there is no such thing as "sorting CJK phrases" - it doesn't mean anything to sort words from three different languages.
Kinopiko
+1  A: 

A Japanese phonebook is sorted on a phonetic basis (gojûon collation). However, kanji character order is not based on phonetics in any encoding, whether Unicode, JIS, S-JIS or EUC. Only kana are arranged in phonetic order. This means you cannot collate meaningfully without phonetic conversion!

For example:

a) kanji:           東京駅
b) kana converted:  とうきょうえき
c) romanisation:    tôkyô eki

With b) or c), you can sort meaningfully, but you cannot with a) alone. Of course, you can run the plain sort function on the kanji, but the result is not meaningful for Japanese.
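As a rough illustration of sorting on form b), the sketch below sorts kanji words by their kana reading. The `%reading` table and the `sort_by_reading` helper are hypothetical and hand-entered; in practice the readings would have to come from a dictionary or from the data itself. It also assumes that comparing hiragana by codepoint is close enough to gojûon order for these examples, which does not hold in general (voiced and small kana, for instance, need proper collation rules).

```perl
use strict;
use warnings;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical, hand-entered reading table: kanji word => hiragana reading.
my %reading = (
    '東京駅'   => 'とうきょうえき',
    '大阪駅'   => 'おおさかえき',
    '京都駅'   => 'きょうとえき',
    '名古屋駅' => 'なごやえき',
);

# Sort words by their reading; codepoint order of hiragana is only an
# approximation of gojuon collation.
sub sort_by_reading {
    return sort { $reading{$a} cmp $reading{$b} } @_;
}

print join(' ', sort_by_reading(qw(東京駅 名古屋駅 京都駅 大阪駅))), "\n";
# Prints: 大阪駅 京都駅 東京駅 名古屋駅
```

Sorting the same four words directly on their kanji would instead follow codepoint order, which has no relation to how they are read.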

kmugitani
That's answering a sane question, "How do you sort Japanese words?", but it doesn't answer the question which was actually asked, so I can't upvote it.
Kinopiko
@Kinopiko: Yes, I have to agree with you. The original question is not a good one.
kmugitani