ansaurus

Question

How do you sort CJK (Asian) characters in Perl, or with any other programming language?

Answer 1

+2 A:

See TR38 (Unicode Standard Annex #38, the Unihan database) for the dirty details and corner cases. It is not as easy as you might think, or as this code sample makes it look.

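use 5.010;
use utf8;    # the source contains literal CJK characters
use Encode;
use Unicode::Unihan;
my $u = Unicode::Unihan->new;

# RSUnicode returns "radical.residual_strokes", e.g. "86.8" for 然.
for my $char (qw(工 然 一 人 三 古 二)) {
    my ($radical, $residual) = split /[.]/, $u->RSUnicode($char);
    say encode_utf8 sprintf "Character %s has the radical #%s and %d residual strokes.",
        $char, $radical, $residual;
}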
__END__
Character 工 has the radical #48 and 0 residual strokes.
Character 然 has the radical #86 and 8 residual strokes.
Character 一 has the radical #1 and 0 residual strokes.
Character 人 has the radical #9 and 0 residual strokes.
Character 三 has the radical #1 and 2 residual strokes.
Character 古 has the radical #30 and 2 residual strokes.
Character 二 has the radical #7 and 0 residual strokes.

See http://en.wikipedia.org/wiki/List_of_Kangxi_radicals for a mapping from radical ordinal number to stroke count.
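As a rough illustration of how the two pieces fit together, the sketch below sorts the seven sample characters by total stroke count (radical strokes plus residual strokes), breaking ties by radical number and then by code point. To stay self-contained it hard-codes the radical.residual values from the output above instead of calling Unicode::Unihan, and the `%radical_strokes` table covers only the six radicals involved, with counts taken from the Kangxi radical list. The names `%rs_unicode`, `%radical_strokes` and `sort_key` are made up for this example, and real RSUnicode data can carry several values per character, which this sketch does not handle.

```perl
use 5.010;
use strict;
use warnings;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';

# Radical.residual-stroke values copied from the sample output above
# (assumption: one value per character, no variant-radical markers).
my %rs_unicode = (
    '工' => '48.0', '然' => '86.8', '一' => '1.0', '人' => '9.0',
    '三' => '1.2',  '古' => '30.2', '二' => '7.0',
);

# Stroke counts of the six radicals used here, from the Kangxi radical list.
my %radical_strokes = (1 => 1, 7 => 2, 9 => 2, 30 => 3, 48 => 3, 86 => 4);

# Sort key: [total strokes, radical number, code point].
sub sort_key {
    my ($char) = @_;
    my ($radical, $residual) = split /[.]/, $rs_unicode{$char};
    return [ $radical_strokes{$radical} + $residual, $radical, ord $char ];
}

my %key = map { $_ => sort_key($_) } keys %rs_unicode;

my @sorted = sort {
       $key{$a}[0] <=> $key{$b}[0]
    || $key{$a}[1] <=> $key{$b}[1]
    || $key{$a}[2] <=> $key{$b}[2]
} keys %key;

say join ' ', @sorted;    # 一 二 人 三 工 古 然
```

Stroke-count order is only one of several conventions; whether it is the right one depends on the language and the audience.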

daxim 2010-10-08 19:34:57

Do you know how to use the Unicode::Collate module? Specifically do you know how to pass a sub{} as the overrideCJK parameter, and have it actually run when Unicode::Collate->sort() is run? I could use Unicode::Unihan to get the stroke count and radical info to actually sort characters, but the overrideCJK function doesn't execute.

Neil 2010-10-08 20:28:42

No, but you can [open a new question](http://stackoverflow.com/questions/ask) for that topic.

daxim 2010-10-08 21:04:27

Considering how silly the question is, an answer as silly as this deserves to be accepted. There is no meaning to the notion of "sorting CJK characters".

Kinopiko 2010-10-09 16:13:41

The bigger part of the question is about sorting by stroke count, which is easily achieved. Don't make me call you a fool.

daxim 2010-10-09 16:22:41

@daxim: Do you have a specific example of where someone has needed or would ever need to sort Chinese characters without regard to the underlying language? It's a silly question, and a silly answer.

Kinopiko 2010-10-10 00:23:40

@Kinopiko: I meant "sorting CJK phrases", which you need to do in the same situations where you sort English phrases, such as in the index of a book, or whenever you want to write a list where people can find things. However, to sort a phrase you first need to sort characters.

Neil 2010-10-11 04:39:29

@Neil: If you want to sort Japanese phrases, there is an answer for that. If you want to sort Chinese phrases, that is another question. If you want to sort Korean phrases, that is another question. But there is no such thing as "sorting CJK phrases" - it doesn't mean anything to sort words from three different languages.

Kinopiko 2010-10-11 06:13:04

Answer 2

+1 A:

A Japanese phonebook is sorted on a phonetic basis (gojûon collation). However, the order of kanji in a character set is not based on phonetics, whether in Unicode, JIS, S-JIS or EUC. Only kana are laid out in phonetic order. This means you cannot collate meaningfully without phonetic conversion!

For example:

a) kanji:           東京駅
b) kana converted:  とうきょうえき
c) romanisation:    tôkyô eki

With b) or c), you can make a meaningful sort, but you cannot with a) alone. Of course, you can run the plain sort function on a), but the result is not meaningful for Japanese.
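A minimal sketch of the idea, assuming the kana reading of each word is already known: the `%reading` table below is hand-written for illustration (a real program would need a dictionary or manual annotation to get from kanji to kana). The words are sorted by their readings and, for comparison, by raw code point. Plain `cmp` on hiragana only approximates gojûon order; it is enough to show the difference here but is not a full collation.

```perl
use 5.010;
use strict;
use warnings;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical, hand-written kanji => kana readings.
my %reading = (
    '東京駅'   => 'とうきょうえき',
    '大阪駅'   => 'おおさかえき',
    '京都駅'   => 'きょうとえき',
    '名古屋駅' => 'なごやえき',
);

# Sort on the phonetic form, as in b) above.
my @by_reading = sort { $reading{$a} cmp $reading{$b} } keys %reading;

# Plain sort on the kanji, as in a): code point order, not phonetic.
my @by_codepoint = sort keys %reading;

say "by reading:    @by_reading";      # 大阪駅 京都駅 東京駅 名古屋駅
say "by code point: @by_codepoint";    # 京都駅 名古屋駅 大阪駅 東京駅
```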

kmugitani 2010-10-09 06:56:55

That's answering a sane question, "How do you sort Japanese words?", but it doesn't answer the question which was actually asked, so I can't upvote it.

Kinopiko 2010-10-09 16:14:42

@Kinopiko: Yeah, I have to agree with you. The original question is not a good one.

kmugitani 2010-10-10 07:07:31
