ansaurus

Question

What does sorting mean in double-byte languages?

Answer 1

+1 A:

Yes, the characters get compared. They are usually compared by their Unicode code points, though, and those are ordered quite differently for hiragana and kanji, which makes the sort potentially useless in Japanese. (Kanji were borrowed from Chinese, but the order in which they appear in Chinese doesn't correspond to the order of the hiragana that would represent the same meaning.) There are collations that can treat some characters as "equal" for comparison purposes, but I don't know of one that considers a kanji equivalent to the hiragana that make up its pronunciation, especially since a character can have a number of different pronunciations.

In Chinese or Korean, or other languages that don't have 3 different alphabets (one of which is quite irregular), it'd probably be less of an issue.

cHao 2010-09-21 20:46:58

Answer 2

+1 A:

Those are sorted by code point value, ascending. This is certainly meaningless for human readers. It's not impossible to devise a sensible sorting scheme for Japanese, but sorting Chinese characters is hard (partly because we don't necessarily know whether we're looking at Japanese or Chinese), and a lot of programmers punt to this solution.

Chuck 2010-09-21 20:48:06

Answer 3

+3 A:

Strings are compared character by character where the code point value defines the order:

The comparison of strings uses a simple lexicographic ordering on sequences of code point value values. There is no attempt to use the more complex, semantically oriented definitions of character or string equality and collating order defined in the Unicode specification. Therefore strings that are canonically equal according to the Unicode standard could test as unequal. In effect this algorithm assumes that both strings are already in normalised form.

If you need more than this, you will need to use a string comparison that can take collations into account.
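As a rough illustration of the difference, the sketch below contrasts the default code-unit comparison with a collation-aware one. It assumes a runtime that implements the ECMAScript Internationalization API and ships Japanese locale data, so that String.prototype.localeCompare accepts a locale argument; the two sample characters are chosen only for illustration.

```javascript
// Katakana "a" (U+30A2) and hiragana "ka" (U+304B).
const kana = ['ア', 'か'];

// Default comparison: plain UTF-16 code unit order, so U+304B sorts before U+30A2.
const byCodeUnit = [...kana].sort();

// Collation-aware comparison: asks the runtime for Japanese ordering rules.
// Assumption: locale data for 'ja' is available in this runtime.
const byCollation = [...kana].sort((a, b) => a.localeCompare(b, 'ja'));

console.log(byCodeUnit);  // か first, because its code unit value is lower
console.log(byCollation); // ア ("a") first, following syllabary order
```

The collation-aware version still knows nothing about how a kanji is read; it only orders characters according to the locale's collation rules.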

Gumbo 2010-09-21 20:54:05

Thanks very much for a thoughtful and comprehensive answer. Please see the addendum to my question.

Robusto 2010-09-22 13:44:16

Answer 4

A:

Recall that in JavaScript you can pass sort() a function in which you implement the comparison yourself, in order to achieve a sort that matters to humans:

myarray.sort(function (a, b) {
    // return 0, 1, or -1 based on the comparison of the two strings
});

James Connell 2010-09-21 20:54:51

Thanks, but I already know how to compare two strings in a sort function. What I'm trying to get at is what the comparison should strive for in comparing two double-byte values in order to be useful to the reader of the language.

Robusto 2010-09-21 21:00:39

Answer 5

+18 A:

Does one double-byte character really get compared against the other in a sort function?

The native String type in JavaScript is based on UTF-16 code units, and that's what gets compared. For characters in the Basic Multilingual Plane (which all these are), this is the same as Unicode code points.
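The distinction between code units and code points only shows up outside the Basic Multilingual Plane. A minimal sketch of that edge case follows; compareByCodePoint is a hypothetical helper written for this example, not a built-in.

```javascript
// Hypothetical helper: compare two strings by Unicode code point
// instead of by UTF-16 code unit.
function compareByCodePoint(a, b) {
  // Array.from iterates a string by code point, so surrogate pairs stay together.
  const pa = Array.from(a, ch => ch.codePointAt(0));
  const pb = Array.from(b, ch => ch.codePointAt(0));
  const n = Math.min(pa.length, pb.length);
  for (let i = 0; i < n; i++) {
    if (pa[i] !== pb[i]) return pa[i] - pb[i];
  }
  return pa.length - pb.length;
}

const bmp = '\uFF5E';       // U+FF5E, a single code unit
const astral = '\u{20000}'; // U+20000 (CJK Extension B), stored as the pair D840 DC00

// Default sort compares code units: 0xD840 < 0xFF5E, so the astral character comes first.
const byCodeUnit = [bmp, astral].sort();

// Code point comparison: 0xFF5E < 0x20000, so the BMP character comes first.
const byCodePoint = [bmp, astral].sort(compareByCodePoint);

console.log(byCodeUnit[0] === astral); // true
console.log(byCodePoint[0] === bmp);   // true
```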

The term ‘double-byte’ as in encodings like Shift-JIS has no meaning in a web context: DOM and JavaScript strings are natively Unicode, and the original bytes in the encoded page received by the browser are long gone.

Does the result of such a sort mean anything at all?

Little. Unicode code points do not claim to offer any particular ordering... for one, because there is no globally-accepted ordering. Even for the most basic case of ASCII Latin characters, languages disagree (eg. on whether v and w are the same letter, or whether the uppercase of i is I or İ). And CJK gets much gnarlier than that.

The main Unicode CJK Unified Ideographs block happens to be ordered by radical and number of strokes (Kangxi dictionary order), which may be vaguely useful. But use characters from any of the other CJK extension blocks, or mix in some kana, or romaji, and there will be no meaningful ordering between them.
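A small sketch of what this looks like in practice, using the default sort and a few sample strings picked for illustration:

```javascript
// The same word in four scripts: kanji, katakana, hiragana, romaji.
const mixed = ['日本', 'ニホン', 'にほん', 'nihon'];

// Default sort groups by Unicode block: Latin (U+006E), hiragana (U+306B),
// katakana (U+30CB), then the CJK Unified Ideographs block (U+65E5).
const sortedMixed = [...mixed].sort();
console.log(sortedMixed); // ['nihon', 'にほん', 'ニホン', '日本']

// Within the main CJK block the order is radical, then stroke count.
// The numerals two, three, one therefore do not come out in numeric order:
// 一 is U+4E00, 三 is U+4E09 (same radical as 一), 二 is U+4E8C (a later radical).
const numerals = ['二', '三', '一'];
const sortedNumerals = [...numerals].sort();
console.log(sortedNumerals); // ['一', '三', '二'], i.e. one, three, two
```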

The Unicode Consortium do attempt to define some general ordering rules, but it's complex and not generally attempted at a language level. Systems that really need language-sensitive sorting abilities (eg. OSes, databases) tend to have their own collation schemes.

This is different from the ordering of the Japanese syllabary

Yes. Above and beyond collation issues in general, it's a massively difficult task to handle kanji accurately by syllable, because you have to guess at the pronunciation. JavaScript can't realistically know that by ‘藤本’ you mean ‘Fujimoto’ and not ‘touhon’; this sort of thing requires in-depth built-in dictionaries and still-unreliable heuristics... not the sort of thing you want to build in to a programming language.

bobince 2010-09-21 21:17:18

Thanks very much for a thoughtful and comprehensive answer. Please see the addendum to my question.

Robusto 2010-09-22 13:43:21

Also, you're right that the different readings (onyomi and kunyomi) for each character would make it virtually impossible to aim at anything like a phonetic ordering in Japanese. I hadn't thought of that, but I should have.

Robusto 2010-09-22 13:49:19

Answer 6

+6 A:

You could implement the Unicode Collation Algorithm in JavaScript if you want something better than the default JS sort for strings. It might improve some things, though as the Unicode doc states:

Collation is not uniform; it varies according to language and culture: Germans, French and Swedes sort the same characters differently. It may also vary by specific application: even within the same language, dictionaries may sort differently than phonebooks or book indices. For non-alphabetic scripts such as East Asian ideographs, collation can be either phonetic or based on the appearance of the character.
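The point about German and Swedish can be seen directly in runtimes that implement the ECMAScript Internationalization API. This is only a sketch, and it assumes Intl.Collator is present with German and Swedish locale data.

```javascript
const letters = ['z', 'a', 'ä'];

// German treats ä as a variant of a, so it sorts next to a.
// Assumption: the runtime ships collation data for 'de' and 'sv'.
const german = [...letters].sort(new Intl.Collator('de').compare);

// Swedish treats ä as a separate letter that comes after z.
const swedish = [...letters].sort(new Intl.Collator('sv').compare);

console.log(german);  // ['a', 'ä', 'z']
console.log(swedish); // ['a', 'z', 'ä']
```

The same three strings yield two different "correct" orders, which is why a comparison function needs to know the target language rather than just the characters.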

The Wikipedia article points out that since collation is so tough in non-alphabetic scripts, nowadays the answer is to make it very easy to look up information by entering characters, rather than by looking through a list.

I suggest that you talk to truly knowledgeable end users of your application to see how they would best like it to behave. The problem of ordering Chinese characters is not unique to your application.

Also, if you don't want to implement the collation in your system, another solution would be to create an Ajax service that stores the names in MySQL or another database, then retrieves the data with an ORDER BY clause.

Larry K 2010-09-21 21:18:07

Thanks very much for a thoughtful and comprehensive answer. Please see the addendum to my question.

Robusto 2010-09-22 13:43:57

Answer 7

+3 A:

Others have answered the other questions; I will take on this one:

what should one strive for in creating a compare function for those languages?

One way is to create a program that can "read" the characters, that is, one able to map hanzi/kanji characters to their "sound" (pinyin/hiragana reading). At the simplest level, this means a database that maps hanzi/kanji to sounds. Of course this is more difficult than it sounds (pun not intended), since a lot of characters have different pronunciations in different contexts, and Chinese has many different dialects to consider.

Another way is to order by stroke order. This requires a database that maps hanzi/kanji to their strokes. One problem: Chinese and Japanese write some characters with different stroke orders. Aside from that difference, however, stroke ordering is much more consistent within a single text, since hanzi/kanji characters are almost always written with the same stroke order irrespective of what they mean or how they are read. A similar idea is to sort by radicals instead of plain stroke order.

The third way is to sort by Unicode code points. This is simple and always gives an indisputably consistent ordering; the problem is that the sort order is meaningless to humans.

The last way is to rethink the need for an absolute ordering and just use some heuristic to sort by relevance to the user's needs. For example, in shopping cart software, you can sort by the user's buying habits or by price. This kind of avoids the problem, but most of the time it works (unless you're compiling a dictionary).

As you may notice, the first two methods require creating a huge database of one-to-many mappings, and they still don't always give a useful result. The third method also requires a huge database, but many programming languages already have this database built in. The last way is heuristic and probably the most useful, but it is doomed never to give a consistent ordering (much worse than the first two methods).
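The first approach (mapping each entry to its reading) can be sketched as follows. The records and their readings are hypothetical sample data; the sketch assumes the reading is supplied alongside the text by a person or a dictionary, because it cannot be derived reliably from the kanji alone.

```javascript
// Hypothetical records: each name carries its own hiragana reading.
const names = [
  { text: '山田', reading: 'やまだ' },
  { text: '藤本', reading: 'ふじもと' },
  { text: '青木', reading: 'あおき' },
];

// Compare on the reading, not on the kanji. Plain code unit comparison of
// hiragana only approximates syllabary order; a locale-aware collator
// would be a better fit where one is available.
function compareByReading(a, b) {
  if (a.reading < b.reading) return -1;
  if (a.reading > b.reading) return 1;
  return 0;
}

const sortedNames = [...names].sort(compareByReading).map(n => n.text);
console.log(sortedNames); // ['青木', '藤本', '山田'] (a-o-ki, fu-ji-mo-to, ya-ma-da)
```

Storing the reading with the data sidesteps the guessing problem, at the cost of requiring someone to enter it.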

Lie Ryan 2010-09-21 21:54:28

Thanks very much for a thoughtful and comprehensive answer. Please see the addendum to my question.

Robusto 2010-09-22 13:45:22

Answer 8

+1 A:

The normal string comparison functions in many programming languages are designed to ensure that strings can be sorted into a unique order, to allow algorithms like binary search and duplicate-detection to work correctly. To sort data in a fashion meaningful to a human reader, one must know what the data represents. For example, in a list of English movie titles, "El Mariachi" would typically sort under "E", but in a list of Spanish movie titles it would sort under "M". The application will need information beyond that contained in the strings themselves to know how the strings should be sorted.
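A minimal sketch of the movie title example, assuming the application is told which language each list is in. The article table and the sortKey and sortTitles helpers are hypothetical names made up for this illustration.

```javascript
// Hypothetical table of leading articles to ignore, keyed by the list's language.
const LEADING_ARTICLES = {
  en: ['the', 'a', 'an'],
  es: ['el', 'la', 'los', 'las', 'un', 'una'],
};

// Build a sort key by dropping a leading article for the given language.
function sortKey(title, lang) {
  const words = title.toLowerCase().split(' ');
  const articles = LEADING_ARTICLES[lang] || [];
  if (words.length > 1 && articles.includes(words[0])) words.shift();
  return words.join(' ');
}

// Sort a copy of the titles by their language-specific keys.
function sortTitles(titles, lang) {
  return [...titles].sort((a, b) => {
    const ka = sortKey(a, lang);
    const kb = sortKey(b, lang);
    return ka < kb ? -1 : ka > kb ? 1 : 0;
  });
}

const titles = ['Fargo', 'El Mariachi'];
console.log(sortTitles(titles, 'en')); // ['El Mariachi', 'Fargo']: filed under E
console.log(sortTitles(titles, 'es')); // ['Fargo', 'El Mariachi']: filed under M
```

The strings are identical in both calls; only the extra piece of information (the list's language) changes the result.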

supercat 2010-09-21 22:22:07

Answer 9

+1 A:

The answers to Q1 (can you sort) and Q3 (is sort meaningful) are both "yes" for Chinese (from a mainland perspective). For Q2 (how to sort):

All Chinese characters have a definite pronunciation (some are polyphonic) as defined in pinyin, and it's far more common (as in virtually all Chinese dictionaries) to sort by pinyin, where there is no ambiguity. Characters with the same pronunciation are then sorted by stroke order.

The polyphonic characters pose an extra challenge for sorting, as their pinyin usually depends on the word they are in (I hear Japanese characters can be even hairier). For example, the character 阿 is pronounced a(1) in 阿姨 (tone in parentheses), and e(1) in 阿胶. So if you need to sort words or sentences, you cannot simply look at one character at a time from each item.
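For single characters, pinyin ordering can be sketched with the ECMAScript Internationalization API. This assumes a runtime with Chinese locale data, where the default collation for 'zh' is pinyin-based; the three surnames are sample data.

```javascript
// Three common surnames: Wang, Zhang, Li.
const surnames = ['王', '张', '李'];

// Default sort: code point order (张 U+5F20, 李 U+674E, 王 U+738B).
const byCodePoint = [...surnames].sort();

// Collation for 'zh': assumed here to be pinyin order (li, wang, zhang).
const byPinyin = [...surnames].sort(new Intl.Collator('zh').compare);

console.log(byCodePoint); // ['张', '李', '王']
console.log(byPinyin);    // ['李', '王', '张']
```

A collator of this kind assigns each character one position, so it does not resolve polyphonic characters such as 阿 by the word they appear in.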

Geoffrey Zheng 2010-09-22 02:34:04

Thanks very much for a thoughtful and comprehensive answer. Please see the addendum to my question.

Robusto 2010-09-22 13:45:45
