Does ICU handle the collation of a list of strings of varying languages?

+1 Q:

My application may have strings comprised of different alphabets / languages in a single list. I can't seem to find any information on what the correct method for sorting these should be or any indication that ICU supports this functionality.

Example List:

Apple
яблоко
μήλο
Baby
βρέφος
ребенок

+3 A:

There is no sensible way to do this well. There is no universal sort for all languages, even within the same alphabet. Different languages (cultures, basically) have come up with different collation rules for how words should be sorted.

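The only way to do this consistently at all, I think, is to use plain old codepoint sorting (e.g. in Java, String.compareTo, which compares UTF-16 code units and so matches code point order except in cases involving supplementary characters).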
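
For illustration, here is a minimal sketch of a code point sort applied to the example list from the question. The class name CodePointSort and the helper sortByCodePoint are made up for this example; the comparator walks both strings code point by code point rather than relying on String.compareTo.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class CodePointSort {
    // Orders strings by Unicode code point value (not by UTF-16 code unit).
    static final Comparator<String> BY_CODE_POINT = (a, b) -> {
        int i = 0, j = 0;
        while (i < a.length() && j < b.length()) {
            int ca = a.codePointAt(i);
            int cb = b.codePointAt(j);
            if (ca != cb) {
                return Integer.compare(ca, cb);
            }
            i += Character.charCount(ca);
            j += Character.charCount(cb);
        }
        // One string is a prefix of the other: the shorter one sorts first.
        return Integer.compare(a.length() - i, b.length() - j);
    };

    // Hypothetical helper: returns a sorted copy and leaves the input untouched.
    static List<String> sortByCodePoint(List<String> input) {
        List<String> copy = new ArrayList<>(input);
        copy.sort(BY_CODE_POINT);
        return copy;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList(
                "Apple", "яблоко", "μήλο", "Baby", "βρέφος", "ребенок");
        // prints [Apple, Baby, βρέφος, μήλο, ребенок, яблоко]
        System.out.println(sortByCodePoint(words));
    }
}
```

The result is stable and predictable, but the scripts simply end up in block order (Latin, then Greek, then Cyrillic), which is an accident of how Unicode is laid out rather than anything a user would choose.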

You could come up with some heuristics, depending on what your data represents. You could group the strings based on guesses about the alphabet and language, and then use locale-specific sorting for each group. But you'd have to do this the hard way (code it yourself), I think, because you would guess differently depending on the terms (e.g. is 'mar' the English verb or the Spanish noun?). It's conceivable that you would end up with a worse result than the naive Unicode numerical sort, in terms of unpredictable "errors".
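
A rough sketch of that heuristic, using only the JDK: guess the script from the first letter of each string, bucket by script, then sort each bucket with a java.text.Collator for a guessed locale. The script-to-locale table in guessLocale is purely an assumption for this example (Greek to "el", Cyrillic to "ru", everything else to English), and the class and method names are invented. It makes no attempt to tell apart languages that share a script, which is exactly the hard part described above.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.EnumMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class ScriptGroupedSort {
    // Assumed mapping from script to a "likely" locale; a real table depends on your data.
    static Locale guessLocale(Character.UnicodeScript script) {
        switch (script) {
            case GREEK:
                return Locale.forLanguageTag("el");
            case CYRILLIC:
                return Locale.forLanguageTag("ru");
            default:
                return Locale.ENGLISH;
        }
    }

    // Script of the first letter in the string; COMMON if there are no letters.
    static Character.UnicodeScript scriptOf(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.isLetter(cp)) {
                return Character.UnicodeScript.of(cp);
            }
            i += Character.charCount(cp);
        }
        return Character.UnicodeScript.COMMON;
    }

    // Buckets the strings by guessed script, then sorts each bucket with its own collator.
    static Map<Character.UnicodeScript, List<String>> groupAndSort(List<String> input) {
        Map<Character.UnicodeScript, List<String>> groups =
                new EnumMap<>(Character.UnicodeScript.class);
        for (String s : input) {
            groups.computeIfAbsent(scriptOf(s), k -> new ArrayList<>()).add(s);
        }
        for (Map.Entry<Character.UnicodeScript, List<String>> e : groups.entrySet()) {
            e.getValue().sort(Collator.getInstance(guessLocale(e.getKey())));
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList(
                "Apple", "яблоко", "μήλο", "Baby", "βρέφος", "ребенок");
        groupAndSort(words).forEach((script, group) ->
                System.out.println(script + ": " + group));
    }
}
```

How the groups should then be ordered relative to each other is still an open choice; this sketch leaves it to the enum order of Character.UnicodeScript.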

As with anything else, it depends on how much you can afford to put into the solution, and what kind of performance you need.

This suggestion is not the answer you're looking for: if there's any way to identify the locale when initially storing the strings, you should do so, and record it as part of the string's metadata. Then you won't have this problem.

Zac Thompson 2009-09-13 05:55:08

+2 A:

As mentioned by @Zac, there is no universal sort. A code point sort will be consistent, but may not be what the user expects.

So you should probably use the preferred sort order for the user's selected locale. Any code points not defined in that sort order will be grouped together.
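
As a sketch of what that looks like with the JDK's own java.text.Collator (ICU4J offers a similarly named Collator class, but this example sticks to the standard library): sortForUser is a made-up helper name, and where the strings from other scripts land relative to each other depends on the collation data shipped with the runtime, so no output is shown for the mixed list.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class UserLocaleSort {
    // Hypothetical helper: sorts a copy of the list using the user's locale rules.
    static List<String> sortForUser(List<String> input, Locale userLocale) {
        Collator collator = Collator.getInstance(userLocale);
        List<String> copy = new ArrayList<>(input);
        copy.sort(collator); // Collator is a Comparator<Object>, so this is allowed
        return copy;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList(
                "Apple", "яблоко", "μήλο", "Baby", "βρέφος", "ребенок", "apple");
        // With an English collator, "apple" and "Apple" sort next to each other
        // and ahead of "Baby"; a code point sort would put "apple" after "Baby".
        System.out.println(sortForUser(words, Locale.ENGLISH));
    }
}
```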

devstuff 2009-09-13 23:13:18

You could transliterate into your 'target' language (all in one script) and then sort. But languages have conflicting rules for sorting.

Steven R. Loomis 2009-10-07 17:43:28

With all the caveats above, there is one "standard universal multilingual sorting": the Unicode Collation Algorithm (UCA), which is NOT the codepoint order. From a cursory glance at this page, ICU seems to handle the mixture of UCA and local preference.

Frédéric Grosshans 2010-03-19 12:02:50
