ansaurus

Question

Python not sorting unicode properly. Strcoll doesn't help.

Answer 1

A:

On Ubuntu Lucid, sorting with cmp seems to work, but my output encoding is wrong.

>>> import locale
>>> locale.setlocale(locale.LC_ALL, 'pl_PL.UTF-8')
'pl_PL.UTF-8'
>>> print [i for i in sorted([u'a', u'z', u'ą'], cmp=locale.strcoll)]
[u'a', u'\u0105', u'z']

Using key with locale.strxfrm does not work, unless I am missing something:

>>> print [i for i in sorted([u'a', u'z', u'ą'], key=locale.strxfrm)]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0105' in position 0: ordinal not in range(128)

gnibbler 2010-08-05 09:33:31

With strxfrm you have to manually encode the unicode string to a byte string, AFAIK.

tkopczuk 2010-08-05 09:38:14
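
A minimal sketch of the `key`-based approach discussed above, written for Python 3, where `locale.strxfrm` accepts `str` directly. The helper name `sort_with_locale` is hypothetical, and the locale name `pl_PL.UTF-8` is an assumption: it has to be installed on the system, and the resulting order still depends on the platform's collation tables.

```python
import locale


def sort_with_locale(words, locale_name="pl_PL.UTF-8"):
    """Sort words with the collation rules of locale_name, if it is installed."""
    saved = locale.setlocale(locale.LC_COLLATE)  # remember the current setting
    try:
        try:
            locale.setlocale(locale.LC_COLLATE, locale_name)
        except locale.Error:
            # Locale not installed: the sort below uses whatever was active.
            pass
        # Python 3: strxfrm takes str directly.
        # Python 2 would probably need a byte string instead, e.g.
        #   key=lambda s: locale.strxfrm(s.encode("utf-8"))
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)  # restore the global state


print(sort_with_locale(["a", "z", "ą"]))
```

Because `setlocale` changes process-wide state, the sketch restores the previous setting before returning.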

@tkopczuk, it would be nice to find a way to sort using `key`, as `cmp` for `sorted` is gone in Python 3.

gnibbler 2010-08-05 10:28:26

@gnibbler, it seems to work fine with the provided functools.cmp_to_key function (`from functools import cmp_to_key`), like this: `sorted([u'a', u'z', u'ą'], key=cmp_to_key(collator.compare))`

tkopczuk 2010-08-05 11:52:50
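
To illustrate how `cmp_to_key` adapts an old-style comparison function, here is a self-contained sketch. It deliberately does not use a real collator: `polish_compare` and the hand-written `POLISH_ALPHABET` table are hypothetical stand-ins for `collator.compare`, so the example runs without any locale or ICU installation.

```python
from functools import cmp_to_key

# Hypothetical stand-in for a real collator: the 32 letters of the
# Polish alphabet in dictionary order.
POLISH_ALPHABET = "aąbcćdeęfghijklłmnńoóprsśtuwyzźż"
_RANK = {ch: i for i, ch in enumerate(POLISH_ALPHABET)}


def _ranks(word):
    # Characters outside the table sort after it, by code point.
    return [_RANK.get(ch, len(_RANK) + ord(ch)) for ch in word.lower()]


def polish_compare(left, right):
    """Old-style cmp function: negative, zero, or positive."""
    a, b = _ranks(left), _ranks(right)
    return (a > b) - (a < b)


words = ["z", "ą", "a", "żaba", "zebra", "ćma", "cena"]
print(sorted(words, key=cmp_to_key(polish_compare)))
# → ['a', 'ą', 'cena', 'ćma', 'z', 'zebra', 'żaba']
```

Any function with the same negative/zero/positive contract, such as `locale.strcoll`, can be passed to `cmp_to_key` in the same way.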

Answer 2

+4 A:

Apparently, the only way for sorting to work on all platforms is to use the ICU library with PyICU bindings (http://pyicu.osafoundation.org/).

On OS X: sudo port install py26-pyicu, minding the bug described here: https://svn.macports.org/ticket/23429 (oh the joy of using macports).

PyICU's documentation is unfortunately severely lacking, but I managed to find out how it's done:

import PyICU
collator = PyICU.Collator.createInstance(PyICU.Locale('pl_PL.UTF-8'))
print [i for i in sorted([u'a', u'z', u'ą'], cmp=collator.compare)]

which gives:

[u'a', u'ą', u'z']

Another advantage, as @bobince noted: it's thread-safe, so it remains usable when locales are set per request.

tkopczuk 2010-08-05 09:37:04
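
For Python 3, where `sorted` no longer accepts `cmp`, a sketch of the same idea using the collator's sort keys. It assumes a current PyICU release, which is imported as `icu` rather than `PyICU`; the helper name `icu_sorted` is hypothetical, and the import is guarded so the snippet still runs where PyICU is not installed.

```python
try:
    import icu  # PyICU; the answer above imports an older release as "PyICU"
except ImportError:
    icu = None


def icu_sorted(words, locale_name="pl_PL"):
    """Sort words with an ICU collator, or return None if PyICU is missing."""
    if icu is None:
        return None
    collator = icu.Collator.createInstance(icu.Locale(locale_name))
    # getSortKey returns a byte string whose ordering matches the collation.
    return sorted(words, key=collator.getSortKey)


print(icu_sorted(["a", "z", "ą"]))
```

Using `getSortKey` as the `key` computes one sort key per element, instead of invoking `compare` for every pair during the sort.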

Good question, and good answer -- and you're ahead of everyone by a few steps, which is no wonder if you're in Poland :) . Anyhow, this is the second time I've seen issues with Python where it relies on underlying C libraries. Do you know where these could be brought up?

chryss 2010-08-05 09:44:18

I think it might be a problem with the libraries themselves, rather than Python. But as gnibbler pointed out, it happens to work on some OSes, so maybe this particular issue, at least, has been fixed at some point. OS X is famous for shipping an old gcc and so on, and the other OS I tested was Fedora 8, which is not quite contemporary itself. I would bring this up on one of the mailing lists for the underlying C libraries. Cheers mate :)

tkopczuk 2010-08-05 09:58:03

I agree. I made a Gist http://gist.github.com/509520 and will give it to a few people to try out. I *love* i18n, but the bugs make it tedious.

chryss 2010-08-05 10:34:33

Answer 3

A:

Just to add to tkopczuk's investigation: This is definitely a gcc bug, at least for version 4.2.1 on OS X 10.6.4. It can be reproduced by calling C strcoll() directly as in this snippet.

EDIT: Still on the same system, I find that the problem is there for the UTF-8 versions of de_DE, fr_FR and pl_PL, but for the ISO-8859-1 versions of fr_FR and de_DE the sort order is correct. Unfortunately for the OP, ISO-8859-2 pl_PL is also buggy:

The order for Polish ISO-8859 is:
LATIN SMALL LETTER A
LATIN SMALL LETTER Z
LATIN SMALL LETTER A WITH OGONEK
The LC_COLLATE culture and encoding settings were pl_PL, ISO8859-2.

The order for Polish Unicode is:
LATIN SMALL LETTER A
LATIN SMALL LETTER Z
LATIN SMALL LETTER A WITH OGONEK
The LC_COLLATE culture and encoding settings were pl_PL, UTF8.

The order for German Unicode is:
LATIN SMALL LETTER A
LATIN SMALL LETTER Z
LATIN SMALL LETTER A WITH DIAERESIS
The LC_COLLATE culture and encoding settings were de_DE, UTF8.

The order for German ISO-8859 is:
LATIN SMALL LETTER A
LATIN SMALL LETTER A WITH DIAERESIS
LATIN SMALL LETTER Z
The LC_COLLATE culture and encoding settings were de_DE, ISO8859-1.

The order for French ISO-8859 is:
LATIN SMALL LETTER A
LATIN SMALL LETTER E WITH ACUTE
LATIN SMALL LETTER Z
The LC_COLLATE culture and encoding settings were fr_FR, ISO8859-1.

The order for French Unicode is:
LATIN SMALL LETTER A
LATIN SMALL LETTER Z
LATIN SMALL LETTER E WITH ACUTE
The LC_COLLATE culture and encoding settings were fr_FR, UTF8.

chryss 2010-08-05 23:09:01
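
A report like the one above can be approximated from Python. This is only a sketch, not the original C snippet: the function name `collation_report` is hypothetical, and the locale names are assumptions based on the settings quoted in the listing. Their spelling varies by platform, and locales that are not installed are skipped.

```python
import locale
import unicodedata
from functools import cmp_to_key

# Assumed locale names; adjust the spelling for your platform.
CASES = [
    ("pl_PL.ISO8859-2", ["a", "z", "ą"]),
    ("pl_PL.UTF-8", ["a", "z", "ą"]),
    ("de_DE.ISO8859-1", ["a", "z", "ä"]),
    ("de_DE.UTF-8", ["a", "z", "ä"]),
    ("fr_FR.ISO8859-1", ["a", "z", "é"]),
    ("fr_FR.UTF-8", ["a", "z", "é"]),
]


def collation_report(cases=CASES):
    """Map each installed locale name to the Unicode names in strcoll order."""
    saved = locale.setlocale(locale.LC_COLLATE)
    report = {}
    try:
        for name, chars in cases:
            try:
                locale.setlocale(locale.LC_COLLATE, name)
            except locale.Error:
                continue  # locale not installed on this system
            ordered = sorted(chars, key=cmp_to_key(locale.strcoll))
            report[name] = [unicodedata.name(ch) for ch in ordered]
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)  # restore the global state
    return report


for name, names in collation_report().items():
    print("The order for", name, "is:")
    for char_name in names:
        print(char_name)
```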

Is it possible to decompile `/usr/share/locale/pl_PL.UTF-8/LC_COLLATE` to some sort of readable form? Might not be a gcc bug after all, but wrong collation tables, as @bobince pointed out.

tkopczuk 2010-08-06 07:17:31

Well, I get the same behaviour for German and French (i.e., characters with diacritics are sorted after "z"), so it's not just the Polish collation tables. I wonder whether it just picks the C locale, or maybe the default locale (mine is en_GB; is yours pl_PL?). In any event, it's clearly in the C library; whether in the data or in the code, I can't tell.

chryss 2010-08-06 08:09:31

Yup, mine is pl_PL. But it would be nice to check the collation tables and if they're kosher, then there's the problem with different locale settings being used by the library. But I guess it's the library, hence the problems on various OSes.

tkopczuk 2010-08-06 14:47:48

I don't know how the platform-specific collation tables are made, except that they're supposed to be derived from the Common Locale Data Repository http://cldr.unicode.org/ . The more I look into this, the more I think the C library is a very minimal way to account for locale anyway, and that you're better off using ICU for serious work. More testing added above: the de_DE and fr_FR ISO locales are OK, but pl_PL is also buggy for ISO.

chryss 2010-08-06 15:37:48
