I see the answers have already done an excellent job; I just wanted to point out one coding inefficiency in Human Sort. To apply a selective char-by-char translation to a Unicode string s, it uses the code:
spec_dict = {'Å':'A', 'Ä':'A'}
def spec_order(s):
    return ''.join([spec_dict.get(ch, ch) for ch in s])
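For context, here's a minimal sketch of how a key function like this is typically plugged into the sort itself (the sample names are made up purely for illustration):

names = ['Äbel', 'Ada', 'Åke']            # made-up sample data
print(sorted(names))                      # ['Ada', 'Äbel', 'Åke'] -- raw code points
print(sorted(names, key=spec_order))      # ['Äbel', 'Ada', 'Åke'] -- Å/Ä treated as A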
Python has a much better, faster and more concise way to perform this auxiliary task (on Unicode strings -- the analogous method for byte strings has a different and somewhat less helpful specification!-):
spec_dict = dict((ord(k), spec_dict[k]) for k in spec_dict)
def spec_order(s):
    return s.translate(spec_dict)
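A quick, purely illustrative check that the translate-based key does the same thing as the join-based one:

print(spec_order('Ångström'))     # 'Angström' -- Å is mapped, ö is left alone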
The dict you pass to the translate method has Unicode ordinals (not strings) as keys, which is why we need that rebuilding step from the original char-to-char spec_dict. (Values in the dict you pass to translate [as opposed to keys, which must be ordinals] can be Unicode ordinals, arbitrary Unicode strings, or None to remove the corresponding character as part of the translation, so it's easy to specify "ignore a certain character for sorting purposes", "map ä to ae for sorting purposes", and the like.)
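For instance, here's a small sketch (the particular mappings, and names like sort_table, are just illustrative) that uses all three kinds of values in one table:

sort_table = {
    ord('ä'): 'ae',      # map ä to ae for sorting purposes
    ord('å'): ord('a'),  # an ordinal is also fine as a value
    ord('-'): None,      # ignore hyphens for sorting purposes
}

def sort_key(s):
    return s.translate(sort_table)

print(sort_key('Smörgås-bord'))   # 'Smörgasbord' -- å mapped, hyphen dropped, ö untouched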
In Python 3, you can get the "rebuilding" step more simply, e.g.:
spec_dict = ''.maketrans(spec_dict)
See the docs for other ways you can use this maketrans static method in Python 3.
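For example, besides the single-dict form used above, maketrans also accepts two equal-length strings (a char-for-char mapping) plus an optional third string of characters to delete; the specific characters below are just illustrative:

table = str.maketrans({'Å': 'A', 'Ä': 'A'})    # dict form, as above
table = str.maketrans('ÅÄ', 'AA')              # two equal-length strings
table = str.maketrans('ÅÄ', 'AA', '-')         # third string: characters to remove

print('Åbc-Ädef'.translate(table))             # 'AbcAdef'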