views:

1502

answers:

7

Python sorts by byte value by default, which means é comes after z and other equally funny things. What is the best way to sort alphabetically in Python?

Is there a library for this? I couldn't find anything. Preferably, sorting should have language support, so that it understands that åäö should be sorted after z in Swedish but that ü should be sorted as u, etc. Unicode support is therefore pretty much a requirement.

If there is no library for it, what is the best way to do this? Just make a mapping from letter to an integer value and map the string to an integer list with that?

+4  A: 

Try James Tauber's Python Unicode Collation Algorithm. It may not do exactly as you want, but seems well worth a look. For a bit more information about the issues, see this post by Christopher Lenz.

Vinay Sajip
That at least fixes the generic issue. I guess language sensitive versions of the collation list could be created too.
Lennart Regebro
+1  A: 

Jeff Atwood wrote a good post on Natural Sort Order; in it he linked to a script which does pretty much what you ask.

It's not a trivial script, by any means, but it does the trick.
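
For reference, the core idea of natural sort order fits in a few lines. This is only a minimal Python 3 sketch, not the linked script, and `natural_key` is a name made up for the example. It splits each string into text and digit runs so that numbers compare by value. Note that it does nothing about accented letters, so it addresses only part of the question.

```python
import re

def natural_key(s):
    """Split s into text and number chunks so that 'z10' sorts after 'z9'."""
    # The capturing group makes re.split keep the digit runs;
    # they always land at the odd indexes of the result.
    parts = re.split(r"(\d+)", s)
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]

names = ["z10", "z9", "Z1", "a2"]
print(sorted(names, key=natural_key))  # ['a2', 'Z1', 'z9', 'z10']
```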

Simon Scarfe
A: 

To implement it, you will need to read about the Unicode collation algorithm; see http://en.wikipedia.org/wiki/Unicode_collation_algorithm

http://www.unicode.org/unicode/reports/tr10/

A sample implementation is here:

http://jtauber.com/blog/2006/01/27/python_unicode_collation_algorithm/

Anurag Uniyal
+8  A: 

IBM's ICU library does that (and a lot more). It has Python bindings: PyICU.
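
A minimal sketch of what that can look like, assuming Python 3 and that PyICU is installed (it is a third-party package and needs the ICU libraries). `sort_for_locale` is a helper name made up for this example; `Collator.createInstance`, `Locale` and `getSortKey` are the PyICU calls it relies on.

```python
try:
    import icu  # PyICU, third-party; assumed to be installed
except ImportError:
    icu = None

def sort_for_locale(words, locale_name):
    """Return words sorted by the ICU collation rules for locale_name."""
    if icu is None:
        raise RuntimeError("PyICU is not installed")
    collator = icu.Collator.createInstance(icu.Locale(locale_name))
    # getSortKey turns each string into a byte string that compares
    # the way the collator would compare the original strings.
    return sorted(words, key=collator.getSortKey)

if icu is not None:
    print(sort_for_locale(["Älg", "Art", "Zebra"], "sv_SE"))
```

With a Swedish collator, Älg is expected to come after Zebra, whereas a German collator ("de_DE") should sort Ä together with A.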

Rafał Dowgird
+3  A: 

I see the answers have already done an excellent job; I just wanted to point out one coding inefficiency in Human Sort. To apply a selective char-by-char translation to a Unicode string s, it uses the code:

# u'' literals so that the keys are Unicode characters on Python 2 as well
spec_dict = {u'Å': u'A', u'Ä': u'A'}

def spec_order(s):
    return ''.join([spec_dict.get(ch, ch) for ch in s])

Python has a much better, faster and more concise way to perform this auxiliary task (on Unicode strings -- the analogous method for byte strings has a different and somewhat less helpful specification!-):

spec_dict = dict((ord(k), spec_dict[k]) for k in spec_dict)

def spec_order(s):
    return s.translate(spec_dict)

The dict you pass to the translate method has Unicode ordinals (not strings) as keys, which is why we need that rebuilding step from the original char-to-char spec_dict. (Values in the dict you pass to translate [as opposed to keys, which must be ordinals] can be Unicode ordinals, arbitrary Unicode strings, or None to remove the corresponding character as part of the translation, so it's easy to specify "ignore a certain character for sorting purposes", "map ä to ae for sorting purposes", and the like).
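
To make the three kinds of values concrete, here is a small Python 3 sketch. The table contents are purely illustrative, not a real collation rule set, and `sort_table` and `sort_key` are names chosen for the example.

```python
# Keys must be Unicode ordinals; values may be an ordinal, a string, or None.
sort_table = {
    ord("Å"): ord("A"),  # ordinal -> ordinal: one character replaced by another
    ord("ä"): "ae",      # ordinal -> string: one character expanded to several
    ord("-"): None,      # ordinal -> None: character ignored for sorting
}

def sort_key(s):
    """Return a copy of s rewritten for comparison purposes only."""
    return s.translate(sort_table)

words = ["Bär", "Bad", "B-z", "Åka"]
print(sorted(words, key=sort_key))  # ['Åka', 'Bad', 'Bär', 'B-z']
```

Because the table is only used as a sort key, the original strings come back unchanged.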

In Python 3, you can get the "rebuilding" step more simply, e.g.:

spec_dict = ''.maketrans(spec_dict)

See the docs for other ways you can use this maketrans static method in Python 3.
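
For example, these are the three call forms of `str.maketrans` in Python 3; the sample characters are arbitrary and the variable names are just for illustration.

```python
# One argument: a dict whose keys are single characters (or ordinals).
from_dict = str.maketrans({"Å": "A", "Ä": "A"})

# Two arguments: equal-length strings, mapped position by position.
from_pairs = str.maketrans("ÅÄ", "AA")

# Three arguments: every character in the third string is mapped to None,
# so translate() deletes it.
with_deletes = str.maketrans("ÅÄ", "AA", "-'")

print("Ärt-Åsa".translate(with_deletes))  # ArtAsa
```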

Alex Martelli
+1  A: 

It is far from a complete solution for your use case, but you could take a look at the unaccent.py script from effbot.org. What it basically does is remove all accents from a text. You can use that 'sanitized' text to sort alphabetically. (For a better description see this page.)
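
The same idea can be sketched with only the standard library, assuming Python 3. This is not the effbot script, just an approximation built on `unicodedata`; `strip_accents` is a name picked for the example.

```python
import unicodedata

def strip_accents(s):
    """Decompose s and drop the combining marks (accents, umlauts, rings)."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

words = ["zebra", "éclair", "apple", "Über"]
print(sorted(words, key=lambda w: strip_accents(w).lower()))
# ['apple', 'éclair', 'Über', 'zebra']
```

Letters without a decomposition (ø or ł, for instance) pass through unchanged, and the result is wrong for languages such as Swedish, where å, ä and ö are letters in their own right that sort after z.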

Mark van Lent
+3  A: 

I don't see this in the answers. My application sorts according to the locale using Python's standard library. It is pretty easy.

# python2.5 code below
# corpus is our unicode() strings collection as a list
corpus = [u"Art", u"Älg", u"Ved", u"Wasa"]

import locale
# this reads the environment and inits the right locale
locale.setlocale(locale.LC_ALL, "")
# alternatively, (but it's bad to hardcode)
# locale.setlocale(locale.LC_ALL, "sv_SE.UTF-8")

corpus.sort(cmp=locale.strcoll)

# in python2.x, locale.strxfrm is broken and does not work for unicode strings
# in python3.x however:
# corpus.sort(key=locale.strxfrm)
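
For Python 3, where `sort()` no longer accepts `cmp`, the same thing can be sketched as follows. This assumes the environment names an installed locale; if it does not, the sketch simply stays in the default "C" locale, which orders by code point.

```python
import functools
import locale

corpus = ["Art", "Älg", "Ved", "Wasa"]

try:
    # "" means: use whatever locale the environment (LANG, LC_ALL, ...) names.
    locale.setlocale(locale.LC_ALL, "")
except locale.Error:
    # The environment names a locale that is not installed here;
    # keep the current locale instead of failing.
    pass

# Either form works on Python 3 str objects.
by_strxfrm = sorted(corpus, key=locale.strxfrm)
by_strcoll = sorted(corpus, key=functools.cmp_to_key(locale.strcoll))
print(by_strxfrm)
```

The resulting order depends entirely on the active locale, so the output differs from machine to machine.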


Question to Lennart and other answerers: Doesn't anyone know 'locale' or is it not up to this task?

kaizer.se
I don't know, but it seems worth a try.
Lennart Regebro
By the way 1) I don't think locale.strxfrm is broken for UTF-8 encoded `str`; I benchmarked my application and concluded that using cmp=strcoll on unicode objects is cheaper than encoding everything to UTF-8 and using key=strxfrm.
kaizer.se
By the way 2) The locale module will only work with the locales generated on your system (for a Linux box), not any arbitrary locale. "locale -a" will tell you which ones are available.
kaizer.se
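
On the point about generated locales: besides running "locale -a" in a shell, a program can probe for a locale itself. A minimal Python 3 sketch, where `locale_is_available` is a hypothetical helper name and the candidate locale names are only examples:

```python
import locale

def locale_is_available(name):
    """Return True if setlocale accepts name on this machine."""
    saved = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, name)
    except locale.Error:
        return False
    finally:
        # Restore the previous setting whether or not the probe succeeded.
        locale.setlocale(locale.LC_COLLATE, saved)
    return True

for candidate in ("sv_SE.UTF-8", "de_DE.UTF-8", "C"):
    print(candidate, locale_is_available(candidate))
```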