tags:

views:

203

answers:

2

Hello,

I'm running Python 2.5 (r25:51908, Sep 19 2006, 09:52:17) [MSC v.1310 32 bit (Intel)] on win 32

When I'm asking Python

>>> "u11-Phrase 099.wav" <  "u11-Phrase 1000.wav"
True

That's fine. When I ask

>>> "u11-Phrase 100.wav" <  "u11-Phrase 1000.wav"
True

That's fine, too. But when I ask

>>> "u11-Phrase 101.wav" <  "u11-Phrase 1000.wav"
False

So according Python "u11-Phrase 100.wav" comes before "u11-Phrase 1000.wav" but "u11-Phrase 101.wav" comes after "u11-Phrase 1000.wav"! And this is problematic for me because I'm trying to write a file renaming program and this kind of sorting breaks the functionality.

What can I do to overcome this? Should I write my own cmp function and test for edge cases or is there a much simpler shortcut to give me the ordering I want?

On the other hand if I modify the strings such as

>>> "u11-Phrase 0101.wav" <  "u11-Phrase 1000.wav"
True

However those strings come from the file listing of directory such as:

files = glob.glob('*.wav')
files.sort()
for file in files:
    ...

So I'd rather not do surgical operations on the strings after they have been created by glob. And no, I don't want to change the original filenames in that folder, too.

Any hints?

+15  A: 

You are looking for human sorting.

The reason 101.wav is not less than 1000.wav is that computers (not just Python) sort strings character by character, and the first difference between these two strings is where the first string has a '1' and the second string has a '0'. '1' is not less than '0', so the strings compare as you have seen.

People naturally parse those strings into their components, and interpret the numbers numerically, not lexically. The code I linked to above will do that same sort of parsing.

Ned Batchelder
Also called natural sorting: http://www.codinghorror.com/blog/archives/001018.html
John Kugelman
+1. Cool link, I was just about to answer the question. No need for this now.
Boldewyn
+8  A: 

You need to construct a proper sort key for each filename. Something like this should do what you want:

import re

def k(s):
    return [w.isdigit() and int(w) or w for w in re.split(r'(\d+)', s)]

files = ["u11-Phrase 099.wav", "u11-Phrase 1000.wav", "u11-Phrase 100.wav"]

print files
print sorted(files, key=k)

It gives this output:

['u11-Phrase 099.wav', 'u11-Phrase 1000.wav', 'u11-Phrase 100.wav']
['u11-Phrase 099.wav', 'u11-Phrase 100.wav', 'u11-Phrase 1000.wav']

The k function will split apart the filenames on sequences of digits and (more importantly) turn those sequences into integers:

>>> k('u11-Phrase 099.wav')
['u', 11, '-Phrase ', 99, '.wav']

We then use the fact that Python knows how to sort lists --- it sorts the lists by comparing each element one by one. The end result is that

>>> k('u11-Phrase 99.wav') < k('u11-Phrase 100.wav')
True

whereas

>>> 'u11-Phrase 99.wav' < 'u11-Phrase 100.wav'
False

as you've already found out.

Martin Geisler
Thank you very much. This does what I expect and I can use it like for file in sorted(files, key=k): print file ...
Emre Sevinç
I wonder whether this Natural / Human sort ordering will ever be part of any standard lib in any programming language...
Emre Sevinç
@Emre, probably not, because it would be very hard to find something that applies generally. For example, it's just as likely the "u11" part of the input should be treated as a "u" and integer 11 as it is that it should be treated as a simple string of 3 chars. Usually this is entirely context-sensitive. And in Python, at least, writing a custom sort key function is too easy to bother making just the one case a standard library routine.
Peter Hansen
@Emre: I'm glad you like it! The `sorted` built-in is very handy for these things. Btw, you should probably give the function a better name than `k` in your script :-) Perhaps `natural_sort_key` or something like that.
Martin Geisler