views:

1424

answers:

13

Sorting a list of strings in a way that makes sense to a human is a very complex task. It's not just about comparing ASCII values. Usually, the case doesn't matter. You probably want "File 2" to be sorted before "File 11". In German, 'Ä' often comes at the beginning of the alphabet, whereas in Swedish it's towards the end. And what about encoding? The list of pitfalls is long.

So, fellow developer: what do you have to say about string sorting for human consumption? What are good techniques? What should be avoided? What are the correct APIs for your favorite platform?

Post your answers, and make this page the definitive reference about sorting strings for humans.

+1  A: 

Use StrCmpLogicalW on Win32.

Discussion here: http://blogs.msdn.com/michkap/archive/2005/01/05/346933.aspx

Nick
+2  A: 

Get a start by reading this article from Jeff Atwood:
http://www.codinghorror.com/blog/archives/001018.html

Mark Ransom
And read the comments. They bring up a lot of good points to think about when deciding what's best for your application. Clearly, many of Jeff's readers are smarter than he.
JasonFruit
+1  A: 

As you yourself have pointed out there's no magic bullet here - you need to be context-appropriate. However some heuristics could be implemented - for the example, you give with file numbers, you could strip off the number from the end and compare them numerically if the non-numeric parts are equal.

And of course all string should be lowercased for the purposes of comparison.

levik
A: 

Just a MySQL sorting trick:

To sort strings: 1, 10, 7a, 80, 8b in a more human-friendly manner:

ORDER BY CAST(room_number AS UNSIGNED), room_number;

Which will sort them: 1, 7a, 8b, 10, 80

micahwittman
+6  A: 

See NatSort. Martin Pool Lists a number of other ports.

POSIX sort(1) has the -n option to sort numbers, but this doesn't work if there is a non-numeric prefix.

GNU ls(1) has the --sort=version option, which works the same way.

The PHP scripting language now has a strnatcmp function based on this code. The PHP wrapper was done by Andrei Zimievsky.

Stuart Cheshire has a Macintosh "system extension" to do natural ordering. I indepdendently reinvented the algorithm, but Stuart had it first. I borrowed the term "natural sort" from him.

Sort::Versions in Perl. "The code has some special magic to deal with common conventions in program version numbers, like the difference between 'decimal' versions (eg perl 5.005) and the Unix kind (eg perl 5.6.1)."

Sort::Naturally is also in Perl, by Sean M. Burke. It uses locale-sensitive character classes to sort words and numeric substrings in a way similar to natsort.

Ed Avis wrote something similar in Haskell.

Pierre-Luc Paour wrote a NaturalOrderComparator in Java

Kristof Coomans wrote a natural sort comparison in Javascript

Alan Davies wrote natcmp.rb, an implementation in Ruby.

Numacomp - similar thing in Python.

sethbc
Numacomp is a C extension module. Python implementation of natural sort http://stackoverflow.com/questions/34518/natural-sorting-algorithm/341745#341745
J.F. Sebastian
+1  A: 

You should also read Joel on Software's post: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Adam Tegen
A lot of people think that a string is just a string, regardless of language. Character encoding is a topic usually overlooked until a problem is encountered, when in fact it is one of the basics of computer science!
Liam
A: 

For what it's worth, this is sometimes referred to as string collation.

Adam Tegen
+3  A: 

This is OT, but just in case someone is interested - in Czech, the pair of letters ch is between letters h and i in the alphabet.

  • aaaa
  • bbbb
  • cccc
  • dddd
  • ...
  • chachacha -
  • iiii
  • jjj

Thus, in some languages, not only letters alone need to be examined for sorting purposes, but also their position relative to other letters.

petr k.
Hungarian has a number of letters that are multi-letter combinations as well - so you're right, single characters aren't the only way to sort.
sethbc
+2  A: 

Some others related questions from StackOverflow:

  • Marius: Natural Sorting Algorithm

    How do you sort an array of strings naturally in different programming languages? Post your implementation and what language it is in in the answer.

  • Keith: Natural (human alpha-numeric) sort in Microsoft SQL 2005

    We have a large database on which we have DB side pagination. This is quick, returning a page of 50 rows from millions of records in a small fraction of a second. Users can define their own sort, basically choosing what column to sort by. Columns are dynamic - some have numeric values, some dates and some text.[...] Does anyone know a way to quickly apply a natural sort in Sql server?

  • sgehrig: How to sort an array of UTF-8 strings?

    I currentyl have no clue on how to sort an array which contains UTF-8 encoded strings in PHP. [...] Is there a way to sort an array with UTF-8 strings locale aware?

Carl Seleborg
+1  A: 

To further complicate matters have a look at String based dates. For example, you and I both know that "30-11-2008.jpg" comes before "01-12-2008.jpg", but your natural sorting algorithm of choice probably doesn't. This is an excellent example of a context where special sauce is required. Think of the myriad of date formats and languages that you'd need to support to get this working properly!

ninesided
If you're not from the us that is. 30-11-2008 is not a valid date, 11-30-2008 is (when really, you would need to put the date as 2008-11-30 for it to have a chance at sorting it correctly)
FryGuy
haha let's not get into the *rediculous* date format the US uses, this just further shows how problematic sorting date strings can be!
ninesided
+3  A: 

The Unicode Collation Algorithm covers all issues related to localized sorting. Unicode also provides the rules needed to implement sorting for widely used locales. Of course, support for localized sorting is provided as part of Unicode support in some platforms.

erickson
+1  A: 

Others have already linked to APIs that can be used for natural string sorting, but no-one apparently mentioned the possibility of computing a sort key from the string first.

A sort key is a representation of the original string that can be sorted lexicographically. The documentation for the LCMapStringEx function should provide more information.

avakar