I have a sample file containing the characters "aA0_- ", one per line. Sorting it with GNU sort gives the following order:

$ cat /tmp/sample | sort

_
-
0
a
A

After appending another character to each line, we obtain a different order (non-alphanumeric characters seem to have lower priority):

$ cat /tmp/sample | sed 's/$/x/' | sort
0x
ax
Ax
 x
_x
-x

whereas when we insert this character at the beginning of each line, we get the original sort order back:

$ cat /tmp/sample | sed 's/^/x/' | sort
x 
x_
x-
x0
xa
xA

.. what is the explanation for this behavior?

UPDATE

When the 'z' and 'Z' characters are included in the sample, the result seems stranger still:

$ cat /tmp/sample | sed 's/$/x/' | sort
0x
ax
Ax
 x
_x
-x
zx
Zx

.. but in light of the correct answer, this happens because ' ', '_' and '-' are all treated as whitespace in the current locale (en_US.UTF-8) and are ignored when sorting.
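For comparison, here is a minimal sketch that forces plain byte-order comparison by setting LC_ALL=C for the sort command only. The sample is recreated inline with printf (an assumption, standing in for /tmp/sample). In the C locale no character is ignored, so the lines come out in order of their byte values:

```shell
# Assumption: the six sample characters are generated inline instead of
# being read from /tmp/sample.
sorted=$(printf '%s\n' a A 0 _ - ' ' | sed 's/$/x/' | LC_ALL=C sort)

# In the C locale, lines are compared byte by byte:
# space (0x20) < '-' (0x2d) < '0' (0x30) < 'A' (0x41) < '_' (0x5f) < 'a' (0x61)
printf '%s\n' "$sorted"
```

Setting the variable on the command line this way affects only that one sort invocation and leaves the rest of the session in its usual locale.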

+3  A: 

Your locale definition should contain an LC_COLLATE section, which determines the sort order of characters. Also check the LC_CTYPE definition and which characters it classifies as 'space'.

If '-' and '_' are classified as space, you might get the results you have shown.
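One way to see which locale settings are in effect is the locale utility. This is a small sketch, assuming a system that ships that command (glibc-based systems do):

```shell
# Print the collation and character-classification settings currently in effect.
locale | grep -E '^(LANG|LC_ALL|LC_COLLATE|LC_CTYPE)='
```

If LC_ALL is set, it overrides both LC_COLLATE and LC_CTYPE, so that is the first value to look at.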

Sanjay Manohar