Hi,

I'm seeing something strange with 'sort' in Red Hat Enterprise Linux 5 x86_64 and in Ubuntu 9.1. I'm using bash.

First, here's what I think is right to expect from sort using dictionary order:

[stauffer@unix-m sortTrouble]$ cat st1
1230
123
100
11
10
1
123
1230
100

[stauffer@unix-m sortTrouble]$ sort st1
1
10
100
100
11
123
123
1230
1230

[stauffer@unix-m sortTrouble]$

Now here's what happens when there's a second column (tab-delimited, even though it looks messy here):

[stauffer@unix-m sortTrouble]$ cat st2
1230 1
123 1
100 1
11 1
10 1
1 1
123 1
1230 1
100 1

[stauffer@unix-m sortTrouble]$ sort st2
100 1
100 1
10 1
1 1
11 1
1230 1
1230 1
123 1
123 1

Notice how the sort order for column 1 is different now. '11' gets put correctly after '1', but '10' and '100' do not. Similarly for '1230'. It seems like zero causes trouble.

This behavior is inconsistent, and it causes problems when using 'join' because it expects dictionary sorting.

On Mac OSX 10.5, the st2 file sorts like st1 in the first column.

Am I missing something, or is this a bug?

Thanks, Michael

+4  A: 

From the man page:

   -b, --ignore-leading-blanks
          ignore leading blanks

   -g, --general-numeric-sort
          compare according to general numerical value

   -n, --numeric-sort
          compare according to string numerical value

Example:

andrey@localhost:~/gamess$ echo -e "1\n2\n10" | sort
1
10
2
andrey@localhost:~/gamess$ echo -e "1\n2\n10" | sort -g
1
2
10
aaa
True, but how is this relevant? He said at the start that it gives (and he expects) the 1-10-2 order in the one-column case. The difference he's asking about is when there's a second column present. Also he notes that Mac OS X 10.5 (which uses GNU sort) uses the same ordering when two columns are present, but RHEL doesn't.
Ken
@Ken: I think it's actually that RHEL and Ubuntu use GNU sort and OS X uses a BSD version.
Dennis Williamson
Dennis: I'm using Mac OS X 10.5 here right now and `/usr/bin/sort --version` reports `sort (GNU coreutils) 5.93 Copyright (C) 2005 Free Software Foundation, Inc.`.
Ken
@Ken: Precisely. The issue is that when two columns are present, the sort order changes. With two columns, the zero in '10' is sorted before the space in '1', whereas the one in '11' is sorted after the space in '1'.
michael
+2  A: 

The sort can be performed the way you want by restricting the key to the column you're interested in:

sort -k1,1 inputfile
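
As a minimal sketch of the command above, the following recreates the question's tab-delimited `st2` file in a scratch directory (the `workdir` and `sorted_by_key` names are just for illustration) and sorts it on the first field only:

```shell
# Recreate the tab-delimited st2 file from the question in a scratch directory.
workdir=$(mktemp -d)
printf '%s\t1\n' 1230 123 100 11 10 1 123 1230 100 > "$workdir/st2"

# Restrict the key to field 1, so the tab and the second column
# take no part in the primary comparison.
sorted_by_key=$(sort -k1,1 "$workdir/st2")
printf '%s\n' "$sorted_by_key"
```

The first column should come out in the same 1, 10, 100, 100, 11, 123, 123, 1230, 1230 order as the single-column `st1` case.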
Dennis Williamson
Yes! This works, thanks. But I don't understand why it doesn't work without this. Without keys specified, it defaults to the entire line as a key. So in the case of one "column", e.g. {100, 10, 1}, "100" is sorted after "10", and "10" is after "1", meaning "0" is sorted after the space char. But with two (space- or tab-delimited) "columns", e.g. {100 1, 10 1}, "100 1" is sorted before "10 1", meaning "0" is sorted before space when each line is treated as a single key. I'll check locale settings some more. I tried setting LC_ALL=C like the docs suggest, but that didn't change anything.
michael
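One possible reading of the puzzle, which is an assumption here and not something the thread confirms: in locales such as en_US.UTF-8 on glibc systems of that era, the collation rules may ignore blanks on the first comparison pass, so `100<TAB>1` is compared roughly as `1001` and `10<TAB>1` as `101`. The sketch below only imitates that idea by deleting the tabs and sorting the remaining bytes; it does not reproduce the real collation algorithm, and `squeezed` is an illustrative name.

```shell
# Imitate a collation that ignores blanks: strip the tabs from the
# question's st2 data, then sort the remaining characters bytewise.
squeezed=$(printf '%s\t1\n' 1230 123 100 11 10 1 123 1230 100 \
  | tr -d '\t' \
  | LC_ALL=C sort)
printf '%s\n' "$squeezed"
```

Putting the separator back into each line gives the same order as the `sort st2` output in the question (100, 100, 10, 1, 11, 1230, 1230, 123, 123), which is at least consistent with the blanks being skipped and not with "0" sorting before a space as such.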
`LANG=C` and `LC_ALL=C` both worked for me: `LC_ALL=C sort inputfile` (all on one line). Whether "0" sorts before a space depends on the locale.
Dennis Williamson
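
A short sketch of the locale approach, assuming the same tab-delimited data as in the question. The second file, `other`, is hypothetical and exists only to show `join`, which the question mentions but does not demonstrate. Running `sort` and `join` under the same `LC_ALL=C` setting keeps both tools agreeing on the order.

```shell
# Recreate st2 from the question, plus a hypothetical file to join against.
workdir=$(mktemp -d)
printf '%s\t1\n' 1230 123 100 11 10 1 123 1230 100 > "$workdir/st2"
printf '%s\tx\n' 10 1 100 > "$workdir/other"

# In the C locale, lines are compared byte by byte; tab (0x09) sorts
# before '0' (0x30), so "1<TAB>1" comes before "10<TAB>1".
c_sorted=$(LC_ALL=C sort "$workdir/st2")
printf '%s\n' "$c_sorted"

# Sort both inputs and run join under the same locale.
LC_ALL=C sort "$workdir/st2" > "$workdir/st2.sorted"
LC_ALL=C sort "$workdir/other" > "$workdir/other.sorted"
joined=$(LC_ALL=C join "$workdir/st2.sorted" "$workdir/other.sorted")
printf '%s\n' "$joined"
```

If the variable is set with `export LC_ALL=C` in the shell instead, it applies to every later command in that session, which may or may not be what you want.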