The following code

#!/usr/bin/perl

use strict;
use warnings;

my $s1 = '[email protected]';
my $s2 = '[email protected]';
my $s3 = 'aaa2000';
my $s4 = 'aaa_2000';

no locale;

print "\nNO Locale:\n\n";

if ($s1 gt $s2) {print "$s1 is > $s2\n";}
if ($s1 lt $s2) {print "$s1 is < $s2\n";}
if ($s1 eq $s2) {print "$s1 is = $s2\n";}

if ($s3 gt $s4) {print "$s3 is > $s4\n";}
if ($s3 lt $s4) {print "$s3 is < $s4\n";}
if ($s3 eq $s4) {print "$s3 is = $s4\n";}

use locale;

print "\nWith 'use locale;':\n\n";

if ($s1 gt $s2) {print "$s1 is > $s2\n";}
if ($s1 lt $s2) {print "$s1 is < $s2\n";}
if ($s1 eq $s2) {print "$s1 is = $s2\n";}

if ($s3 gt $s4) {print "$s3 is > $s4\n";}
if ($s3 lt $s4) {print "$s3 is < $s4\n";}
if ($s3 eq $s4) {print "$s3 is = $s4\n";}

prints out

NO Locale:

[email protected] is < [email protected]
aaa2000 is < aaa_2000

With 'use locale;':

[email protected] is > [email protected]
aaa2000 is < aaa_2000

which I cannot follow: under use locale, how can aaa2000 < aaa_2000 hold at the same time as [email protected] > [email protected]?

Am I missing something more or less obvious, or is this a bug? Can others confirm that they see the same behavior?

My locale:

$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

Thanks in advance.

A: 

My results:

$ locale
LANG=C.UTF-8
LC_CTYPE="C.UTF-8"
LC_NUMERIC="C.UTF-8"
LC_TIME="C.UTF-8"
LC_COLLATE="C.UTF-8"
LC_MONETARY="C.UTF-8"
LC_MESSAGES="C.UTF-8"
LC_ALL=

$ perl -v

This is perl, v5.10.1 (*) built for i686-cygwin-thread-multi-64int
(with 12 registered patches, see perl -V for more detail)

Copyright 1987-2009, Larry Wall

Perl may be copied only under the terms of either the Artistic License or the
GNU General Public License, which may be found in the Perl 5 source kit.

Complete documentation for Perl, including FAQ lists, should be found on
this system using "man perl" or "perldoc perl".  If you have access to the
Internet, point your browser at http://www.perl.org/, the Perl Home Page.

$ perl locale.pl

NO Locale:

[email protected] is < [email protected]
aaa2000 is < aaa_2000

With 'use locale;':

[email protected] is < [email protected]
aaa2000 is < aaa_2000

So with my setup the results are consistent.

CanSpice
Interesting, thanks. My perl -v: This is perl, v5.10.1 (*) built for x86_64-linux-thread-multi
Krambambuli
That's because you used the `C` locale instead of `en_US`.
cjm
+2  A: 

I get the same results on my 32-bit Linux system with the en_US.utf8 locale. It's not a Perl bug, as illustrated by this C program:

#include <locale.h>
#include <string.h>
#include <stdio.h>

void transformed(const char* str)
{
  char dest[256];
  const char* c;

  strxfrm(dest, str, sizeof(dest));
  printf("%18s =", str);
  /* cast so bytes >= 0x80 are not sign-extended when char is signed */
  for (c = dest; *c; ++c) printf(" %02x", (unsigned char)*c);
  puts("");
} /* end transformed */

void test_strings(const char* s1, const char* s2)
{
  int c = strcoll(s1, s2);

  printf("%s is %s %s\n", s1, ((c < 0) ? "<" : ((c == 0) ? "=" : ">")), s2);
} /* end test_strings */

int main(int argc, char* argv[])
{
  puts("with C locale:");

  test_strings("[email protected]", "[email protected]");
  test_strings("aaa2000", "aaa_2000");

  setlocale(LC_ALL, "");
  puts("\nwith your locale:");

  test_strings("[email protected]", "[email protected]");
  test_strings("aaa2000", "aaa_2000");
  puts("");
  transformed("[email protected]");
  transformed("[email protected]");
  transformed("aaa2000");
  transformed("aaa_2000");
  return 0;
} /* end main */

With LANG=en_US.utf8, it generates:

with C locale:
[email protected] is < [email protected]
aaa2000 is < aaa_2000

with your locale:
[email protected] is > [email protected]
aaa2000 is < aaa_2000

 [email protected] = 0c 0c 0c 04 02 02 02 24 0c 13 1a 1a 0e 1a 18 01 08 08 08 08 08 08 08 08 08 08 08 08 08 08 08 01 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 01 08 5d 06 44
[email protected] = 0c 0c 0c 04 02 02 02 24 0c 13 1a 1a 0e 1a 18 01 08 08 08 08 08 08 08 08 08 08 08 08 08 08 08 01 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 01 04 36 05 5d 06 44
           aaa2000 = 0c 0c 0c 04 02 02 02 01 08 08 08 08 08 08 08 01 02 02 02 02 02 02 02
          aaa_2000 = 0c 0c 0c 04 02 02 02 01 08 08 08 08 08 08 08 01 02 02 02 02 02 02 02 01 04 36

The strxfrm function (which you can access in Perl through the POSIX module) produces a transformed string that encodes the collation order. When you compare two such transformed strings byte for byte, the string with the smaller byte at the first position where they differ comes first in the collation order.

I'm not sure if this is a bug or not. I can't seem to find any documentation on how the en_US collation order is supposed to work. If it is a bug, it's in your C library or locale database.

cjm
Sounds like a bug, possibly an intentional one knowing the glibc developers...
R..
I suspect that the issue above is related to the following one: given a simple file containing 2 records with 2 TAB-separated fields, like 'a_2 2/a2 1', a command like 'sort -k 1 file | cut -f 1' displays the opposite order from the one produced by the same sort on a file that lacks the second field.
Krambambuli