I have encountered some strange Perl behavior: using a POSIX character class in a regexp completely alters the sort order for the resulting strings.

Here is my test program:

sub namecmp($a,$b) {
  $a=~/([:alpha:]*)/;
  # $a=~/([a-z]*)/;
  $aword= $1;

  $b=~/([:alpha:]*)/;
  # $b=~/([a-z]*)/;
  $bword= $1;
  return $aword cmp $bword;
};

$_= <>;
@names= sort namecmp split;
print join(" ", @names), "\n";

If you change to the commented-out regexps using [a-z], you get the normal, lexicographic sort order. However, the POSIX [:alpha:] character class yields some weird-ass sort order, as follows:

$test_normal
aaa aab aac aba abb abc aca acb acc baa bab bac bba bbb bbc bca bcb bcc caa cbb
aaa aab aac aba abb abc aca acb acc baa bab bac bba bbb bbc bca bcb bcc caa cbb

$test_posix
aaa aab aac aba abb abc aca acb acc baa bab bac bba bbb bbc bca bcb bcc caa cbb
baa bab bac bba bbb bbc bca bcb bcc caa cbb aba abb abc aca acb acc aab aac aaa

My best guess is that the POSIX character class is activating some kind of locale stuff I've never heard of and didn't ask for. I suppose the logical reaction to "doctor, doctor, it hurts when I do this!" is, "well, don't do that, then!".

But, can anyone tell me what's happening here, and why? I'm using perl 5.10, but I believe it also works under perl 5.8.

+9  A: 

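The character class [:alpha:] represents alphabetic characters in Perl regular expressions, but its square brackets are part of the class name; they do not open and close a bracketed character class the way square brackets normally do. The class name has to go inside another pair of brackets, so you need: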

$a=~/([[:alpha:]]*)/;

This is mentioned in perlre:

The POSIX character class syntax

[:class:]

is also available. Note that the [ and ] brackets are literal; they must always be used within a character class expression.

# this is correct:
$string =~ /[[:alpha:]]/;

# this is not, and will generate a warning:
$string =~ /[:alpha:]/;
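
As a rough sketch of what "inside a character class expression" means in practice: the POSIX name can sit next to other members of the same bracketed class, and the class can be negated as usual. The sub names below are made up for illustration and are not from perlre.

```perl
use strict;
use warnings;

# The outer [...] is the bracketed character class; [:alpha:] and [:digit:]
# are POSIX class names that are only recognised inside it.

# Hypothetical helper: true if $s looks like a simple identifier.
sub is_identifier {
    my ($s) = @_;
    return $s =~ /\A[[:alpha:]_][[:alpha:][:digit:]_]*\z/ ? 1 : 0;
}

# Hypothetical helper: drop everything that is not a letter.
sub strip_non_alpha {
    my ($s) = @_;
    (my $copy = $s) =~ s/[^[:alpha:]]//g;
    return $copy;
}

print is_identifier("foo_1"), "\n";      # 1
print is_identifier("1foo"), "\n";       # 0
print strip_non_alpha("a1-b2"), "\n";    # ab
```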
Greg Hewgill
+6  A: 

Because Perl doesn't support POSIX character classes in this form. (Use [[:alpha:]]. See @Greg's answer)

So

[:alpha:]

is interpreted as a character class consisting of the characters "a", "h", "l", "p" and ":".

Now, for strings that do not start with one of the characters in [ahlp:], e.g. "baa", the match will return an empty string (because of the *, which happily matches zero characters). An empty string is of course smaller than any other string, so those names are arranged at the beginning.
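
To make that concrete, here is a small sketch that prints the key the comparator ends up sorting on. It spells the set out as [ahlp:] instead of writing the bare [:alpha:], which is the same set of characters but does not trigger a warning; broken_key is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper: the key the question's comparator actually sorts on.
# [ahlp:] is the set of characters that a bare [:alpha:] denotes when it is
# not nested inside another pair of brackets.
sub broken_key {
    my ($name) = @_;
    my ($key) = $name =~ /([ahlp:]*)/;
    return $key;
}

for my $name (qw(aaa aab aba baa cbb)) {
    printf "%-3s => '%s'\n", $name, broken_key($name);
}
# aaa => 'aaa'
# aab => 'aa'
# aba => 'a'
# baa => ''
# cbb => ''
```

Sorting on those keys puts '' first, then 'a', then 'aa', then 'aaa', which is the grouping seen in the question's output. Names with the same key compare as equal, so the comparator says nothing about their order relative to each other.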

KennyTM
+6  A: 

What you are writing is not Perl by any stretch of the imagination. You are able to get away with it because you have turned off warnings. If you had used warnings, perl would have told you

POSIX syntax [: :] belongs inside character classes in regex; marked by <-- HERE in m/([:alpha:] <-- HERE *)/ at j.pl line 4.

POSIX syntax [: :] belongs inside character classes in regex; marked by <-- HERE in m/([:alpha:] <-- HERE *)/ at j.pl line 8.

Imagine that!
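
If you want to see the difference without editing your script, something along these lines should do it. This is only a sketch: regex_warnings is a made-up helper, and it assumes the pattern is compiled at run time so that a __WARN__ handler can collect whatever perl says about it.

```perl
use strict;
use warnings;

# Hypothetical helper: compile a pattern at run time and return any
# warnings perl emits while doing so.
sub regex_warnings {
    my ($pattern) = @_;
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, @_ };
    my $re = eval { qr/$pattern/ };
    return @warnings;
}

# The bare form draws the "POSIX syntax [: :] belongs inside character
# classes" warning; the nested form compiles quietly.
print "bare:   $_" for regex_warnings('([:alpha:]*)');
print "nested: $_" for regex_warnings('([[:alpha:]]*)');
```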

Now, perl would have also told you:

Illegal character in prototype for main::namecmp : $a,$b at j.pl line 3.

because Perl is not C. Perl does not have function prototypes of the sort you seem to be trying to use.

A better way of writing the exact same functionality, in Perl this time, is:

use warnings; use strict;

sub namecmp {
    my ($aword) = $a =~ /([[:alpha:]]*)/;
    my ($bword) = $b =~ /([[:alpha:]]*)/;
    return $aword cmp $bword;
}

print join(' ', sort namecmp split ' ', scalar <>), "\n";
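
If you would rather have the comparator receive its operands as arguments instead of reading the package variables $a and $b, sort does honour one specific prototype, ($$), and in that case passes the two elements in @_. A minimal sketch, with by_leading_word as a made-up name:

```perl
use strict;
use warnings;

# Sketch: a comparator declared with the ($$) prototype gets its two
# operands in @_ instead of in $a and $b.
sub by_leading_word($$) {
    my ($left, $right) = @_;
    my ($lword) = $left  =~ /([[:alpha:]]*)/;
    my ($rword) = $right =~ /([[:alpha:]]*)/;
    return $lword cmp $rword;
}

my @sorted = sort by_leading_word qw(cbb aab baa aaa);
print "@sorted\n";    # aaa aab baa cbb
```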
Sinan Ünür
Meh, it mostly worked. On actually reading the docs, it's surprising my "prototype" worked at all. Although, I have to take issue with your initial assertion: what I wrote *was* perl, by definition, because it was accepted and run without complaint.
comingstorm