views:

613

answers:

3

This is really a double question, my two end goals having answers to:

  • What is the standard string comparison order, in terms of the mechanics?
  • What's a better name for that so I can update the docs?

Perl's documentation for sort says that without a block, sort uses "standard string comparison order". But what is that order? There should be a better name for it. For this question, I specifically mean the situation where locale is not in effect, since that defines its own order.

In years past, we normally called the standard sort order "ASCIIbetically". It's in Learning Perl and many other books. However, that term is dated. Perl has been Unicode-aware since 5.6. Talking about ASCII is old school. Since Perl is also Unicode-aware, it knows about character strings. In sv.c, Perl_sv_cmp knows about locale, bytes, and UTF-8. The first two are easy. But I'm not confident about the third.

/*
=for apidoc sv_cmp

Compares the strings in two SVs.  Returns -1, 0, or 1 indicating whether the
string in C<sv1> is less than, equal to, or greater than the string in
C<sv2>. Is UTF-8 and 'use bytes' aware, handles get magic, and will
coerce its args to strings if necessary.  See also C<sv_cmp_locale>.

=cut
*/

When Perl sorts using UTF-8, what is it really sorting? The bytes the string encodes to, the characters it represents (including marks maybe?), or something else? I think this is the relevant line in sv.c (line 6698 for commit 7844ec1):

 pv1 = tpv = (char*)bytes_to_utf8((const U8*)pv1, &cur1);

If I'm reading that right (using my rusty C), pv1 is coerced to octets, turned into UTF-8, then coerced to characters (in the C sense). I think that means it's sorting by the UTF-8 encoding (i.e. the actual bytes that UTF-8 uses to represent a code point). Another way to say that is that it doesn't sort on graphemes. I think I've almost convinced myself I'm reading this right, but some of you know way more about this than I do.

From that, the next interesting line is 6708:

 const I32 retval = memcmp((const void*)pv1, (const void*)pv2, cur1 < cur2 ? cur1 : cur2);

To me that looks like once it has pv1 and pv2, which were coerced to char *, now are just compared byte-by-byte because they are coerced to void *. Is that what happens with memcmp, which looks like it's just comparing bits based on the various docs I've read so far? Again, I'm wondering what I'm missing in the trip from bytes->utf8->char->bytes, like maybe a Unicode normalization step. Checking out Perl_bytes_to_utf8 in utf8.c didn't help me answer that question.

As a side note, I'm wondering if this is the same thing as the Unicode Collation Algorithm? If it is, why does Unicode::Collate exist? From the looks of it, I don't think Perl's sort handles canonical equivalence.

A: 

My guess is the perl developers just call strcmp. So the standard string comparison order depends on what libc does on your machine. Or, how gcc compiles:

int strcmp(const char *s1, const char *s2)
{
    while((*s1 && *s2) && (*s1++ == *s2++));
    return *(--s1) - *(--s2);
}
Andomar
They don't call strcmp. Read the source. I'm not looking for guesses. I can do that on my own. :)
brian d foy
So you think anyone can read the perl source, and give a non-guess answer?
Andomar
Yes, there are people on Stackoverflow who not only read the perl source, they even wrote some of it. Anyone can read the source, which is why I pointed to the file and line number.
brian d foy
@Andomar: anyone can, but not everyone will or has the knowledge to do so accurately. Which is why not everyone should be trying to answer this question. :)
Ether
@Ester: If you can, why don't you download http://www.perl.com/CPAN/src/perl-5.10.0.tar.gz and answer the question? :)
Andomar
@Andomar: why download old source when you can get the latest? Really, when you're in a hole, stop digging.
brian d foy
@Andomar: perl 5.10.1 was released this summer and 5.11.1 in October, but most uptodate source is in repository
Alexandr Ciornii
Thanks for the link! It's kind of amusing that my answer is substantially the same as the accepted answer; perl does the equivalent of a strmcp compare :)
Andomar
What accepted answer?
innaM
@Manni: meant Hobb's answer, looks like it hasn't been accepted yet
Andomar
@Andomar: your answer is *not* substantially the same; the fact that you would say this means you don't understand low-level language internals.
Ether
@Ether: Based on the other answers Perl does a bitstream sort, much like strcmp. I agree that perl doesn't call the C library's strcmp function like I suggested.
Andomar
Why don't you just delete this **very** incorrect answer?
Brad Gilbert
@Brad Gilbert: It might be very incorrect but I don't understand why. Can you comment one example of a string that `strcmp` orders differently than `gt` ? (no `'\0'` allowed)
Andomar
It's wrong because it misses the point. It's not the function that you're calling but the data that it compares that I'm asking about.
brian d foy
+13  A: 

UTF-8 has the property that sorting a UTF-8 string byte-by-byte according to the byte value gives the same ordering as sorting it codepoint-by-codepoint according to the codepoint number. That is, I know without looking that the UTF-8 representation of U+2345 is lexicographically after the UTF-8 representation of U+1234.

As for normalization, the Perl core doesn't know anything about it; to get accurate sorting and comparison among the different forms you would want to run all of your strings through Unicode::Normalize and convert them all to the same normalization form. I can't comment on which is best for any given purpose, mostly because I have no clue.

Also, sorting and cmp are affected by the locale pragma if it's in use; it uses the POSIX collation order. Using use locale, an 8-bit locale, and unicode all together is a recipe for disaster, but using use locale, a UTF-8 locale, and unicode should work usefully. I can't say I've tried it. There's an awful lot of info in perllocale and perlunicode anyway.

hobbs
Okay, I think that's the sort of confirmation I needed. I thought that was how it worked but I wasn't sure. Come to a meeting sometime so I can buy you a beer. :)
brian d foy
I've been meaning to get to some meetings, but my schedule usually has me working until 7PM or later, so I usually have to miss them. I'll try to work something out.
hobbs
That's a really interesting fact I didn't know. Seems like a smart design decision to me! (Obvious in hindsight, but hey, most smart decisions are.)
Leonardo Herrera
+5  A: 

I can't answer the whole question, so let me hone in on one part:

    const I32 retval = memcmp((const void*)pv1, (const void*)pv2, cur1 < cur2 ? cur1 : cur2);

... looks like once it has pv1 and pv2, which were coerced to char *, now are just compared byte-by-byte because they are coerced to void *. Is that what happens with memcmp

Pretty much. The main differences differences between memcmp and strcmp are:

  1. strcmp will stop once it sees a NULL (i.e., '\0'), and Perl allows scalars to have embedded NULLs
  2. memcmp often runs just a little bit faster than strcmp

But aside from that you're going to get the same results.

Max Lybbert