This is really a double question, my two end goals having answers to:
- What is the standard string comparison order, in terms of the mechanics?
- What's a better name for that so I can update the docs?
Perl's documentation for sort says that without a block, sort
uses "standard string comparison order". But what is that order? There should be a better name for it. For this question, I specifically mean the situation where locale is not in effect, since that defines its own order.
In years past, we normally called the standard sort order "ASCIIbetically". It's in Learning Perl and many other books. However, that term is dated. Perl has been Unicode-aware since 5.6. Talking about ASCII is old school. Since Perl is also Unicode-aware, it knows about character strings. In sv.c, Perl_sv_cmp
knows about locale
, bytes
, and UTF-8. The first two are easy. But I'm not confident about the third.
/*
=for apidoc sv_cmp
Compares the strings in two SVs. Returns -1, 0, or 1 indicating whether the
string in C<sv1> is less than, equal to, or greater than the string in
C<sv2>. Is UTF-8 and 'use bytes' aware, handles get magic, and will
coerce its args to strings if necessary. See also C<sv_cmp_locale>.
=cut
*/
When Perl sorts using UTF-8, what is it really sorting? The bytes the string encodes to, the characters it represents (including marks maybe?), or something else? I think this is the relevant line in sv.c (line 6698 for commit 7844ec1):
pv1 = tpv = (char*)bytes_to_utf8((const U8*)pv1, &cur1);
If I'm reading that right (using my rusty C), pv1
is coerced to octets, turned into UTF-8, then coerced to characters (in the C sense). I think that means it's sorting by the UTF-8 encoding (i.e. the actual bytes that UTF-8 uses to represent a code point). Another way to say that is that it doesn't sort on graphemes. I think I've almost convinced myself I'm reading this right, but some of you know way more about this than I do.
From that, the next interesting line is 6708:
const I32 retval = memcmp((const void*)pv1, (const void*)pv2, cur1 < cur2 ? cur1 : cur2);
To me that looks like once it has pv1
and pv2
, which were coerced to char *
, now are just compared byte-by-byte because they are coerced to void *
. Is that what happens with memcmp
, which looks like it's just comparing bits based on the various docs I've read so far? Again, I'm wondering what I'm missing in the trip from bytes->utf8->char->bytes, like maybe a Unicode normalization step. Checking out Perl_bytes_to_utf8
in utf8.c didn't help me answer that question.
As a side note, I'm wondering if this is the same thing as the Unicode Collation Algorithm? If it is, why does Unicode::Collate exist? From the looks of it, I don't think Perl's sort
handles canonical equivalence.