ansaurus

Question

What is Perl's "standard string comparison order"?

Answer 1

A:

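My guess is that the perl developers just call strcmp, so the standard string comparison order depends on what libc does on your machine, or on how gcc compiles something like:

int strcmp(const char *s1, const char *s2)
{
    /* Advance while s1 has characters left and the current pair matches. */
    while (*s1 && *s1 == *s2) {
        s1++;
        s2++;
    }
    /* Compare the first differing bytes as unsigned char, as C requires. */
    return (unsigned char)*s1 - (unsigned char)*s2;
}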

Andomar 2009-11-04 23:14:55

They don't call strcmp. Read the source. I'm not looking for guesses. I can do that on my own. :)

brian d foy 2009-11-04 23:16:34

So you think anyone can read the perl source, and give a non-guess answer?

Andomar 2009-11-04 23:47:44

Yes, there are people on Stackoverflow who not only read the perl source, they even wrote some of it. Anyone can read the source, which is why I pointed to the file and line number.

brian d foy 2009-11-04 23:54:32

@Andomar: anyone can, but not everyone will or has the knowledge to do so accurately. Which is why not everyone should be trying to answer this question. :)

Ether 2009-11-05 00:12:01

@Ether: If you can, why don't you download http://www.perl.com/CPAN/src/perl-5.10.0.tar.gz and answer the question? :)

Andomar 2009-11-05 00:14:45

@Andomar: why download old source when you can get the latest? Really, when you're in a hole, stop digging.

brian d foy 2009-11-05 00:25:20

@Andomar: perl 5.10.1 was released this summer and 5.11.1 in October, but the most up-to-date source is in the repository.

Alexandr Ciornii 2009-11-05 02:39:33

Thanks for the link! It's kind of amusing that my answer is substantially the same as the accepted answer; perl does the equivalent of a strcmp compare :)

Andomar 2009-11-05 09:41:39

What accepted answer?

innaM 2009-11-05 10:45:07

@Manni: I meant hobbs's answer; it looks like it hasn't been accepted yet.

Andomar 2009-11-05 11:07:10

@Andomar: your answer is *not* substantially the same; the fact that you would say this means you don't understand low-level language internals.

Ether 2009-11-05 19:31:51

@Ether: Based on the other answers Perl does a bitstream sort, much like strcmp. I agree that perl doesn't call the C library's strcmp function like I suggested.

Andomar 2009-11-05 19:44:55

Why don't you just delete this **very** incorrect answer?

Brad Gilbert 2009-11-17 14:40:30

@Brad Gilbert: It might be very incorrect but I don't understand why. Can you give one example of a string that `strcmp` orders differently than `gt`? (no `'\0'` allowed)

Andomar 2009-11-17 15:55:42

It's wrong because it misses the point. It's not the function that you're calling but the data that it compares that I'm asking about.

brian d foy 2009-11-24 01:21:49

Answer 2

+13 A:

UTF-8 has the property that sorting a UTF-8 string byte-by-byte according to the byte value gives the same ordering as sorting it codepoint-by-codepoint according to the codepoint number. That is, I know without looking that the UTF-8 representation of U+2345 is lexicographically after the UTF-8 representation of U+1234.

As for normalization, the Perl core doesn't know anything about it; to get accurate sorting and comparison among the different forms you would want to run all of your strings through Unicode::Normalize and convert them all to the same normalization form. I can't comment on which is best for any given purpose, mostly because I have no clue.

Also, sorting and cmp are affected by the locale pragma if it's in use; it uses the POSIX collation order. Using use locale, an 8-bit locale, and unicode all together is a recipe for disaster, but using use locale, a UTF-8 locale, and unicode should work usefully. I can't say I've tried it. There's an awful lot of info in perllocale and perlunicode anyway.

hobbs 2009-11-05 01:06:06

Okay, I think that's the sort of confirmation I needed. I thought that was how it worked but I wasn't sure. Come to a meeting sometime so I can buy you a beer. :)

brian d foy 2009-11-05 01:12:14

I've been meaning to get to some meetings, but my schedule usually has me working until 7PM or later, so I usually have to miss them. I'll try to work something out.

hobbs 2009-11-05 01:16:55

That's a really interesting fact I didn't know. Seems like a smart design decision to me! (Obvious in hindsight, but hey, most smart decisions are.)

Leonardo Herrera 2009-11-05 13:02:19

Answer 3

+5 A:

I can't answer the whole question, so let me home in on one part:

    const I32 retval = memcmp((const void*)pv1, (const void*)pv2, cur1 < cur2 ? cur1 : cur2);

... it looks like once it has pv1 and pv2, which were coerced to char *, they are now just compared byte-by-byte because they are coerced to void *. Is that what happens with memcmp?

Pretty much. The main differences between memcmp and strcmp are:

strcmp will stop once it sees a NUL (i.e., '\0'), and Perl allows scalars to have embedded NULs
memcmp often runs just a little bit faster than strcmp

But aside from that you're going to get the same results.

Max Lybbert 2009-11-05 01:22:51
