tags:

views:

1372

answers:

7

When I make a call to the C string compare function like this:

strcmp("time","time")

It returns 0, which implies that the strings are not equal.

Can anyone tell me why C implementations seem to do this? I would think it would return a non-zero value if equal. I am curious to the reasons I am seeing this behavior.

+10  A: 

strcmp returns a lexical difference (or should i call it "short-circuit serial byte comparator" ? :-) ) of the two strings you have given as parameters. 0 means that both strings are equal

A positive value means that s1 would be after s2 in a dictionary.

A negative value means that s1 would be before s2 in a dictionary.

Hence your non-zero value when comparing "time" and "money" which are obviously different, even though one would say that time is money ! :-)

Benoît
strcmp() does not do a lexical comparison, it merely compares the value of each character until there is a difference or both strings terminate.
Ferruccio
I could swear I've read docs that said strcmp() does a lexical comparison, but some quick googling suggests Ferruccio is correct... could it have changed at some point?
rmeador
Nope. Always that way :-)
Ferruccio
Hmmm I'm pretty sure it's a lexical comparison !http://www.cplusplus.com/reference/clibrary/cstring/strcmp.html
Benoît
I think it changes with the given definition of "lexical". I too have seen strcmp() referred to as being a 'lexical' comparator, though obviously it does not take different localizations, collations, etc. into account. Maybe it is better to call it a "short-circuit serial byte comparator."
aib
It simply uses each character's ASCII value for comparison which are ordered like: ABCDE...abcde. A lexical comparison would order them like: AaBbCcDdEe.
Ferruccio
@Ferruccio I updated my answer. But you'll admit that the idea behind the comparison is lexical ! :-)
Benoît
there is a strcoll. i think that one takes the current locale into account
Johannes Schaub - litb
@litb, right it does, but for some strange reasons it falls back to strcmp if the locale is set to 'C' or 'POSIX' and not to strcasecmp, which would be more appropriate.
quinmars
+2  A: 

You seem to want strcmp to work like a (hypothetical)

int isEqual(const char *, const char *)

To be sure that would be true to the "zero is false" interpretation of integer results, but it would complicate the logic of sorting because, having established that the two strings were not the same, you would still need to learn which came "earlier".

Moreover, I suspect that a common implementation looks like

int strcmp(const char *s1, const char *s2){
   const unsigned char *q1=s1, *q2=s2;
   while ((*q1 == *q2) && *q1){ 
      ++q1; ++q2; 
   };
   return (*q1 - *q2);
}

which is [edit: kinda] elegant in a K&R kind of way. The important point here (which is increasingly obscured by getting the code right (evidently I should have left well enough alone)) is the way the return statement:

   return (*q1 - *q2);

which gives the results of the comparison naturally in terms of the character values.

dmckee
Wouldn't that "common implementation" walk off the buffer (past the null termination?) I don't see how it would ever return 0...
Daniel LeCheminant
i think u gotta have to terminate the loop if you come to \0 :) u also have to watch about underflows in *s1 - *s2 so s1=-127, s2=2 => oops :)
Johannes Schaub - litb
@Daniel: Uh. Yes. Moment while I fix this...
dmckee
oh wait. no won't cause an underflow because both are first converted to an int.
Johannes Schaub - litb
Daniel LeCheminant
@Daniel: 'cause I was trying to be quick...
dmckee
i've looked up in c99 it says both are first converted to unsigned char before being subtracted.so there can't even be a case where s1=-127 :) but it changes some: *a - *b could become -2 with char (*a=-1, *b=1) and +254 with unsigned char.so i believe a cast should be added there. kind of tricky :)
Johannes Schaub - litb
@libt: The Mac OS X manpage agrees, but putting in the casts is going to play merry havoc with my allegidly "elegant" implementation. ::sigh::
dmckee
indeed :) i like how you added "kinda". so one still can see the a - b pattern in it without scary casts in between :) all fine i would say
Johannes Schaub - litb
I would think return (int)*q1 - (int)*q2; would solve the sign issue. What makes the casts scary?
jmucchiello
@joe_mucchiello: They would clutter up the one line I really wanted to emphasize in this instance...
dmckee
joe_micchiello, they would not solve it. they're cast (promoted) to int anyway automatically. you will have to do (unsigned char)*q1 - (unsigned char)*q2; i think that really clutters the line. now anyway that they could just look into the comments and see how they could write it in non-teaching way
Johannes Schaub - litb
+2  A: 

It's common to functions to return zero for the common - or one-of-a-kind - case and non-zero for special cases. Take the main function, which conventionally returns zero on success and some nonzero value for failure. The precise non-zero value indicates what went wrong. For example: out of memory, no access rights or something else.

In your case, if the string is equal, then there is no reason why it is equal other than that the strings contain the same characters. But if they are non-equal then either the first can be smaller, or the second can be smaller. Having it return 1 for equal, 0 for smaller and 2 for greater would be somehow strange i think.

You can also think about it in terms of subtraction:

return = s1 - s2

If s1 is "lexicographically" less, then it will give is a negative value.

Johannes Schaub - litb
I'm pretty sure you need to dereference your 's's...
dmckee
i meant it symbolical was pseudo code.
Johannes Schaub - litb
Ah. I withdraw the suggestion.
dmckee
A: 

I guess it is simply for symmetry: -1 if less, 0 if equal, 1 if more.

PhiLho
+3  A: 

The nice thing about an implementation like this is you can say

if(strcmp(<stringA>, <stringB>) > 0)   // Implies stringA > stringB
if(strcmp(<stringA>, <stringB>) == 0)  // Implies stringA == stringB
if(strcmp(<stringA>, <stringB>) < 0)   // Implies stringA < stringB
if(strcmp(<stringA>, <stringB>) >= 0)  // Implies stringA >= stringB
if(strcmp(<stringA>, <stringB>) <= 0)  // Implies stringA <= stringB
if(strcmp(<stringA>, <stringB>) != 0)  // Implies stringA != stringB

Note how the comparison with 0 exactly matches the comparison in the implication.

Daniel LeCheminant
There's a wonderful/horrible macro from the comp.lang.c FAQ that implements almost exactly this string behavior. `#define StrTest(str1, op, str2) (strcmp(str1, str2) op 0)` With it, you would write `if(StrTest(stringA, ==, stringB))` and friends. I'm on the fence as to whether it's a horrible idea or a wonderful idea.
Chris Lutz
+2  A: 

There's three possible results: string 1 comes before string 2, string 1 comes after string 2, string 1 is the same as string 2. It is important to keep these three results separate; one use of strcmp() is to sort strings. The question is how you want to assign values to these three outcomes, and how to keep things more or less consistent. You might also look at the parameters for qsort() and bsearch(), which require compare functions much like strcmp().

If you wanted a string equality function, it would return nonzero for equal strings and zero for non-equal strings, to go along with C's rules on true and false. This means that there would be no way of distinguishing whether string 1 came before or after string 2. There are multiple true values for an int, or any other C data type you care to name, but only one false.

Therefore, having a useful strcmp() that returned true for string equality would require a lot of changes to the rest of the language, which simply aren't going to happen.

David Thornley
+1  A: 

Another reason strcmp() returns the codes it does is so that it can be used directly in the standard library function qsort(), allowing you to sort an array of strings:

#include <string.h> // for strcmp()
#include <stdlib.h> // for qsort()
#include <stdio.h>

int sort_func(const void *a, const void *b)
{
    const char **s1 = (const char **)a;
    const char **s2 = (const char **)b;
    return strcmp(*s1, *s2);
}

int main(int argc, char **argv)
{
    int i;
    printf("Pre-sort:\n");
    for(i = 1; i < argc; i++)
        printf("Argument %i is %s\n", i, argv[i]);
    qsort((void *)(argv + 1), argc - 1, sizeof(char *), sort_func);
    printf("Post-sort:\n");
    for(i = 1; i < argc; i++)
        printf("Argument %i is %s\n", i, argv[i]);
    return 0;
}

This little sample program sorts its arguments ASCIIbetically (what some would call lexically). Lookie:

$ gcc -o sort sort.c
$ ./sort hi there little fella
Pre-sort:
Argument 1 is hi
Argument 2 is there
Argument 3 is little
Argument 4 is fella
Post-sort:
Argument 1 is fella
Argument 2 is hi
Argument 3 is little
Argument 4 is there

If strcmp() returned 1 (true) for equal strings and 0 (false) for inequal ones, it would be impossible to use it to obtain the degree or direction of inequality (i.e. how different, and which one is bigger) between the two strings, thus making it impossible to use it as a sorting function.

I don't know how familiar you are with C. The above code uses some of C's most confusing concepts - pointer arithmetic, pointer recasting, and function pointers - so if you don't understand some of that code, don't worry, you'll get there in time. Until then, you'll have plenty of fun questions to ask on StackOverflow. ;)

Chris Lutz