I am confused by strcmp(), or rather, how it is defined by the standard. Consider comparing two strings where one contains characters outside the ASCII-7 range (0-127).

The C standard defines:

int strcmp(const char *s1, const char *s2);

The strcmp function compares the string pointed to by s1 to the string pointed to by s2.

The strcmp function returns an integer greater than, equal to, or less than zero, accordingly as the string pointed to by s1 is greater than, equal to, or less than the string pointed to by s2.

The parameters are const char *, not const unsigned char *, and nothing in this passage says that the comparison should be done as unsigned.

But all the standard libraries I checked consider the "high" character to be just that, higher in value than the ASCII-7 characters.

I understand this is useful and the expected behaviour, and I am not saying the existing implementations are wrong. I just want to know which part of the standard I have missed.

#include <stdio.h>
#include <string.h>

/* Compares through plain char, whose signedness is implementation-defined. */
int strcmp_default( const char * s1, const char * s2 )
{
    while ( ( *s1 ) && ( *s1 == *s2 ) )
    {
        ++s1;
        ++s2;
    }
    return ( *s1 - *s2 );
}

/* Compares the same bytes reinterpreted as unsigned char. */
int strcmp_unsigned( const char * s1, const char * s2 )
{
    const unsigned char * p1 = (const unsigned char *)s1;
    const unsigned char * p2 = (const unsigned char *)s2;

    while ( ( *p1 ) && ( *p1 == *p2 ) )
    {
        ++p1;
        ++p2;
    }
    return ( *p1 - *p2 );
}

int main( void )
{
    char x1[] = "abc";
    char x2[] = "abü";
    printf( "%d\n", strcmp_default( x1, x2 ) );
    printf( "%d\n", strcmp_unsigned( x1, x2 ) );
    printf( "%d\n", strcmp( x1, x2 ) );
    return 0;
}

Output is:

103
-153
-153
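
One way to account for these numbers, as a sketch under two assumptions that the post does not state: the source file is stored in ISO-8859-1, so that ü is the single byte 0xFC, and plain char is signed on the platform used. The helper names below are hypothetical and only reproduce the arithmetic on the first differing pair of bytes ('c' is 0x63, i.e. 99).

```c
/* Hypothetical helpers, not part of the original program. */

/* Difference of two bytes when both are read as signed char, which is
   what strcmp_default effectively does where plain char is signed. */
int diff_as_signed(unsigned char a, unsigned char b)
{
    /* Converting a value above 127 to signed char is implementation-defined
       before C23; on common two's complement platforms 0xFC becomes -4. */
    return (signed char)a - (signed char)b;
}

/* Difference of the same two bytes when both are read as unsigned char. */
int diff_as_unsigned(unsigned char a, unsigned char b)
{
    /* Both operands are promoted to int before the subtraction. */
    return a - b;
}
```

Under those assumptions, diff_as_signed(0x63, 0xFC) is 99 - (-4) = 103 and diff_as_unsigned(0x63, 0xFC) is 99 - 252 = -153, matching the output above. With a UTF-8 source file the bytes, and therefore the numbers, would differ.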
+4  A: 
  • The C standard leaves it up to the implementation whether "char" without a modifier is treated as signed or unsigned.

  • strcmp() is intended for comparing strings of text, not arrays of 8-bit-wide integers, so it follows the conventions of the former domain (the ordering of text characters) rather than the latter.
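
Whether plain char is signed on a given implementation can be checked through CHAR_MIN from <limits.h>. A minimal sketch; the function name is hypothetical:

```c
#include <limits.h>

/* Hypothetical helper: returns 1 if plain char is signed on this
   implementation, 0 if it is unsigned. */
int plain_char_is_signed(void)
{
    /* CHAR_MIN is 0 when char is unsigned and negative when it is signed. */
    return CHAR_MIN < 0;
}
```

Many compilers also offer a switch to change the default, so the result can differ between builds on the same machine.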

moonshadow
+10  A: 

7.21.4/1 (C99), emphasis is mine:

The sign of a nonzero value returned by the comparison functions memcmp, strcmp, and strncmp is determined by the sign of the difference between the values of the first pair of characters (both interpreted as unsigned char) that differ in the objects being compared.
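
A minimal sketch of a comparison that follows this rule. The name strcmp_sketch is hypothetical, and this is not meant to show how any particular library implements strcmp:

```c
/* Hypothetical name; a sketch of the quoted rule only. */
int strcmp_sketch(const char *s1, const char *s2)
{
    /* Read both strings as unsigned char, as 7.21.4/1 requires. */
    const unsigned char *p1 = (const unsigned char *)s1;
    const unsigned char *p2 = (const unsigned char *)s2;

    /* Advance to the first differing pair or to the end of s1. */
    while (*p1 != '\0' && *p1 == *p2)
    {
        ++p1;
        ++p2;
    }

    /* Only the sign of the result is specified, so return -1, 0 or 1. */
    return (*p1 > *p2) - (*p1 < *p2);
}
```

With this rule a byte such as 0xFC always compares greater than any ASCII-7 character, regardless of whether plain char is signed.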

There is something similar in C90.

Note that strcoll() may be more suitable than strcmp(), especially if you have characters outside the basic character set.
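
A rough sketch of how strcoll() might be used, assuming the strings are encoded the way the selected locale expects. The two function names are hypothetical; setlocale() and strcoll() are the standard calls:

```c
#include <locale.h>
#include <string.h>

/* Hypothetical helper: select the collation rules of the user's
   environment. Returns 1 on success, 0 if the locale could not be set,
   in which case the previous locale ("C" by default) stays in effect. */
int use_environment_collation(void)
{
    return setlocale(LC_COLLATE, "") != NULL;
}

/* Hypothetical helper: compare two strings according to the current
   LC_COLLATE setting. In the "C" locale this orders like strcmp(). */
int compare_for_sorting(const char *s1, const char *s2)
{
    return strcoll(s1, s2);
}
```

The result for strings containing characters such as ü depends on the locale and encoding in effect, so it is not shown here.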

AProgrammer
Excellent. Exactly the kind of answer I've been looking for. Thanks!
DevSolar