views:

310

answers:

3

I'm debugging some opensource code on a 64-bit Solaris system, using GCC, that converts 2byte characters (wchar_t) to 4byte characters (wchar_t). Because Solaris like some other Unixes define wchar_t as 4byte, not 2byte like in Windows.

Now I fixed the problem, through laying it out the pointer arithmetic over two lines, but I'm not sure what was wrong with the original code. Any clues?

Original Code

int StringCopy2to4bytes(const unsigned short* src, int src_size, 
                         unsigned int* dst, int dst_size)
{
 int cp_size = 0;

 const unsigned short *src_end = NULL;
 const unsigned int   *dst_end = NULL;

 unsigned int c1, c2;

 src_end = src + src_size;
 dst_end = dst + dst_size;

 while (src < src_end)
 {
     c1 = *src++;
     if ((c1 >= UNI_SUR_HIGH_START) && (c1 <= UNI_SUR_HIGH_END))
     {
         if (src < src_end)
         {
             c2 = *src;
             if ((c2 >= UNI_SUR_LOW_START) && (c2 <= UNI_SUR_LOW_END))
             {
                c1 = ((c1 - UNI_SUR_HIGH_START) << UNI_SHIFT) + 
                      (c1 - UNI_SUR_LOW_START )  + UNI_BASE;

                ++src;
             }
         } 
         else 
             return -1;
     } 

     if (dst >= dst_end) return -2;

     *dst++ = c1;

     cp_size++;
 }

 return cp_size;
}

Fixed Code

int StringCopy2to4bytes(const unsigned short* src, int src_size, 
                         unsigned int* dst, int dst_size)
{
 int cp_size = 0;

 const unsigned short *src_end = NULL;
 const unsigned int   *dst_end = NULL;

 unsigned int c1, c2;

 src_end = src + src_size;
 dst_end = dst + dst_size;

 while (src < src_end)
 {
     c1 = *src; //FIX
     ++src;

     if ((c1 >= UNI_SUR_HIGH_START) && (c1 <= UNI_SUR_HIGH_END))
     {
         if (src < src_end)
         {
             c2 = *src;
             if ((c2 >= UNI_SUR_LOW_START) && (c2 <= UNI_SUR_LOW_END))
             {
                c1 = ((c1 - UNI_SUR_HIGH_START) << UNI_SHIFT) + 
                      (c1 - UNI_SUR_LOW_START )  + UNI_BASE;

                ++src;
             }
         } 
         else 
             return -1;
     } 

     if (dst >= dst_end) return -2;

     *dst = c1; //FIX
     ++dst;

     cp_size++;
 }

 return cp_size;
}

Edit: for the record, the code isn't mine, I'm just using it, and happen to be debugging it, not that it makes a big difference, but the source is fairly big, so I'm trying to fix it with tweezers as it may be, not refactor everything, anyways bugs are bugs, and I need to fix it and mail the author about what was wrong.

The constants are:

/* unicode constants */
#define UNI_SHIFT             ((int) 10 )
#define UNI_BASE              ((unsigned int) 0x0010000UL)
#define UNI_MASK              ((unsigned int) 0x3FFUL)
#define UNI_REPLACEMENT_CHAR  ((unsigned int) 0x0000FFFD)
#define UNI_MAX_BMP           ((unsigned int) 0x0000FFFF)
#define UNI_MAX_UTF16         ((unsigned int) 0x0010FFFF)
#define UNI_MAX_UTF32         ((unsigned int) 0x7FFFFFFF)
#define UNI_MAX_LEGAL_UTF32   ((unsigned int) 0x0010FFFF)
#define UNI_SUR_HIGH_START    ((unsigned int) 0xD800)
#define UNI_SUR_HIGH_END      ((unsigned int) 0xDBFF)
#define UNI_SUR_LOW_START     ((unsigned int) 0xDC00)
#define UNI_SUR_LOW_END       ((unsigned int) 0xDFFF)
A: 

There shouldn't be a difference, ++ binds higher than *.

A quick test.c doesn't show any difference, either:

#include <stdio.h>

int main()
{
    int a[] = { 0, 1, 2, 3, 4, 5 };
    int * p = a;
    printf( "%d\n", *p );
    printf( "%d\n", *p++ );
    printf( "%d\n", *p );
    printf( "%d\n", *(p++) );
    printf( "%d\n", *p );
    return 0;
}

Gives:

0
0
1
1
2

What makes you think you had a problem that the new code fixed?

Edit: Finding a compiler bug in something that trivial is highly unlikely. Above test was run with GCC 4.1.2.

Edit 2: You've got some type mismatches there. c1 is unsigned int, *src is unsigned short. Sizes should be size_t, not int. Does the problem with the original code persist if you fix those?

DevSolar
Sorry, I didn't get the formatting right.
DevSolar
I agree with that Sizes should be size_t but the entire code base uses int for sizes, and does some checks against size > -1, so I can't change it very easily.
Robert Gould
Erm... yes you can? Simply use size_t where sizes are required. Any int will be casted to size_t automatically. Yes, that's still a type mismatch, but it's at the interface between "your code" and "old code"...Haven't checked the algorithm, Jonathan has some things to say about that...
DevSolar
+4  A: 

The code as written here is still buggy - when you combine c1 and c2 you need to use c2! That is, at the lines:

c1 = ((c1 - UNI_SUR_HIGH_START) << UNI_SHIFT) +
      (c1 - UNI_SUR_LOW_START ) + UNI_BASE;

The third occurrence of c1 should actually be c2.

Additionally, it seems silly to initialize the src_end pointer to null and then to src + src_size. Why not get there straight away?

Also, cp_size could redundant if the start of the string was preserved; it would then be the same as (dst - initial_dst).


Test code - with c1 to c2 fix - using first code example, on Solaris 10, with GCC 4.3.3. Results for 32-bit and 64-bit compilation shown. Data from Table 3.4 in Chapter 3 of the Unicode standard (technically, Unicode 5.0 rather than 5.1.0, but I don't think it matters).

enum { NULL = 0 };
enum { UNI_SUR_HIGH_START = 0xD800, UNI_SUR_HIGH_END = 0xDBFF,
       UNI_SUR_LOW_START  = 0xDC00, UNI_SUR_LOW_END  = 0xDFFF,
       UNI_SHIFT = 10, UNI_BASE = 0x10000 };

int StringCopy2to4bytes(const unsigned short* src, int src_size, 
                         unsigned int* dst, int dst_size)
{
 int cp_size = 0;

 const unsigned short *src_end = NULL;
 const unsigned int   *dst_end = NULL;

 unsigned int c1, c2;

 src_end = src + src_size;
 dst_end = dst + dst_size;

 while (src < src_end)
 {
     c1 = *src++;
     if ((c1 >= UNI_SUR_HIGH_START) && (c1 <= UNI_SUR_HIGH_END))
     {
         if (src < src_end)
         {
             c2 = *src;
             if ((c2 >= UNI_SUR_LOW_START) && (c2 <= UNI_SUR_LOW_END))
             {
                c1 = ((c1 - UNI_SUR_HIGH_START) << UNI_SHIFT) + 
                      (c2 - UNI_SUR_LOW_START )  + UNI_BASE;    /* Fixed */

                ++src;
             }
         } 
         else 
             return -1;
     } 

     if (dst >= dst_end) return -2;

     *dst++ = c1;

     cp_size++;
 }

 return cp_size;
}

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    unsigned short w2_chars[] = { 0x004D, 0x0430, 0x4E8C, 0xD800, 0xDF02, 0x004D };
    unsigned int   w4_wanted[] = { 0x00004D, 0x000430, 0x004E8C, 0x010302, 0x00004D };
    unsigned int   w4_actual[5];
    int w2_len = 6;
    int w4_len = 5;
    int w4_actlen;
    int i;
    int failed = 0;

    w4_actlen = StringCopy2to4bytes(w2_chars, w2_len, w4_actual, w4_len);
    if (w4_actlen != w4_len)
    {
        failed = 1;
        printf("Length mismatch: got %d, wanted %d\n", w4_actlen, w4_len);
    }
    for (i = 0; i < w4_len; i++)
    {
        if (w4_actual[i] != w4_wanted[i])
        {
            printf("Mismatch: index %d: wanted 0x%06X, actual 0x%06X\n",
                   i, w4_wanted[i], w4_actual[i]);
            failed = 1;
        }
    }
    if (failed == 0)
        printf("No problem observed\n");
    return((failed == 0) ? EXIT_SUCCESS : EXIT_FAILURE);
}


$ gcc -m32 -O utf.c -o utf32 && ./utf32
No problem observed
$ gcc -m64 -O utf.c -o utf64 && ./utf64
No problem observed
$

I'm left wondering what is up with your compiler - or your test case.


Here's a revised version of the StringCopy2to4bytes() function. It detects and reports an error condition that the original did not - namely when the second word of surrogate pair is not a valid low surrogate code point, it returns status -3.

int StringCopy2to4bytes(const unsigned short *src, int src_size, 
                        unsigned int *dst, int dst_size)
{
    const unsigned short *src_end = src + src_size;
    const unsigned int   *dst_end = dst + dst_size;
    const unsigned int   *dst0    = dst;

    while (src < src_end)
    {
        unsigned int c1 = *src++;
        if ((c1 >= UNI_SUR_HIGH_START) && (c1 <= UNI_SUR_HIGH_END))
        {
            if (src >= src_end)
                return -1;
            unsigned int c2 = *src++;
            if ((c2 >= UNI_SUR_LOW_START) && (c2 <= UNI_SUR_LOW_END))
            {
               c1 = ((c1 - UNI_SUR_HIGH_START) << UNI_SHIFT) + 
                     (c2 - UNI_SUR_LOW_START )  + UNI_BASE;    /* Fixed */
            }
            else
                return -3;  /* Invalid second code point in surrogate pair */
        } 
        if (dst >= dst_end)
            return -2; 
        *dst++ = c1;
    }
    return dst - dst0;
}

The same test code produces the same clean bill of health. The declaration of c2 assumes you are using C99 - not C89.

Jonathan Leffler
uh? What do you mean?
Robert Gould
I think he's talking about lines 24/25 from example 1. I wonder if the pointer arithmetic is a red herring where the problem is really this.
Michael Burr
Added the values of the constants to the question
Robert Gould
Thanks for the Unicode fix! Anyways the issue was a rare buffer overflow not the pointer arithmetics:/
Robert Gould
Had to add unicode to the question so this could be an answer :)
Robert Gould
@Robert - Just curious was the buffer overflow a bug in this routine or was an incorrect size being passed in (for example, size in bytes instead of size in chars)?
Michael Burr
A buffer overflow makes more sense than a compiler bug, which was an outside possibility at most. I'm as curious as Michael about how the buffer overflow occurred. The function basically looks sound and safe, so I assume it was being misused.
Jonathan Leffler
The buffer was shorter than the advertised size. Specifically some buffer sizes were being rounded to even numbers, when they should have been, say buffer[3] was actually buffer[2], so it was writing out of bounds
Robert Gould
That'll do it every time. Compilers object when you lie to them, and usually find a way to get their own back. Bad luck on having to debug someone else's buffer overflows.
Jonathan Leffler
+1  A: 

This smells like we could need some () here.

Look at the difference

  1. c1 = *src++;
  2. c1 = (*src)++;
  3. c1 = *(src++);

I really like () since they remove some ambiguity on what the programmer wants to do.

/Johan

Johan