ansaurus

Question

What is the difference between these two versions of code (pointer arithmetics & unicode) ?

Answer 1

A:

There shouldn't be a difference, ++ binds higher than *.

A quick test.c doesn't show any difference, either:

#include <stdio.h>

int main()
{
    int a[] = { 0, 1, 2, 3, 4, 5 };
    int * p = a;
    printf( "%d\n", *p );
    printf( "%d\n", *p++ );
    printf( "%d\n", *p );
    printf( "%d\n", *(p++) );
    printf( "%d\n", *p );
    return 0;
}

Gives:

What makes you think you had a problem that the new code fixed?

Edit: Finding a compiler bug in something that trivial is highly unlikely. Above test was run with GCC 4.1.2.

Edit 2: You've got some type mismatches there. c1 is unsigned int, *src is unsigned short. Sizes should be size_t, not int. Does the problem with the original code persist if you fix those?

DevSolar 2009-02-16 04:09:31

Sorry, I didn't get the formatting right.

DevSolar 2009-02-16 04:16:01

I agree with that Sizes should be size_t but the entire code base uses int for sizes, and does some checks against size > -1, so I can't change it very easily.

Robert Gould 2009-02-16 04:24:32

Erm... yes you can? Simply use size_t where sizes are required. Any int will be casted to size_t automatically. Yes, that's still a type mismatch, but it's at the interface between "your code" and "old code"...Haven't checked the algorithm, Jonathan has some things to say about that...

DevSolar 2009-02-16 04:28:31

Answer 2

+4 A:

The code as written here is still buggy - when you combine c1 and c2 you need to use c2! That is, at the lines:

c1 = ((c1 - UNI_SUR_HIGH_START) << UNI_SHIFT) +
      (c1 - UNI_SUR_LOW_START ) + UNI_BASE;

The third occurrence of c1 should actually be c2.

Additionally, it seems silly to initialize the src_end pointer to null and then to src + src_size. Why not get there straight away?

Also, cp_size could redundant if the start of the string was preserved; it would then be the same as (dst - initial_dst).

Test code - with c1 to c2 fix - using first code example, on Solaris 10, with GCC 4.3.3. Results for 32-bit and 64-bit compilation shown. Data from Table 3.4 in Chapter 3 of the Unicode standard (technically, Unicode 5.0 rather than 5.1.0, but I don't think it matters).

enum { NULL = 0 };
enum { UNI_SUR_HIGH_START = 0xD800, UNI_SUR_HIGH_END = 0xDBFF,
       UNI_SUR_LOW_START  = 0xDC00, UNI_SUR_LOW_END  = 0xDFFF,
       UNI_SHIFT = 10, UNI_BASE = 0x10000 };

int StringCopy2to4bytes(const unsigned short* src, int src_size, 
                         unsigned int* dst, int dst_size)
{
 int cp_size = 0;

 const unsigned short *src_end = NULL;
 const unsigned int   *dst_end = NULL;

 unsigned int c1, c2;

 src_end = src + src_size;
 dst_end = dst + dst_size;

 while (src < src_end)
 {
     c1 = *src++;
     if ((c1 >= UNI_SUR_HIGH_START) && (c1 <= UNI_SUR_HIGH_END))
     {
         if (src < src_end)
         {
             c2 = *src;
             if ((c2 >= UNI_SUR_LOW_START) && (c2 <= UNI_SUR_LOW_END))
             {
                c1 = ((c1 - UNI_SUR_HIGH_START) << UNI_SHIFT) + 
                      (c2 - UNI_SUR_LOW_START )  + UNI_BASE;    /* Fixed */

                ++src;
             }
         } 
         else 
             return -1;
     } 

     if (dst >= dst_end) return -2;

     *dst++ = c1;

     cp_size++;
 }

 return cp_size;
}

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    unsigned short w2_chars[] = { 0x004D, 0x0430, 0x4E8C, 0xD800, 0xDF02, 0x004D };
    unsigned int   w4_wanted[] = { 0x00004D, 0x000430, 0x004E8C, 0x010302, 0x00004D };
    unsigned int   w4_actual[5];
    int w2_len = 6;
    int w4_len = 5;
    int w4_actlen;
    int i;
    int failed = 0;

    w4_actlen = StringCopy2to4bytes(w2_chars, w2_len, w4_actual, w4_len);
    if (w4_actlen != w4_len)
    {
        failed = 1;
        printf("Length mismatch: got %d, wanted %d\n", w4_actlen, w4_len);
    }
    for (i = 0; i < w4_len; i++)
    {
        if (w4_actual[i] != w4_wanted[i])
        {
            printf("Mismatch: index %d: wanted 0x%06X, actual 0x%06X\n",
                   i, w4_wanted[i], w4_actual[i]);
            failed = 1;
        }
    }
    if (failed == 0)
        printf("No problem observed\n");
    return((failed == 0) ? EXIT_SUCCESS : EXIT_FAILURE);
}


$ gcc -m32 -O utf.c -o utf32 && ./utf32
No problem observed
$ gcc -m64 -O utf.c -o utf64 && ./utf64
No problem observed
$

I'm left wondering what is up with your compiler - or your test case.

Here's a revised version of the StringCopy2to4bytes() function. It detects and reports an error condition that the original did not - namely when the second word of surrogate pair is not a valid low surrogate code point, it returns status -3.

int StringCopy2to4bytes(const unsigned short *src, int src_size, 
                        unsigned int *dst, int dst_size)
{
    const unsigned short *src_end = src + src_size;
    const unsigned int   *dst_end = dst + dst_size;
    const unsigned int   *dst0    = dst;

    while (src < src_end)
    {
        unsigned int c1 = *src++;
        if ((c1 >= UNI_SUR_HIGH_START) && (c1 <= UNI_SUR_HIGH_END))
        {
            if (src >= src_end)
                return -1;
            unsigned int c2 = *src++;
            if ((c2 >= UNI_SUR_LOW_START) && (c2 <= UNI_SUR_LOW_END))
            {
               c1 = ((c1 - UNI_SUR_HIGH_START) << UNI_SHIFT) + 
                     (c2 - UNI_SUR_LOW_START )  + UNI_BASE;    /* Fixed */
            }
            else
                return -3;  /* Invalid second code point in surrogate pair */
        } 
        if (dst >= dst_end)
            return -2; 
        *dst++ = c1;
    }
    return dst - dst0;
}

The same test code produces the same clean bill of health. The declaration of c2 assumes you are using C99 - not C89.

Jonathan Leffler 2009-02-16 04:10:51

uh? What do you mean?

Robert Gould 2009-02-16 04:12:18

I think he's talking about lines 24/25 from example 1. I wonder if the pointer arithmetic is a red herring where the problem is really this.

Michael Burr 2009-02-16 04:17:20

Added the values of the constants to the question

Robert Gould 2009-02-16 04:50:27

Thanks for the Unicode fix! Anyways the issue was a rare buffer overflow not the pointer arithmetics:/

Robert Gould 2009-02-16 06:55:10

Had to add unicode to the question so this could be an answer :)

Robert Gould 2009-02-16 06:57:02

@Robert - Just curious was the buffer overflow a bug in this routine or was an incorrect size being passed in (for example, size in bytes instead of size in chars)?

Michael Burr 2009-02-16 07:00:10

A buffer overflow makes more sense than a compiler bug, which was an outside possibility at most. I'm as curious as Michael about how the buffer overflow occurred. The function basically looks sound and safe, so I assume it was being misused.

Jonathan Leffler 2009-02-16 07:06:33

The buffer was shorter than the advertised size. Specifically some buffer sizes were being rounded to even numbers, when they should have been, say buffer[3] was actually buffer[2], so it was writing out of bounds

Robert Gould 2009-02-16 07:27:58

That'll do it every time. Compilers object when you lie to them, and usually find a way to get their own back. Bad luck on having to debug someone else's buffer overflows.

Jonathan Leffler 2009-02-16 08:16:29

Answer 3

+1 A:

This smells like we could need some () here.

Look at the difference

c1 = *src++;
c1 = (*src)++;
c1 = *(src++);

I really like () since they remove some ambiguity on what the programmer wants to do.

/Johan

Johan 2009-02-16 05:52:37

ansaurus

tags:

views:

answers:

What is the difference between these two versions of code (pointer arithmetics & unicode) ?

related questions