I am trying to create a simple data structure that makes it easy to convert back and forth between ASCII strings and Unicode strings. My issue is that the length returned by mbstowcs is correct, but the length returned by wcslen on the newly created wchar_t string is not. Am I missing something here?

typedef struct{

    wchar_t *string;
    long length; // I have also tried int, and size_t
} String;

void setCString(String *obj, char *str){

    obj->length = strlen(str);

    free(obj->string); // Free original string
    obj->string = (wchar_t *)malloc((obj->length + 1) * sizeof(wchar_t)); //Allocate space for new string to be copied to

    //memset(obj->string,'\0',(obj->length + 1)); NOTE: I tried this but it doesn't make any difference

    size_t length = 0;

    length = mbstowcs(obj->string, (const char *)str, obj->length);

    printf("Length = %d\n",(int)length); // Prints correct length
    printf("!C string %s converted to wchar string %ls\n",str,obj->string); //obj->string is of a wcslen size larger than Length above...

    if(length != wcslen(obj->string))
            printf("Length failure!\n");

    if(length == -1)
    {
        //Conversion failed, set string to an empty (null-terminated) string
        free(obj->string);
        obj->string = (wchar_t *)malloc(sizeof(wchar_t));
        obj->string[0] = L'\0'; // write the terminator into the buffer; assigning to obj->string itself would leak it
    }
    else
    {
        //Conversion worked! but wcslen (and printf("%ls)) show the string is actually larger than length
        //do stuff
    }
}
A: 

The code seems to work fine for me. Can you provide more context, such as the content of strings you're passing to it, and what locale you're using?

A few other bugs/style issues I noticed:

  • obj->length is left as the allocated length, rather than updated to match the length in (wide) characters. Is that your intention?
  • The cast to const char * is useless and bad style.

Edit: Upon discussion, it looks like you may be using a nonconformant Windows version of the mbstowcs function. If so, please update your question to say so.

Edit 2: The code only happened to work for me because malloc returned a fresh, zero-filled buffer. Since you are passing obj->length to mbstowcs as the maximum number of wchar_t values to write to the destination, it will run out of space and not be able to write the null terminator unless there's a proper multibyte character (one which requires more than a single byte) in the source string. Change this to obj->length+1 and it should work fine.

R..
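The point about the limit argument can be checked with a small sketch. The helper name terminator_written is hypothetical (it is not part of the question's code), and the sketch assumes plain ASCII input so that it behaves the same in the default "C" locale. It pre-fills the destination with a sentinel so that a missing terminator is visible regardless of what malloc returns.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper, for illustration only.
   Converts str into a sentinel-filled buffer of strlen(str) + 1 wide
   characters, passing n as the limit to mbstowcs.
   Returns 1 if a terminator was written at index strlen(str),
   0 if it was not, and -1 on allocation or conversion failure. */
int terminator_written(const char *str, size_t n)
{
    size_t len = strlen(str);
    wchar_t *buf = malloc((len + 1) * sizeof *buf);
    if (buf == NULL)
        return -1;

    /* Fill every element with a non-zero sentinel so the result does
       not depend on malloc happening to return zeroed memory. */
    wmemset(buf, L'x', len + 1);

    size_t converted = mbstowcs(buf, str, n);
    int result = (converted == (size_t)-1) ? -1 : (buf[len] == L'\0');

    free(buf);
    return result;
}
```

For "Hello!", a limit of 6 leaves the sentinel in place (the function returns 0), while a limit of 7 lets mbstowcs write the terminator (it returns 1).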
A: 

I am running this on Ubuntu linux with UTF-8 as locale.

Here is the additional info as requested:

I am calling this function with a fully allocated structure and passing in a hard-coded "string" (not an L"string"), so the call is essentially setCString(obj, "Hello!").

Length = 6

!C string Hello! converted to wchar string Hello!xxxxxxxxxxxxxxxxxxxx

(where x = random data)

Length failure!

For reference, printf("wcslen = %d\n",(int)wcslen(obj->string)); prints wcslen = 11

Tyler
Actually I'm not sure of the UTF-8 part because I read somewhere that gcc defaults everything to UTF-32. Of course I could be wrong on this assumption as well...
Tyler
If you've put "random data" in for 'xxxxxxx' then `mbstowcs` will almost surely fail with `errno==EILSEQ`, returning `(size_t)-1` (since "random data" is not likely to be valid UTF-8), but `wcslen` will report the length of the successfully-converted part plus whatever junk is already in the output buffer, since it won't get null-terminated.
R..
No sorry, my converted wchar string has random bytes tacked on the end for some reason, that's the problem. It's like it doesn't have \0 until it randomly hits one. I thought mbstowcs was supposed to copy the string terminating null byte(s?) when it did the conversion.
Tyler
Because you're passing `obj->length` and not `obj->length+1` to `mbstowcs`, it cannot null-terminate. The code only happened to work for me because `malloc` returned fresh (zero-filled) memory. It will also happen to work if there are any actual multi-byte characters since then your destination will have extra space. BTW your `memset` never helped because you forgot to multiply by `sizeof(wchar_t)` (or better, use `wmemset`).
R..
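The memset remark above comes down to bytes versus elements: memset counts bytes, wmemset counts wchar_t elements. A minimal sketch of zero-filling a wide buffer, using a hypothetical helper name alloc_zeroed_wide that does not appear in the original code:

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper, for illustration only.
   Allocates count wide characters and sets each one to L'\0'.
   Returns NULL if the allocation fails. */
wchar_t *alloc_zeroed_wide(size_t count)
{
    wchar_t *buf = malloc(count * sizeof *buf);
    if (buf != NULL) {
        /* wmemset takes a count of wchar_t elements; the memset
           equivalent would need count * sizeof(wchar_t) bytes. */
        wmemset(buf, L'\0', count);
    }
    return buf;
}
```

Zero-filling only hides the missing terminator; passing obj->length + 1 to mbstowcs is still the actual fix.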
If my answer solved your problem, please accept it. If not, please follow up so that I (or someone else) can finish answering.
R..
A: 

The length you need to pass to mbstowcs() includes the L'\0' terminator character, but your calculated length in obj->length does not include it, so you need to add 1 to the value passed to mbstowcs().

In addition, instead of using strlen(str) to determine the length of the converted string, you should be using mbstowcs(0, str, 0), and allocating one more element than that for the terminator. You should also change the type of str to const char * and drop the cast. realloc() can be used in place of a free() / malloc() pair. Overall, it should look like:

typedef struct {
    wchar_t *string;
    size_t length;
} String;

void setCString(String *obj, const char *str)
{
    obj->length = mbstowcs(0, str, 0);
    obj->string = realloc(obj->string, (obj->length + 1) * sizeof(wchar_t)); 

    size_t length = mbstowcs(obj->string, str, obj->length + 1);

    printf("Length = %zu\n", length);
    printf("!C string %s converted to wchar string %ls\n", str, obj->string);

    if (length != wcslen(obj->string))
            printf("Length failure!\n");

    if (length == (size_t)-1)
    {
        //Conversion failed, set string to an empty (null-terminated) string
        obj->string = realloc(obj->string, sizeof(wchar_t));
        obj->string[0] = L'\0'; // write the terminator into the buffer rather than overwriting the pointer
    }
    else
    {
        //Conversion worked!
        //do stuff
    }
}
caf
Works perfectly. Thanks caf and R..
Tyler
@Tyler note that in the failure case, `obj->length` is not reset to 1. I'm not sure that the `realloc()` in that case is worth it, either - you might as well leave the larger block allocated.
caf
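One way to act on the failure-case remarks above is to check the result of the sizing call before allocating anything. This is only a sketch under stated assumptions: the name setCStringChecked and the int return value are hypothetical, it relies on the POSIX behaviour of mbstowcs with a null destination, and it assumes any block already stored in obj->string holds at least one wchar_t.

```c
#include <stdlib.h>
#include <wchar.h>

typedef struct {
    wchar_t *string;
    size_t length;   /* wide characters, excluding the terminator */
} String;

/* Hypothetical variant, for illustration only.
   Returns 0 on success and -1 on failure. */
int setCStringChecked(String *obj, const char *str)
{
    /* Sizing call: with a null destination, POSIX mbstowcs returns the
       number of wide characters needed, or (size_t)-1 on invalid input. */
    size_t needed = mbstowcs(NULL, str, 0);
    if (needed == (size_t)-1) {
        /* Keep the existing block (assumed to hold at least one
           element) and turn it into an empty string. */
        if (obj->string != NULL)
            obj->string[0] = L'\0';
        obj->length = 0;
        return -1;
    }

    wchar_t *buf = realloc(obj->string, (needed + 1) * sizeof *buf);
    if (buf == NULL)
        return -1;   /* the old block is still valid and still owned by obj */

    obj->string = buf;
    /* needed + 1 leaves room for mbstowcs to write the terminator. */
    obj->length = mbstowcs(buf, str, needed + 1);
    return 0;
}
```

A caller would start from an initialised structure, for example String s = { NULL, 0 };, call setCStringChecked(&s, "Hello!"), and free(s.string) when done.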