I have a Java application that uses a C++ DLL via JNI. A few of the DLL's methods take string arguments, and some of them return objects that contain strings as well.

Currently the DLL does not support Unicode, so the string handling is rather easy:

  • Java calls String.getBytes() and passes the resulting array to the DLL, which simply treats the data as a char*.
  • DLL uses NewStringUTF() to create a jstring from a const char*, as sketched below.
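For reference, here's roughly what that native side can look like. This is a minimal sketch, not the actual code; the class name, method name, and do_something() call are all made up for illustration.

#include <jni.h>

// do_something() stands in for an existing narrow-string DLL function.
const char* do_something(const char* input);

// Hypothetical native method: Java passes the result of String.getBytes()
// as a byte[], the DLL treats it as a char*, and the result goes back to
// Java as a jstring built with NewStringUTF().
JNIEXPORT jstring JNICALL
Java_Example_process(JNIEnv* env, jobject, jbyteArray bytes)
{
    jsize len = env->GetArrayLength(bytes);
    char* input = new char[len + 1];
    env->GetByteArrayRegion(bytes, 0, len, reinterpret_cast<jbyte*>(input));
    input[len] = '\0';                  // getBytes() does not add a terminator

    const char* output = do_something(input);
    delete[] input;

    return env->NewStringUTF(output);   // const char* -> jstring
}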

I'm now in the process of modifying the DLL to support Unicode, switching to the TCHAR type (which, when UNICODE is defined, maps to Windows' WCHAR datatype). Modifying the DLL itself is going well, but I'm not sure how to modify the JNI portion of the code.

The only thing I can think of right now is this:

  • Java calls String.getBytes(String charsetName) and passes the resulting array to the DLL, which treats the data as a wchar_t*.
  • DLL no longer creates Strings, but instead passes jbyteArrays with the raw string data. Java uses the String(byte[] bytes, String charsetName) constructor to actually create the String (see the sketch after this list).
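The native side of that scheme might look something like the following. Again this is just a sketch with made-up names (do_something_w() standing in for the wide-character version of the DLL call); which charset name Java should use for the raw bytes is exactly the open question below.

#include <jni.h>
#include <cwchar>

// do_something_w() stands in for the wide-character version of the DLL function.
const wchar_t* do_something_w(const wchar_t* input);

JNIEXPORT jbyteArray JNICALL
Java_Example_processW(JNIEnv* env, jobject, jbyteArray bytes)
{
    // Interpret the incoming byte[] (produced by String.getBytes(charsetName))
    // as a NUL-terminated wchar_t string.
    jsize len = env->GetArrayLength(bytes);
    wchar_t* input = new wchar_t[len / sizeof(wchar_t) + 1];
    env->GetByteArrayRegion(bytes, 0, len, reinterpret_cast<jbyte*>(input));
    input[len / sizeof(wchar_t)] = L'\0';

    const wchar_t* output = do_something_w(input);
    delete[] input;

    // Return the raw bytes; Java rebuilds the String with
    // new String(byte[], charsetName).
    jsize outLen = static_cast<jsize>(wcslen(output) * sizeof(wchar_t));
    jbyteArray result = env->NewByteArray(outLen);
    env->SetByteArrayRegion(result, 0, outLen,
                            reinterpret_cast<const jbyte*>(output));
    return result;
}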

The only problem with this method is that I'm not sure what charset name to use. WCHARs are 2 bytes long, so I'm pretty sure it's UTF-16, but there are 3 possibilities on the Java side: UTF-16, UTF-16BE, and UTF-16LE. I haven't found any documentation that tells me what the byte order is, but I can probably figure it out from some quick testing.

Is there a better way? If possible I'd like to continue constructing the jstring objects within the DLL, as that way I won't have to modify any of the usages of those methods. However, the NewString JNI method doesn't take a charset identifier.
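For context, the JNI spec declares NewString as taking UTF-16 code units plus a length rather than encoded bytes and a charset name:

// JNI prototype (C interface): the input is an array of jchar
// (16-bit UTF-16 code units), not bytes in a named charset.
jstring NewString(JNIEnv *env, const jchar *unicodeChars, jsize len);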

+3  A: 

This answer suggests that the byte ordering of WCHARs is not guaranteed...

Since you are on Windows, you could try WideCharToMultiByte to convert the WCHARs to UTF-8 and then use your existing JNI code.

You will need to be careful using WideCharToMultiByte due to the possibility of buffer overruns in the lpMultiByteStr parameter. To get round this you should call the function twice, first with lpMultiByteStr set to NULL and cbMultiByte set to zero - this will return the length of the required lpMultiByteStr buffer without attempting to write to it. Once you have the length you can allocate a buffer of the required size and call the function again.

Example code:

int utf8_length;

wchar_t* utf16 = ...;

utf8_length = WideCharToMultiByte(
  CP_UTF8,           // Convert to UTF-8
  0,                 // No special character conversions required 
                     // (UTF-16 and UTF-8 support the same characters)
  utf16,             // UTF-16 string to convert
  -1,                // utf16 is NULL terminated (if not, use length)
  NULL,              // Determining correct output buffer size
  0,                 // Determining correct output buffer size
  NULL,              // Must be NULL for CP_UTF8
  NULL);             // Must be NULL for CP_UTF8

if (utf8_length == 0) {
  // Error - call GetLastError for details
}

char* utf8 = (char*)malloc(utf8_length); // Allocate space for the UTF-8 string
                                         // (utf8_length includes the NUL terminator
                                         //  because cchWideChar was passed as -1)

utf8_length = WideCharToMultiByte(
  CP_UTF8,           // Convert to UTF-8
  0,                 // No special character conversions required 
                     // (UTF-16 and UTF-8 support the same characters)
  utf16,             // UTF-16 string to convert
  -1,                // utf16 is NULL terminated (if not, use length)
  utf8,              // UTF-8 output buffer
  utf8_length,       // UTF-8 output buffer size
  NULL,              // Must be NULL for CP_UTF8
  NULL);             // Must be NULL for CP_UTF8

if (utf8_length == 0) {
  // Error - call GetLastError for details
}
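
To plug this into the existing JNI code, the conversion can be wrapped in a small helper that hands the UTF-8 bytes to NewStringUTF(). The helper below is only a sketch (the function name is made up, and error handling is reduced to returning NULL); note also that NewStringUTF expects Java's modified UTF-8, which matches standard UTF-8 for most text but differs for embedded NULs and characters outside the BMP.

#include <windows.h>
#include <jni.h>
#include <cstdlib>

// Convert a NUL-terminated UTF-16 (WCHAR) string to a jstring by going
// through UTF-8 and the existing NewStringUTF() path.
jstring WideToJString(JNIEnv* env, const wchar_t* utf16)
{
    // First call: determine the required buffer size. The result includes
    // the NUL terminator because the input length is given as -1.
    int utf8_length = WideCharToMultiByte(CP_UTF8, 0, utf16, -1,
                                          NULL, 0, NULL, NULL);
    if (utf8_length == 0) {
        return NULL;                    // GetLastError() has the details
    }

    char* utf8 = (char*)malloc(utf8_length);
    if (utf8 == NULL) {
        return NULL;
    }

    // Second call: perform the actual conversion into the allocated buffer.
    if (WideCharToMultiByte(CP_UTF8, 0, utf16, -1,
                            utf8, utf8_length, NULL, NULL) == 0) {
        free(utf8);
        return NULL;                    // GetLastError() has the details
    }

    jstring result = env->NewStringUTF(utf8);
    free(utf8);
    return result;
}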
Matthew Murdoch
Hm, I hadn't considered converting the wide-char string to a UTF-8 string first. I assume to use that method I'd want the CP_UTF8 codepage argument?
Herms
Yes, the CodePage argument must be CP_UTF8.
Matthew Murdoch
Thanks for the example code. I wasn't completely sure about a couple of those arguments, and it's nice to have confirmation that I guessed right. :)
Herms
+2  A: 

I found a little FAQ about the byte order mark. Also from that FAQ:

UTF-16 and UTF-32 use code units that are two and four bytes long respectively. For these UTFs, there are three sub-flavors: BE, LE and unmarked. The BE form uses big-endian byte serialization (most significant byte first), the LE form uses little-endian byte serialization (least significant byte first) and the unmarked form uses big-endian byte serialization by default, but may include a byte order mark at the beginning to indicate the actual byte serialization used.

I'm assuming that on the Java side, UTF-16 will try to find this BOM and deal with the encoding properly. We all know how dangerous assumptions can be...

Edit because of comment:

Microsoft uses UTF-16 little-endian. Java's UTF-16 charset tries to interpret the BOM; when the BOM is missing it defaults to UTF-16BE. The UTF-16BE and UTF-16LE variants ignore the BOM.
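
In practice that means the native side can either label the raw bytes with a little-endian BOM (so Java's plain "UTF-16" decoder detects the byte order itself) or the Java side can simply be told "UTF-16LE". A sketch of the BOM approach, assuming the jbyteArray scheme from the question (the helper name is made up):

#include <jni.h>
#include <cwchar>

// Build a jbyteArray from a WCHAR string, prefixed with the little-endian
// byte order mark (0xFF 0xFE) so that Java's "UTF-16" charset can detect
// the byte order on its own.
jbyteArray WideToJBytesWithBom(JNIEnv* env, const wchar_t* utf16)
{
    const unsigned char bom[2] = { 0xFF, 0xFE };   // UTF-16LE BOM
    jsize textBytes = static_cast<jsize>(wcslen(utf16) * sizeof(wchar_t));

    jbyteArray result = env->NewByteArray(textBytes + 2);
    if (result == NULL) {
        return NULL;                               // OutOfMemoryError pending
    }
    env->SetByteArrayRegion(result, 0, 2,
                            reinterpret_cast<const jbyte*>(bom));
    env->SetByteArrayRegion(result, 2, textBytes,
                            reinterpret_cast<const jbyte*>(utf16));
    return result;
}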

Onots
Oh, I know what the different UTF-16 versions are, I just don't know which one Windows is actually using for WCHAR.
Herms