I have a Java application that uses a C++ DLL via JNI. A few of the DLL's methods take string arguments, and some of them return objects that contain strings as well.

Currently the DLL does not support Unicode, so the string handling is rather easy:

  • Java calls String.getBytes() and passes the resulting array to the DLL, which simply treats the data as a char*.
  • DLL uses NewStringUTF() to create a jstring from a const char*, as sketched below.
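For reference, here's roughly what that native side can look like. This is a minimal sketch, not the actual code; the class name, method name, and do_something() call are all made up for illustration.

#include <jni.h>

// do_something() stands in for an existing narrow-string DLL function.
const char* do_something(const char* input);

// Hypothetical native method: Java passes the result of String.getBytes()
// as a byte[], the DLL treats it as a char*, and the result goes back to
// Java as a jstring built with NewStringUTF().
JNIEXPORT jstring JNICALL
Java_Example_process(JNIEnv* env, jobject, jbyteArray bytes)
{
    jsize len = env->GetArrayLength(bytes);
    char* input = new char[len + 1];
    env->GetByteArrayRegion(bytes, 0, len, reinterpret_cast<jbyte*>(input));
    input[len] = '\0';                  // getBytes() does not add a terminator

    const char* output = do_something(input);
    delete[] input;

    return env->NewStringUTF(output);   // const char* -> jstring
}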

I'm now in the process of modifying the DLL to support Unicode, switching to the TCHAR type (which, when UNICODE is defined, maps to Windows' WCHAR datatype). Modifying the DLL itself is going well, but I'm not sure how to modify the JNI portion of the code.

The only thing I can think of right now is this:

  • Java calls String.getBytes(String charsetName) and passes the resulting array to the DLL, which treats the data as a wchar_t*.
  • DLL no longer creates Strings, but instead passes jbyteArrays with the raw string data. Java uses the String(byte[] bytes, String charsetName) constructor to actually create the String (see the sketch after this list).
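The native side of that scheme might look something like the following. Again this is just a sketch with made-up names (do_something_w() standing in for the wide-character version of the DLL call); which charset name Java should use for the raw bytes is exactly the open question below.

#include <jni.h>
#include <cwchar>

// do_something_w() stands in for the wide-character version of the DLL function.
const wchar_t* do_something_w(const wchar_t* input);

JNIEXPORT jbyteArray JNICALL
Java_Example_processW(JNIEnv* env, jobject, jbyteArray bytes)
{
    // Interpret the incoming byte[] (produced by String.getBytes(charsetName))
    // as a NUL-terminated wchar_t string.
    jsize len = env->GetArrayLength(bytes);
    wchar_t* input = new wchar_t[len / sizeof(wchar_t) + 1];
    env->GetByteArrayRegion(bytes, 0, len, reinterpret_cast<jbyte*>(input));
    input[len / sizeof(wchar_t)] = L'\0';

    const wchar_t* output = do_something_w(input);
    delete[] input;

    // Return the raw bytes; Java rebuilds the String with
    // new String(byte[], charsetName).
    jsize outLen = static_cast<jsize>(wcslen(output) * sizeof(wchar_t));
    jbyteArray result = env->NewByteArray(outLen);
    env->SetByteArrayRegion(result, 0, outLen,
                            reinterpret_cast<const jbyte*>(output));
    return result;
}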

The only problem with this method is that I'm not sure what charset name to use. WCHARs are 2 bytes long, so I'm pretty sure it's UTF-16, but there are 3 possibilities on the Java side: UTF-16, UTF-16BE, and UTF-16LE. I haven't found any documentation that tells me what the byte order is, but I can probably figure it out from some quick testing.

Is there a better way? If possible I'd like to continue constructing the jstring objects within the DLL, as that way I won't have to modify any of the usages of those methods. However, the NewString JNI method doesn't take a charset identifier.
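For context, the JNI spec declares NewString as taking UTF-16 code units plus a length rather than encoded bytes and a charset name:

// JNI prototype (C interface): the input is an array of jchar
// (16-bit UTF-16 code units), not bytes in a named charset.
jstring NewString(JNIEnv *env, const jchar *unicodeChars, jsize len);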

+3  A: 

This answer suggests that the byte ordering of WCHARs is not guaranteed...

Since you are on Windows, you could try WideCharToMultiByte to convert the WCHARs to UTF-8 and then use your existing JNI code.

You will need to be careful using WideCharToMultiByte due to the possibility of buffer overruns in the lpMultiByteStr parameter. To get round this you should call the function twice, first with lpMultiByteStr set to NULL and cbMultiByte set to zero - this will return the length of the required lpMultiByteStr buffer without attempting to write to it. Once you have the length you can allocate a buffer of the required size and call the function again.

Example code:

int utf8_length;

wchar_t* utf16 = ...;

utf8_length = WideCharToMultiByte(
  CP_UTF8,           // Convert to UTF-8
  0,                 // No special character conversions required 
                     // (UTF-16 and UTF-8 support the same characters)
  utf16,             // UTF-16 string to convert
  -1,                // utf16 is NULL terminated (if not, use length)
  NULL,              // Determining correct output buffer size
  0,                 // Determining correct output buffer size
  NULL,              // Must be NULL for CP_UTF8
  NULL);             // Must be NULL for CP_UTF8

if (utf8_length == 0) {
  // Error - call GetLastError for details
}

char* utf8 = (char*)malloc(utf8_length); // Allocate space for the UTF-8 string
                                         // (utf8_length includes the NUL terminator
                                         //  because cchWideChar was passed as -1)

utf8_length = WideCharToMultiByte(
  CP_UTF8,           // Convert to UTF-8
  0,                 // No special character conversions required 
                     // (UTF-16 and UTF-8 support the same characters)
  utf16,             // UTF-16 string to convert
  -1,                // utf16 is NULL terminated (if not, use length)
  utf8,              // UTF-8 output buffer
  utf8_length,       // UTF-8 output buffer size
  NULL,              // Must be NULL for CP_UTF8
  NULL);             // Must be NULL for CP_UTF8

if (utf8_length == 0) {
  // Error - call GetLastError for details
}
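
To plug this into the existing JNI code, the conversion can be wrapped in a small helper that hands the UTF-8 bytes to NewStringUTF(). The helper below is only a sketch (the function name is made up, and error handling is reduced to returning NULL); note also that NewStringUTF expects Java's modified UTF-8, which matches standard UTF-8 for most text but differs for embedded NULs and characters outside the BMP.

#include <windows.h>
#include <jni.h>
#include <cstdlib>

// Convert a NUL-terminated UTF-16 (WCHAR) string to a jstring by going
// through UTF-8 and the existing NewStringUTF() path.
jstring WideToJString(JNIEnv* env, const wchar_t* utf16)
{
    // First call: determine the required buffer size. The result includes
    // the NUL terminator because the input length is given as -1.
    int utf8_length = WideCharToMultiByte(CP_UTF8, 0, utf16, -1,
                                          NULL, 0, NULL, NULL);
    if (utf8_length == 0) {
        return NULL;                    // GetLastError() has the details
    }

    char* utf8 = (char*)malloc(utf8_length);
    if (utf8 == NULL) {
        return NULL;
    }

    // Second call: perform the actual conversion into the allocated buffer.
    if (WideCharToMultiByte(CP_UTF8, 0, utf16, -1,
                            utf8, utf8_length, NULL, NULL) == 0) {
        free(utf8);
        return NULL;                    // GetLastError() has the details
    }

    jstring result = env->NewStringUTF(utf8);
    free(utf8);
    return result;
}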
Matthew Murdoch
Hm, I hadn't considered converting the wide-char string to a UTF-8 string first. I assume to use that method I'd want the CP_UTF8 codepage argument?
Herms
Yes, the CodePage argument must be CP_UTF8.
Matthew Murdoch
Thanks for the example code. I wasn't completely sure about a couple of those arguments, and it's nice to have confirmation that I guessed right. :)
Herms
+2  A: 

I found a little FAQ about the byte order mark. Also from that FAQ:

UTF-16 and UTF-32 use code units that are two and four bytes long respectively. For these UTFs, there are three sub-flavors: BE, LE and unmarked. The BE form uses big-endian byte serialization (most significant byte first), the LE form uses little-endian byte serialization (least significant byte first) and the unmarked form uses big-endian byte serialization by default, but may include a byte order mark at the beginning to indicate the actual byte serialization used.

I'm assuming that on the Java side, UTF-16 will try to find this BOM and deal with the encoding properly. We all know how dangerous assumptions can be...

Edit because of comment:

Microsoft uses UTF-16 little-endian. Java's UTF-16 charset tries to interpret the BOM; when the BOM is missing it defaults to UTF-16BE. The UTF-16BE and UTF-16LE variants ignore the BOM.
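
In practice that means the native side can either label the raw bytes with a little-endian BOM (so Java's plain "UTF-16" decoder detects the byte order itself) or the Java side can simply be told "UTF-16LE". A sketch of the BOM approach, assuming the jbyteArray scheme from the question (the helper name is made up):

#include <jni.h>
#include <cwchar>

// Build a jbyteArray from a WCHAR string, prefixed with the little-endian
// byte order mark (0xFF 0xFE) so that Java's "UTF-16" charset can detect
// the byte order on its own.
jbyteArray WideToJBytesWithBom(JNIEnv* env, const wchar_t* utf16)
{
    const unsigned char bom[2] = { 0xFF, 0xFE };   // UTF-16LE BOM
    jsize textBytes = static_cast<jsize>(wcslen(utf16) * sizeof(wchar_t));

    jbyteArray result = env->NewByteArray(textBytes + 2);
    if (result == NULL) {
        return NULL;                               // OutOfMemoryError pending
    }
    env->SetByteArrayRegion(result, 0, 2,
                            reinterpret_cast<const jbyte*>(bom));
    env->SetByteArrayRegion(result, 2, textBytes,
                            reinterpret_cast<const jbyte*>(utf16));
    return result;
}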

Onots
Oh, I know what the different UTF-16 versions are, I just don't know which one Windows is actually using for WCHAR.
Herms