Hi,

I am trying to create a DLL for authentication using Java and JNI.

To create the DLL, I created a Win32 application with the Character Set option set to "Use Multi-Byte Character Set" and the Runtime Library set to Multi-threaded (/MT).

I have tested the DLL on WinXP with valid and invalid user credentials. Both work fine.

I need to know whether the same DLL will work in Chinese / Japanese environments as well.

Can anyone help me out with this issue?

Thanks in advance.

Regards

Jegan K S

+1  A: 

It should work fine, as long as you only ever treat strings as opaque blobs. When you start accessing them "char-by-char" (i.e. byte-by-byte), things might go wrong if you assume a C char is a complete character. Likewise, if you assume you can split the string in the middle into two substrings, it might go wrong, etc.
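To make the "char-by-char" pitfall concrete, here is a minimal sketch (walk_mbcs is a made-up name; CharNextA and IsDBCSLeadByteEx are the Win32 helpers for multi-byte code pages):

    #include <windows.h>

    // In a multi-byte code page such as Shift-JIS, a character may span
    // two bytes, so advancing with p++ can land in the middle of one.
    // CharNextA advances by whole characters instead.
    void walk_mbcs(const char* s)
    {
        for (const char* p = s; *p != '\0'; p = CharNextA(p)) {
            // *p is only the first byte of the current character;
            // IsDBCSLeadByteEx(CP_ACP, *p) tells whether a trail byte follows.
        }
    }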

Also, there is the question of how you convert a Java string into such a multi-byte string; there are right and wrong ways to do that.

Martin v. Löwis
JNI has a standard way to get access to the bytes in a Java string (GetStringUTFChars if I remember right), which means it comes out as UTF-8. Thus, if it handles any non-ASCII at all (including your name, I see), then it should handle CJK characters just fine.
Chris Jester-Young
One oddity I've seen (I don't know if it's still the case) is that the JVM's UTF-8 support doesn't handle 4-byte sequences (for code points outside the BMP) directly. Instead, it uses two 3-byte sequences corresponding to the UTF-16 surrogate pair representation. It's seriously ick.
Chris Jester-Young
The latter representation is called CESU-8: http://www.unicode.org/reports/tr26/
Chris Jester-Young
If the Java UTF-8 function has been used, then it will definitely fail with the Windows MBCS API, which requires text encoded in the ANSI code page, not UTF-8.
Martin v. Löwis
Agreed; I missed the bit in the question about the MBCS API. I do think, then, that the Unicode API is the better choice, but in any case the OP should use GetStringChars (not GetStringUTFChars), and then either use the result directly (Unicode API) or convert it with WideCharToMultiByte (MBCS API); a sketch follows this thread.
Chris Jester-Young
Chris: I completely agree. Using MBCS is probably a bad choice.
Martin v. Löwis
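Putting the thread's advice together, a minimal sketch of the conversion (toAnsi is a hypothetical helper name; error handling is omitted):

    #include <jni.h>
    #include <windows.h>
    #include <string>

    // Pull the string out of the JVM as UTF-16 with GetStringChars,
    // then let Windows convert it to the ANSI code page.
    std::string toAnsi(JNIEnv* env, jstring js)
    {
        const jchar* utf16 = env->GetStringChars(js, nullptr);
        jsize len = env->GetStringLength(js);

        // First call asks how many bytes the ANSI form needs.
        int bytes = WideCharToMultiByte(CP_ACP, 0,
                                        reinterpret_cast<const wchar_t*>(utf16),
                                        len, nullptr, 0, nullptr, nullptr);
        std::string ansi(bytes, '\0');
        WideCharToMultiByte(CP_ACP, 0,
                            reinterpret_cast<const wchar_t*>(utf16),
                            len, &ansi[0], bytes, nullptr, nullptr);

        env->ReleaseStringChars(js, utf16);
        return ansi;
    }

Note that WideCharToMultiByte silently substitutes a default character for anything the ANSI code page cannot represent, which is one more argument for staying with the Unicode API throughout.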
+1  A: 

What Martin writes is true:

It should work fine, as long as you only ever treat strings as opaque blobs. When you start accessing them "char-by-char" (i.e. byte-by-byte), things might go wrong if you assume a C char is a complete character. Likewise, if you assume you can split the string in the middle into two substrings, it might go wrong, etc.

But it's worse than that. Running on a Japanese or Chinese system merely makes it more likely that your code will encounter multi-byte (non-ASCII) text. Even on a US English system (the simplest case), it's entirely possible: users can type, paste, or open files containing any text, so don't assume the strings the user interface shows by default are the limit of what you might encounter.

Also note that converting your project to "Unicode" (as Microsoft calls it) won't make the problem go away, because Microsoft's choice of Unicode encoding is UTF-16, which has the same variable-width problems, just less often. (In UTF-16, the term to look out for is "surrogate pair".)
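For a concrete picture of what a surrogate pair looks like, a minimal sketch (Windows-specific, since it relies on wchar_t being 16 bits):

    #include <cassert>
    #include <cwchar>

    int main()
    {
        // U+1D11E (musical G clef) lies outside the BMP, so in UTF-16
        // it occupies two 16-bit code units: the surrogate pair D834 DD1E.
        const wchar_t clef[] = L"\U0001D11E";
        assert(wcslen(clef) == 2);   // two code units...
        assert(clef[0] == 0xD834);   // high surrogate
        assert(clef[1] == 0xDD1E);   // low surrogate, but ONE character
        return 0;
    }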

Text processing is hard. Let's go shopping!

Integer Poet