I have a Java application that receives data over a socket using an InputStreamReader. It reports "Cp1252" from its getEncoding method:

/* java.net. */ Socket Sock = ...;
InputStreamReader is = new InputStreamReader(Sock.getInputStream());
System.out.println("Character encoding = " + is.getEncoding());
// Prints "Character encoding = Cp1252"

That doesn't necessarily match what the system reports as its code page. For example:

C:\>chcp
Active code page: 850

The application may receive byte 0x81, which in code page 850 represents the character ü. The program interprets that byte with code page 1252, which doesn't define any character at that value, so I get a question mark instead.

I was able to work around this problem for one customer who used code page 850 by adding another command-line option in the batch file that launches the application:

java.exe -Dfile.encoding=Cp850 ...

But not all my customers use code page 850, of course. How can I get Java to use a code page that's compatible with the underlying Windows system? My preference would be something I could just put in the batch file, leaving the Java code untouched:

ENC=...
java.exe -Dfile.encoding=%ENC% ...
+4  A: 

Regarding the code snippet, the right answer is to use the InputStreamReader constructor that takes an explicit charset, so the bytes are decoded with the encoding the sender actually used. That way it doesn't matter what the system's default encoding is; you know the decoding matches what arrives on the socket.

You can likewise specify the encoding explicitly when you write out files, rather than relying on the system encoding. Users who open those files on that system may still run into issues, but modern Windows systems support UTF-8, so you can write the files as UTF-8 if you need to. (Internally, Java represents all Strings as UTF-16.)

I think this is the "right" solution in general, and the one most compatible with the widest range of underlying systems.
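
As a minimal sketch of the explicit-charset constructor, the example below decodes the byte 0x81 from the question with code page 850. The class name ExplicitCharsetDemo and the helper decode are made up for illustration, and a ByteArrayInputStream stands in for Sock.getInputStream(). It assumes the runtime ships the IBM850 charset (the java.nio name for Cp850).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

// Hypothetical demo class; not part of the original application.
class ExplicitCharsetDemo {

    // Reads the whole stream as text, decoding with the named charset
    // instead of the platform default.
    public static String decode(InputStream in, String charsetName) {
        // Throws an IllegalArgumentException subclass if the name is bad or unsupported.
        Charset cs = Charset.forName(charsetName);
        try (Reader reader = new InputStreamReader(in, cs)) {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for the socket stream: the single byte 0x81 from the question.
        byte[] wire = { (byte) 0x81 };
        String text = decode(new ByteArrayInputStream(wire), "IBM850");
        // Prints "U+00FC", the code point of the character ü.
        System.out.printf("U+%04X%n", (int) text.charAt(0));
    }
}
```

The charset name still has to come from somewhere (configuration, the peer, or a command-line option); the constructor only removes the dependency on the JVM default.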

Yishai
+1. BTW On my Windows 7 system the active code page is 850, but Java reports "Cp1252" as the "file.encoding" system property.
Vinay Sajip
The clients and server are to be configured with the same encoding, whatever that might be for any given customer. A non-Java app sends character data to the server using the local code page, the server stores the data, and later the server sends it to the Java app. Nobody stores what the code page is, because as long as everyone used the same one, it didn't matter. The problem is that the Java app doesn't cooperate; it always uses Cp1252. (The "right" solution is to change the protocol to force everything to, say, UTF-8, but a protocol change breaks all existing installations.)
Rob Kennedy
Then it sounds like G_A has your answer. Another option is to have the non-Java app report to your Java application what it thinks the encoding is, and then use the appropriate constructor, as outlined above.
Yishai
+4  A: 

Windows has the added complication of having two active code pages. In your example both 1252 and 850 are correct; which one you see depends on how the program is run. For GUI applications, Windows uses the ANSI code page, which for Western European languages is typically 1252. The command line, however, reports the OEM code page, which is 850 for the same locales.
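
To see which of the two code pages a given JVM has picked up, a small diagnostic like the following may help. It is only a sketch: the class name EncodingReport is made up, sun.jnu.encoding is an internal, unsupported property, and native.encoding and stdout.encoding exist only on newer JDKs (roughly 17 and 19 onward), so some values may print as "null".

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical diagnostic class; property availability depends on the JDK version.
class EncodingReport {

    // Collects the encoding-related settings this JVM exposes.
    public static Map<String, String> collect() {
        Map<String, String> report = new LinkedHashMap<>();
        report.put("Charset.defaultCharset()", Charset.defaultCharset().name());
        String[] keys = { "file.encoding", "sun.jnu.encoding", "native.encoding", "stdout.encoding" };
        for (String key : keys) {
            // String.valueOf turns a missing property into the text "null".
            report.put(key, String.valueOf(System.getProperty(key)));
        }
        return report;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, String> e : collect().entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

Running it once from a console window and once from a GUI launcher on the same machine should show whether the values differ between the two contexts.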

ferdley
You've made true statements, but I'm not sure how they answer my question. Evidently, the OEM code page is the one the Java program needs to be compatible with. So, how do I choose a `file.encoding` value based on that? The way the program is being run is via `java.exe`.
Rob Kennedy
+3  A: 

If the value reported by the chcp command is the one you need, you can capture it with the following command:

C:\>for /F "Tokens=4" %I in ('chcp') Do Set CodePage=%I

This sets the variable CodePage to the code page value returned by chcp:

C:\>echo %CodePage%
437

You could use this value in your bat file by prefixing it with Cp:

C:\>echo Cp%CodePage%
Cp437

If you put this into a bat file, replace %I with %%I in the first command.
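
If changing the Java side is an option, the same idea can be sketched there without depending on token positions. The example below pulls the first run of digits out of the chcp output, on the assumption that the label text around the number may differ between Windows language versions while the number itself is always present. The class ChcpProbe and its methods are hypothetical names, and queryConsoleCodePage only makes sense on Windows with a console attached.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; assumes chcp prints the code page as its first run of digits.
class ChcpProbe {
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Returns the first run of digits in the text, or -1 if there is none.
    public static int parseCodePage(String chcpOutput) {
        Matcher m = DIGITS.matcher(chcpOutput);
        return m.find() ? Integer.parseInt(m.group()) : -1;
    }

    // Runs "cmd /c chcp" and parses whatever it prints (Windows only).
    public static int queryConsoleCodePage() throws IOException, InterruptedException {
        Process p = new ProcessBuilder("cmd", "/c", "chcp")
                .redirectErrorStream(true)
                .start();
        StringBuilder out = new StringBuilder();
        // ISO-8859-1 maps every byte to a char, so the ASCII digits survive
        // regardless of which code page the label text is written in.
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream(), StandardCharsets.ISO_8859_1))) {
            String line;
            while ((line = r.readLine()) != null) {
                out.append(line).append('\n');
            }
        }
        p.waitFor();
        return parseCodePage(out.toString());
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parseCodePage("Active code page: 850")); // prints 850
        if (System.getProperty("os.name").startsWith("Windows")) {
            System.out.println("Cp" + queryConsoleCodePage());
        }
    }
}
```

The resulting "Cp" name could then be passed to an InputStreamReader constructor rather than to -Dfile.encoding.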

G_A
This seemed promising, but it relies on certain assumptions about the format of the `chcp` output, which can differ on non-English systems. In German, for instance, the code page is in token 3, and there's a period after the number: "Aktive Codepage: 850."
Rob Kennedy
+4  A: 

The default encoding used by cmd.exe is Cp850 (or whatever "OEM" CP is native to the OS); the system encoding is Cp1252 (or whatever "ANSI" CP is native to the OS). Gory details here. One way to discover the console encoding would be to do it via native code (see GetConsoleOutputCP for current console encoding; see GetACP for default "ANSI" encoding; etc.).

Altering the encoding via the -D switch is going to affect all your default encoding mechanisms, including redirected stdout/stdin/stderr. It is not an ideal solution.
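
One way to avoid that side effect, sketched below, is to read the socket's charset from a dedicated system property and leave file.encoding alone. The property name app.socket.encoding and the class SocketCharset are invented for this example; nothing in the JDK defines them. The sketch falls back to the platform default when the property is missing or names an unknown charset.

```java
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

// Hypothetical helper; "app.socket.encoding" is an application-defined property.
class SocketCharset {
    static final String PROPERTY = "app.socket.encoding";

    // Resolves the charset for socket data, falling back to the JVM default.
    public static Charset resolve() {
        String name = System.getProperty(PROPERTY);
        if (name == null) {
            return Charset.defaultCharset();
        }
        try {
            return Charset.forName(name);
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return Charset.defaultCharset();
        }
    }

    // Wraps a stream (for example Sock.getInputStream()) with the resolved charset.
    public static Reader open(InputStream in) {
        return new InputStreamReader(in, resolve());
    }

    public static void main(String[] args) {
        System.out.println("Socket charset = " + resolve().name());
    }
}
```

The batch file would then pass something like -Dapp.socket.encoding=Cp850 instead of -Dfile.encoding=Cp850, so redirected stdin, stdout, and stderr keep their usual defaults.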

I came up with this WSH script that can set the console to the system ANSI codepage, but haven't figured out how to programmatically switch to a TrueType font.

'file:  setacp.vbs
'usage: cscript /Nologo setacp.vbs
Set objShell = CreateObject("WScript.Shell")
'replace ACP (ANSI) with OEMCP for default console CP
cp = objShell.RegRead("HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet" & _
                              "\Control\Nls\CodePage\ACP")
WScript.Echo "Switching console code page to " & cp
objShell.Exec "chcp.com " & cp

(This is my first WSH script, so it may be flawed - I'm not familiar with registry read permissions.)

Using a TrueType font is another requirement for using ANSI/Unicode with cmd.exe. I'm going to look at a programmatic switch to a better font when time permits.

McDowell