views:

24804

answers:

9

How do I properly set the default character encoding used by the JVM (1.5.x) programmatically?

I have read that -Dfile.encoding=whatever used to be the way to go for older JVMs... I don't have that luxury, for reasons I won't get into.

I have tried:

System.setProperty("file.encoding", "UTF8");

And the property gets set, but it doesn't seem to cause the final getBytes call below to use UTF8:

    System.setProperty("file.encoding", "UTF-8");

    byte[] inbytes = new byte[1024];

    FileInputStream fis = new FileInputStream("response.txt");
    int count = fis.read(inbytes);
    FileOutputStream fos = new FileOutputStream("response-2.txt");
    String in = new String(inbytes, 0, count, "UTF-8");
    fos.write(in.getBytes()); // still uses the default charset, not UTF-8

Any help appreciated...

+4  A: 

I can't answer your original question but I would like to offer you some advice -- don't depend on the JVM's default encoding. It's always best to explicitly specify the desired encoding (i.e. "UTF-8") in your code. That way, you know it will work even across different systems and JVM configurations.
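A minimal sketch of that advice, assuming UTF-8 is the desired encoding (the class name and sample string are just illustrations): name the charset at every encode/decode call, so the result is the same regardless of the platform default.

```java
import java.io.UnsupportedEncodingException;

public class ExplicitEncoding {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String text = "héllo";
        // Name the charset at both the encode and decode call; never
        // rely on the platform default. The bytes produced are the same
        // on every system and JVM configuration.
        byte[] utf8 = text.getBytes("UTF-8");
        String roundTrip = new String(utf8, "UTF-8");
        System.out.println(roundTrip.equals(text)); // prints: true
    }
}
```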

Marc Novakowski
Except, of course, if you're writing a desktop app and processing some user-specified text that does not have any encoding metadata - then the platform default encoding is your best guess as to what the user might be using.
Michael Borgwardt
That's a good point - I guess I'm used to writing server-side Java code. :)
Marc Novakowski
+3  A: 

Especially since you seem unable to affect the application deployment, let alone the platform, I think a better approach than setting the platform's default character set is to call the much safer String.getBytes("charsetName"). That way your application does not depend on things beyond its control.

I personally feel that String.getBytes() should be deprecated, as it has caused serious problems in a number of cases I have seen, where the developer did not account for the default charset possibly changing.

Dov Wasserman
A: 

Excellent comments guys - and things I was already thinking myself. Unfortunately there is an underlying String.getBytes() call that I have no control over. The only way I currently see to get around it is to set the default encoding programmatically.

Any other suggestions?

Every workaround for charset-related issues in Java causes data loss and a huge performance hit. I'd suggest you do your best to fix or take control of the underlying getBytes call. The implementation of String is quite JRE vendor- and version-specific; you will not be able to ensure the same behavior on all JREs.
Dennis Cheung
+11  A: 

Unfortunately, the file.encoding property has to be specified as the JVM starts up; by the time your main method is entered, the character encoding used by String.getBytes() and the default constructors of InputStreamReader and OutputStreamWriter has been permanently cached.

Charset.defaultCharset() will reflect changes to the file.encoding property, but most of the code in the core Java libraries that needs to determine the default character encoding does not use this mechanism.

When you are encoding or decoding, you can query the file.encoding property or Charset.defaultCharset() to find the current default encoding, and use the appropriate method or constructor overload to specify it.
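As a sketch of that suggestion (the class name and sample string are illustrations): query the default charset once, then pass its name explicitly at every encode/decode site so the choice is visible in the code.

```java
import java.nio.charset.Charset;

public class DefaultCharsetDemo {
    public static void main(String[] args) throws Exception {
        // Query the current default once...
        String charsetName = Charset.defaultCharset().name();

        // ...then name it explicitly in every overload that would
        // otherwise fall back on the cached platform default.
        String text = "some text";
        byte[] bytes = text.getBytes(charsetName);
        String back = new String(bytes, charsetName);
        System.out.println(back.equals(text)); // prints: true
    }
}
```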

erickson
+1  A: 

Are you sure you cannot set the encoding at JVM start-up time?

If not, can you subclass the offending class to fix the call to getBytes()? Or maybe re-encode the byte array at a convenient location?
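A sketch of the re-encoding idea, assuming the offending bytes were produced by an uncontrollable String.getBytes() call using the platform default charset (the toUtf8 helper name is hypothetical):

```java
import java.nio.charset.Charset;

public class ReEncode {
    // Decode with the same default charset that produced the bytes,
    // then re-encode as UTF-8. This is lossless only for characters
    // the default charset can actually represent.
    static byte[] toUtf8(byte[] defaultBytes) throws Exception {
        String decoded = new String(defaultBytes, Charset.defaultCharset().name());
        return decoded.getBytes("UTF-8");
    }

    public static void main(String[] args) throws Exception {
        byte[] defaultBytes = "plain ascii".getBytes(); // default charset
        byte[] utf8 = toUtf8(defaultBytes);
        System.out.println(new String(utf8, "UTF-8")); // prints: plain ascii
    }
}
```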

Thilo
A: 

It's not clear what you do and don't have control over at this point. If you can interpose a different OutputStream class on the destination file, you could use a subtype of OutputStream which converts Strings to bytes under a charset you define, say UTF-8 by default. If modified UTF-8 is sufficient for your needs, you can use DataOutputStream.writeUTF(String):

byte[] inbytes = new byte[1024];
FileInputStream fis = new FileInputStream("response.txt");
int count = fis.read(inbytes);
String in = new String(inbytes, 0, count, "UTF-8");
DataOutputStream out = new DataOutputStream(new FileOutputStream("response-2.txt"));
out.writeUTF(in); // no getBytes() here
out.close();
fis.close();

If this approach is not feasible, it may help if you clarify here exactly what you can and can't control in terms of data flow and execution environment (though I know that's sometimes easier said than determined). Good luck.

Dov Wasserman
DataInputStream and DataOutputStream are special-purpose classes that should never be used with plain text files. The modified UTF-8 they employ is not compatible with real UTF-8. Besides, if the OP could use your solution, he could also use the right tool for this job: an OutputStreamWriter.
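Following that comment, a sketch of the same copy done with OutputStreamWriter and an explicit charset on both sides (the copyUtf8 helper name and buffer size are assumptions; the file names come from the question):

```java
import java.io.*;

public class Utf8Copy {
    // Copy a text file, naming UTF-8 explicitly for both the read and
    // the write, so neither side falls back on the platform default.
    static void copyUtf8(File src, File dst) throws IOException {
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(src), "UTF-8"));
        Writer writer = new OutputStreamWriter(new FileOutputStream(dst), "UTF-8");
        try {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                writer.write(buf, 0, n);
            }
        } finally {
            reader.close();
            writer.close();
        }
    }

    public static void main(String[] args) throws IOException {
        copyUtf8(new File("response.txt"), new File("response-2.txt"));
    }
}
```

Unlike writeUTF, this produces real UTF-8 output with no length prefix, so the result is an ordinary text file.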
Alan Moore
+6  A: 

From the JVM™ Tool Interface documentation…

Since the command-line cannot always be accessed or modified, for example in embedded VMs or simply VMs launched deep within scripts, a JAVA_TOOL_OPTIONS variable is provided so that agents may be launched in these cases.

By setting the (Windows) environment variable JAVA_TOOL_OPTIONS to -Dfile.encoding=UTF8, the (Java) System property will be set automatically every time a JVM is started. You will know that the parameter has been picked up because the following message will be posted to System.err:

Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF8
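As a quick sanity check (the class name EncodingCheck is hypothetical), a small program run with that environment variable set should report the new value:

```java
import java.nio.charset.Charset;

public class EncodingCheck {
    public static void main(String[] args) {
        // Run with JAVA_TOOL_OPTIONS=-Dfile.encoding=UTF8 set in the
        // environment to verify the option was picked up at start-up.
        System.out.println("file.encoding  = " + System.getProperty("file.encoding"));
        System.out.println("defaultCharset = " + Charset.defaultCharset().name());
    }
}
```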

Edward Grech
A: 

I had the same problem reading UTF-8 files.
I did the following, and UTF-8 files are read properly:

new BufferedReader(
    new InputStreamReader(
        new FileInputStream(file), "UTF-8"));
Shouldn't UTF8 be in "quotes"?
Michael Myers
@mmyers: Yes, of course.
sleske
The bracketing is also incorrect... it's pseudocode :-).
sleske
A: 

-Dfile.encoding is deprecated. You should set the underlying system locale!

Can you give a link to support this assertion? I can't find one despite looking.

Guest
http://bugs.sun.com/view_bug.do?bug_id=4163515
McDowell