views:

24804

answers:

9

How do I properly set the default character encoding used by the JVM (1.5.x) programmatically?

I have read that -Dfile.encoding=whatever used to be the way to go for older JVMs... I don't have that luxury, for reasons I won't get into.

I have tried:

System.setProperty("file.encoding", "UTF8");

And the property gets set, but it doesn't seem to cause the final getBytes call below to use UTF8:

    System.setProperty("file.encoding", "UTF-8");

    byte[] inbytes = new byte[1024];

    FileInputStream fis = new FileInputStream("response.txt");
    int count = fis.read(inbytes);
    FileOutputStream fos = new FileOutputStream("response-2.txt");
    String in = new String(inbytes, 0, count, "UTF-8");
    fos.write(in.getBytes()); // still uses the default charset, not UTF-8

Any help appreciated...

+4  A: 

I can't answer your original question but I would like to offer you some advice -- don't depend on the JVM's default encoding. It's always best to explicitly specify the desired encoding (i.e. "UTF-8") in your code. That way, you know it will work even across different systems and JVM configurations.
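A minimal sketch of that advice, assuming UTF-8 is the desired encoding (the class name and sample string are just illustrations): name the charset at every encode/decode call, so the result is the same regardless of the platform default.

```java
import java.io.UnsupportedEncodingException;

public class ExplicitEncoding {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String text = "héllo";
        // Name the charset at both the encode and decode call; never
        // rely on the platform default. The bytes produced are the same
        // on every system and JVM configuration.
        byte[] utf8 = text.getBytes("UTF-8");
        String roundTrip = new String(utf8, "UTF-8");
        System.out.println(roundTrip.equals(text)); // prints: true
    }
}
```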

Marc Novakowski
Except, of course, if you're writing a desktop app and processing some user-specified text that does not have any encoding metadata - then the platform default encoding is your best guess as to what the user might be using.
Michael Borgwardt
That's a good point - I guess I'm used to writing server-side Java code. :)
Marc Novakowski
+3  A: 

Especially since you seem unable to affect the application deployment, let alone the platform, I think a better approach than setting the platform's default character set is to call the much safer String.getBytes("charsetName"). That way your application does not depend on things beyond its control.

I personally feel that String.getBytes() should be deprecated, as it has caused serious problems in a number of cases I have seen, where the developer did not account for the default charset possibly changing.

Dov Wasserman
A: 

Excellent comments guys - and things I was already thinking myself. Unfortunately there is an underlying String.getBytes() call that I have no control over. The only way I currently see to get around it is to set the default encoding programmatically.

Any other suggestions?

Every workaround for charset-related issues in Java causes data loss and a huge performance hit. I'd suggest you do your best to fix or take control of the underlying getBytes call. The implementation of String is quite JRE vendor- and version-specific; you will not be able to ensure the same behavior on all JREs.
Dennis Cheung
+11  A: 

Unfortunately, the file.encoding property has to be specified as the JVM starts up; by the time your main method is entered, the character encoding used by String.getBytes() and the default constructors of InputStreamReader and OutputStreamWriter has been permanently cached.

Charset.defaultCharset() will reflect changes to the file.encoding property, but most of the code in the core Java libraries that needs to determine the default character encoding does not use this mechanism.

When you are encoding or decoding, you can query the file.encoding property or Charset.defaultCharset() to find the current default encoding, and use the appropriate method or constructor overload to specify it.
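As a sketch of that suggestion (the class name and sample string are illustrations): query the default charset once, then pass its name explicitly at every encode/decode site so the choice is visible in the code.

```java
import java.nio.charset.Charset;

public class DefaultCharsetDemo {
    public static void main(String[] args) throws Exception {
        // Query the current default once...
        String charsetName = Charset.defaultCharset().name();

        // ...then name it explicitly in every overload that would
        // otherwise fall back on the cached platform default.
        String text = "some text";
        byte[] bytes = text.getBytes(charsetName);
        String back = new String(bytes, charsetName);
        System.out.println(back.equals(text)); // prints: true
    }
}
```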

erickson
+1  A: 

Are you sure you cannot set the encoding at JVM start-up time?

If not, can you subclass the offending class to fix the call to getBytes()? Or maybe re-encode the byte array at a convenient location?
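A sketch of the re-encoding idea, assuming the offending bytes were produced by an uncontrollable String.getBytes() call using the platform default charset (the toUtf8 helper name is hypothetical):

```java
import java.nio.charset.Charset;

public class ReEncode {
    // Decode with the same default charset that produced the bytes,
    // then re-encode as UTF-8. This is lossless only for characters
    // the default charset can actually represent.
    static byte[] toUtf8(byte[] defaultBytes) throws Exception {
        String decoded = new String(defaultBytes, Charset.defaultCharset().name());
        return decoded.getBytes("UTF-8");
    }

    public static void main(String[] args) throws Exception {
        byte[] defaultBytes = "plain ascii".getBytes(); // default charset
        byte[] utf8 = toUtf8(defaultBytes);
        System.out.println(new String(utf8, "UTF-8")); // prints: plain ascii
    }
}
```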

Thilo
A: 

It's not clear what you do and don't have control over at this point. If you can interpose a different OutputStream class on the destination file, you could use a subtype of OutputStream which converts Strings to bytes under a charset you define, say UTF-8 by default. If modified UTF-8 is sufficient for your needs, you can use DataOutputStream.writeUTF(String):

byte[] inbytes = new byte[1024];
FileInputStream fis = new FileInputStream("response.txt");
int count = fis.read(inbytes);
String in = new String(inbytes, 0, count, "UTF-8");
DataOutputStream out = new DataOutputStream(new FileOutputStream("response-2.txt"));
out.writeUTF(in); // no getBytes() here
out.close();
fis.close();

If this approach is not feasible, it may help if you clarify here exactly what you can and can't control in terms of data flow and execution environment (though I know that's sometimes easier said than determined). Good luck.

Dov Wasserman
DataInputStream and DataOutputStream are special-purpose classes that should never be used with plain text files. The modified UTF-8 they employ is not compatible with real UTF-8. Besides, if the OP could use your solution, he could also use the right tool for this job: an OutputStreamWriter.
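Following that comment, a sketch of the same copy done with OutputStreamWriter and an explicit charset on both sides (the copyUtf8 helper name and buffer size are assumptions; the file names come from the question):

```java
import java.io.*;

public class Utf8Copy {
    // Copy a text file, naming UTF-8 explicitly for both the read and
    // the write, so neither side falls back on the platform default.
    static void copyUtf8(File src, File dst) throws IOException {
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(src), "UTF-8"));
        Writer writer = new OutputStreamWriter(new FileOutputStream(dst), "UTF-8");
        try {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                writer.write(buf, 0, n);
            }
        } finally {
            reader.close();
            writer.close();
        }
    }

    public static void main(String[] args) throws IOException {
        copyUtf8(new File("response.txt"), new File("response-2.txt"));
    }
}
```

Unlike writeUTF, this produces real UTF-8 output with no length prefix, so the result is an ordinary text file.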
Alan Moore
+6  A: 

From the JVM™ Tool Interface documentation…

Since the command-line cannot always be accessed or modified, for example in embedded VMs or simply VMs launched deep within scripts, a JAVA_TOOL_OPTIONS variable is provided so that agents may be launched in these cases.

By setting the (Windows) environment variable JAVA_TOOL_OPTIONS to -Dfile.encoding=UTF8, the (Java) System property will be set automatically every time a JVM is started. You will know that the parameter has been picked up because the following message will be posted to System.err:

Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF8
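As a quick sanity check (the class name EncodingCheck is hypothetical), a small program run with that environment variable set should report the new value:

```java
import java.nio.charset.Charset;

public class EncodingCheck {
    public static void main(String[] args) {
        // Run with JAVA_TOOL_OPTIONS=-Dfile.encoding=UTF8 set in the
        // environment to verify the option was picked up at start-up.
        System.out.println("file.encoding  = " + System.getProperty("file.encoding"));
        System.out.println("defaultCharset = " + Charset.defaultCharset().name());
    }
}
```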

Edward Grech
A: 

I had the same problem reading UTF-8 files.
I did the following, and UTF-8 files are read properly:

new BufferedReader(
    new InputStreamReader(
        new FileInputStream(file), "UTF-8"));
Shouldn't UTF8 be in "quotes"?
Michael Myers
@mmyers: Yes, of course.
sleske
The bracketing is also incorrect... it's pseudocode :-).
sleske
A: 

-Dfile.encoding is deprecated. You should set the underlying system locale!

Can you give a link to support this assertion? I can't find one despite looking.

Guest
http://bugs.sun.com/view_bug.do?bug_id=4163515
McDowell