tags:

views:

210

answers:

4

Hi, I recently discovered that relying on the default encoding of the JVM causes bugs. I should explicitly use a specific encoding, e.g. UTF-8, when working with String, InputStreams, etc. I have a huge codebase to scan to ensure this. Could somebody suggest a simpler way to check this than searching the whole codebase?

Thanks Nayn

+4  A: 
System.getProperty("file.encoding")

returns the JVM's default encoding for I/O operations.

You can set it by passing -Dfile.encoding=utf-8 on the command line.
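For illustration, a minimal sketch of how to inspect the default (the class name here is just for the example; `Charset.defaultCharset()` is the supported API for this since Java 5):

```java
import java.nio.charset.Charset;

public class DefaultCharsetDemo {
    public static void main(String[] args) {
        // The file.encoding system property reflects the platform default
        System.out.println(System.getProperty("file.encoding"));
        // Charset.defaultCharset() is the supported way to query it
        System.out.println(Charset.defaultCharset().name());
    }
}
```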

Bozho
Please see the thread that I mentioned in the comment. The above property is an internal implementation detail of a specific JVM implementation; its use varies between Java 1.5 and 1.6.
Nayn
it isn't. Read the accepted answer fully :) this is a standard setting that determines the default charset.
Bozho
Setting a property like this to correct code is an outrageous hack.
Tom Hawtin - tackline
@Tom I don't share your opinion on that. While it is preferable not to rely on this (and I never do), it is legitimate to use VM parameters.
Bozho
I have to admit that I couldn't solve this problem without setting the system property -Dfile.encoding=utf-8. I tried every possible approach to specify the encoding explicitly wherever possible.
Nayn
+3  A: 

Not a direct answer, but to ease the job it's good to know that in any decent IDE you can just search for occurrences of InputStreamReader, OutputStreamWriter, String#getBytes(), String(byte[]), Properties#load(), URLEncoder#encode(), URLDecoder#decode() and the like, wherein you can pass the charset, and then update accordingly. You'd also want to search for FileReader and FileWriter and replace them with the first two classes mentioned. True, it's a tedious task, but worth it, and I'd prefer it over relying on environmental specifics.

In Eclipse for example, select the project(s) of interest, hit Ctrl+H, switch to tab Java Search, enter for example InputStreamReader, tick the Search For option Constructor, choose Sources as the only Search In option, and execute the search.
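As a sketch of what such a replacement looks like (the class name and file are illustrative; the charset name is passed as a string, as was usual pre-Java 7):

```java
import java.io.*;

public class ExplicitCharsetDemo {
    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("charset-demo", ".txt");
        f.deleteOnExit();

        // Instead of: new FileWriter(f) -- which silently uses the platform default
        Writer out = new OutputStreamWriter(new FileOutputStream(f), "UTF-8");
        try {
            out.write("h\u00e9llo");
        } finally {
            out.close();
        }

        // Instead of: new FileReader(f)
        Reader in = new InputStreamReader(new FileInputStream(f), "UTF-8");
        try {
            char[] buf = new char[16];
            int n = in.read(buf);
            System.out.println(new String(buf, 0, n)); // prints: héllo
        } finally {
            in.close();
        }
    }
}
```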

BalusC
+1 good to mention the `InputStreamReader` and the likes.
Bozho
`FileReader` is the baddy. I don't know of a comprehensive list of these dangerous API methods/constructors.
Tom Hawtin - tackline
A: 

relying on default encoding of JVM causes bugs

Indeed, one should always specify the charset when encoding/decoding.

If you are satisfied with a single global default charset for all your encoding/decoding (not always enough), you can live with Bozho's answer: specify a known, fixed default in your JVM arguments or in some static initializer.

But it's good practice to search for all implicit charset uses in your code and replace them with an explicit charset. Some typical methods/classes to look at: FileWriter, FileReader, InputStreamReader, OutputStreamWriter, String#getBytes(), String(byte[]).
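For example, the String methods behave quite differently with and without an explicit charset (the literal below is just for illustration):

```java
import java.io.UnsupportedEncodingException;

public class StringCharsetDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String s = "caf\u00e9";
        // Implicit and platform-dependent: s.getBytes()
        // Explicit:
        byte[] utf8 = s.getBytes("UTF-8");        // 5 bytes: é takes two bytes in UTF-8
        byte[] latin1 = s.getBytes("ISO-8859-1"); // 4 bytes: é is a single byte
        System.out.println(utf8.length + " " + latin1.length); // prints: 5 4
        // A round trip must decode with the same charset it encoded with
        System.out.println(new String(utf8, "UTF-8"));
    }
}
```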

leonbloy
Noted should be that `FileWriter` and `FileReader` can't be changed to take a specified encoding. They should be replaced with `OutputStreamWriter` and `InputStreamReader` respectively.
BalusC
A: 

If the file is manipulated by native tools on the server, you may want to set the encoding to System.getProperty("file.encoding"). I have run into bugs both ways.

Best practice is to know which character set is used, and to set it explicitly. Also, if the file is used to interface with another application, you should agree on the character set used. This may be a Windows code page or one of the UTF formats.
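When the charset is agreed on by name (say, a Windows code page coming from a config file), `java.nio.charset.Charset` can look it up and validate it; this is a small sketch, with the charset names chosen only for the example:

```java
import java.nio.charset.Charset;

public class CodePageDemo {
    public static void main(String[] args) {
        // windows-1252 is a common code page for files produced by Windows tools
        Charset cp1252 = Charset.forName("windows-1252");
        System.out.println(cp1252.name());
        // Validate a name before use when it comes from external configuration
        System.out.println(Charset.isSupported("UTF-16LE")); // prints: true
    }
}
```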

BillThor