The obvious answer is to use Charset.defaultCharset(), but we recently found out that this might not be the right answer. I was told that its result differs from the real default charset used by the java.io classes on several occasions. It looks like Java keeps two separate default charsets. Does anyone have any insight into this issue?

We were able to reproduce one failure case. It's arguably user error, but it may still expose the root cause of the other problems. Here is the code:

import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;

public class CharSetTest {

    public static void main(String[] args) {
        System.out.println("Default Charset=" + Charset.defaultCharset());
        // Changing file.encoding at runtime is not supported; done here only to reproduce the issue.
        System.setProperty("file.encoding", "Latin-1");
        System.out.println("file.encoding=" + System.getProperty("file.encoding"));
        System.out.println("Default Charset=" + Charset.defaultCharset());
        System.out.println("Default Charset in Use=" + getDefaultCharSet());
    }

    // Returns the encoding name that OutputStreamWriter picks when none is given.
    private static String getDefaultCharSet() {
        OutputStreamWriter writer = new OutputStreamWriter(new ByteArrayOutputStream());
        String enc = writer.getEncoding();
        return enc;
    }
}

Our server requires a default charset of Latin-1 to deal with mixed encodings (ANSI/Latin-1/UTF-8) in a legacy protocol, so all our servers run with this JVM parameter:

-Dfile.encoding=ISO-8859-1

Here is the result on Java 5:

Default Charset=ISO-8859-1
file.encoding=Latin-1
Default Charset=UTF-8
Default Charset in Use=ISO8859_1

Someone tried to change the encoding at runtime by setting file.encoding in the code. We all know that doesn't work. However, it apparently throws off defaultCharset() without affecting the real default charset used by OutputStreamWriter.

Is this a bug or feature?

EDIT: The accepted answer shows the root cause of the issue. Basically, you can't trust defaultCharset() in Java 5, because it does not necessarily return the default encoding used by the I/O classes. It looks like Java 6 corrects this issue.

+5  A: 

This is really strange... Once set, the default Charset is cached and it isn't changed while the class is in memory. Setting the "file.encoding" property with System.setProperty("file.encoding", "Latin-1"); does nothing. Every time Charset.defaultCharset() is called it returns the cached charset.

Here are my results:

Default Charset=ISO-8859-1
file.encoding=Latin-1
Default Charset=ISO-8859-1
Default Charset in Use=ISO8859_1

I'm using JVM 1.6 though.

(update)

Ok. I did reproduce your bug with JVM 1.5.

Looking at the source code of 1.5, the cached default charset is never assigned. I don't know whether this is a bug, but 1.6 changes the implementation so that the cached charset is set and used:

JVM 1.5:

public static Charset defaultCharset() {
    synchronized (Charset.class) {
        if (defaultCharset == null) {
            java.security.PrivilegedAction pa =
                new GetPropertyAction("file.encoding");
            String csn = (String)AccessController.doPrivileged(pa);
            Charset cs = lookup(csn);
            if (cs != null)
                return cs;
            return forName("UTF-8");
        }
        return defaultCharset;
    }
}

JVM 1.6:

public static Charset defaultCharset() {
    if (defaultCharset == null) {
        synchronized (Charset.class) {
            java.security.PrivilegedAction pa =
                new GetPropertyAction("file.encoding");
            String csn = (String)AccessController.doPrivileged(pa);
            Charset cs = lookup(csn);
            if (cs != null)
                defaultCharset = cs;
            else
                defaultCharset = forName("UTF-8");
        }
    }
    return defaultCharset;
}

When you set file.encoding to Latin-1, the next call to Charset.defaultCharset() finds that the cached default charset isn't set, so it tries to look up a charset named Latin-1. That name isn't found, because it isn't a recognized charset name, so the method returns the fallback, UTF-8.
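A minimal sketch of that lookup-with-fallback behaviour, not the JDK's actual code. The class and method names (CharsetLookup, resolveOrUtf8) are hypothetical; Charset.isSupported, Charset.forName and StandardCharsets (Java 7+) are the only real APIs used. It should show that a recognized alias such as latin1 resolves to ISO-8859-1, while an unrecognized name silently becomes UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;

public class CharsetLookup {

    // Hypothetical helper: resolve a charset name, falling back to UTF-8
    // when the name is null, syntactically illegal, or not supported.
    public static Charset resolveOrUtf8(String name) {
        if (name != null) {
            try {
                if (Charset.isSupported(name)) {
                    return Charset.forName(name);
                }
            } catch (IllegalCharsetNameException e) {
                // Illegal syntax: fall through to the UTF-8 fallback.
            }
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        // "Latin-1" is the name used in the question; what it resolves to
        // depends on the aliases your JDK registers.
        for (String name : new String[] {"ISO-8859-1", "latin1", "Latin-1"}) {
            System.out.println(name + " -> " + resolveOrUtf8(name));
        }
    }
}
```

Running main on your own JVM is a quick way to check which spellings it accepts before putting one in a configuration file.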

As for why IO classes such as OutputStreamWriter return an unexpected result, the implementation of sun.nio.cs.StreamEncoder (which is used by these IO classes) also differs between JVM 1.5 and JVM 1.6. The JVM 1.6 implementation relies on Charset.defaultCharset() to get the default encoding when none is provided to the IO classes. The JVM 1.5 implementation uses a different method, Converters.getDefaultEncodingName(), to get the default charset. That method uses its own cache of the default charset, which is set during JVM initialization:

JVM 1.6:

   public static StreamEncoder forOutputStreamWriter(OutputStream out,
                                                     Object lock,
                                                     String charsetName)
       throws UnsupportedEncodingException
   {
       String csn = charsetName;
       if (csn == null)
           csn = Charset.defaultCharset().name();
       try {
           if (Charset.isSupported(csn))
               return new StreamEncoder(out, lock, Charset.forName(csn));
       } catch (IllegalCharsetNameException x) { }
       throw new UnsupportedEncodingException (csn);
   }

JVM 1.5:

public static StreamEncoder forOutputStreamWriter(OutputStream out,
                                                  Object lock,
                                                  String charsetName)
    throws UnsupportedEncodingException
{
    String csn = charsetName;
    if (csn == null)
        csn = Converters.getDefaultEncodingName();
    if (!Converters.isCached(Converters.CHAR_TO_BYTE, csn)) {
        try {
            if (Charset.isSupported(csn))
                return new CharsetSE(out, lock, Charset.forName(csn));
        } catch (IllegalCharsetNameException x) { }
    }
    return new ConverterSE(out, lock, csn);
}

But I agree with the comments. You shouldn't rely on this property. It's an implementation detail.

bruno conde
To reproduce this error, you must be on Java 5 and your JRE default encoding must be UTF-8.
ZZ Coder
This is writing to the implementation, not the abstraction. If you rely on undocumented stuff, don't be surprised if your code breaks when you upgrade to a newer version of the platform.
McDowell
+3  A: 

Is this a bug or feature?

Looks like undefined behaviour. I know that, in practice, you can change the default encoding using a command-line property, but I don't think what happens when you do this is defined.

Bug ID: 4153515 on problems setting this property:

This is not a bug. The "file.encoding" property is not required by the J2SE platform specification; it's an internal detail of Sun's implementations and should not be examined or modified by user code. It's also intended to be read-only; it's technically impossible to support the setting of this property to arbitrary values on the command line or at any other time during program execution.

The preferred way to change the default encoding used by the VM and the runtime system is to change the locale of the underlying platform before starting your Java program.

I cringe when I see people setting the encoding on the command line - you don't know what code that is going to affect.

If you do not want to use the default encoding, set the encoding you do want explicitly via the appropriate method/constructor.
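As an illustration of that advice, here is a small sketch that passes the charset to the OutputStreamWriter constructor instead of relying on file.encoding. The class and method names (ExplicitCharsetWrite, encodeLatin1) are made up for the example, and StandardCharsets and UncheckedIOException assume Java 8 or later.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetWrite {

    // Encode text with an explicitly chosen charset, independent of
    // file.encoding and of the platform locale.
    public static byte[] encodeLatin1(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.ISO_8859_1)) {
            writer.write(text);
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrown unchecked.
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] bytes = encodeLatin1("caf\u00e9");
        System.out.println(bytes.length); // 4: one byte per character in ISO-8859-1
    }
}
```

The same pattern applies to InputStreamReader, String.getBytes and the String constructors, all of which have overloads that take a charset.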

McDowell
+2  A: 

First, Latin-1 is the same as ISO-8859-1, so the default was already OK for you. Right?

You successfully set the encoding to ISO-8859-1 with your command-line parameter. You also set it programmatically to "Latin-1", but that's not a recognized file encoding name in Java. See http://java.sun.com/javase/6/docs/technotes/guides/intl/encoding.doc.html

When you do that, it looks like Charset falls back to UTF-8, judging from the source. That at least explains most of the behavior.

I don't know why OutputStreamWriter shows ISO8859_1. It delegates to closed-source sun.misc.* classes. I'm guessing it isn't quite dealing with encoding via the same mechanism, which is weird.

But of course you should always be specifying what encoding you mean in this code. I'd never rely on the platform default.
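For the mixed-encoding legacy protocol mentioned in the question, one possible approach (an assumption about the use case, which the question does not spell out) is to decode with ISO-8859-1 explicitly. That charset maps every byte value to exactly one character, so raw bytes survive a decode/encode round trip unchanged. The class name Latin1RoundTrip and its methods are hypothetical; StandardCharsets requires Java 7 or later.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Latin1RoundTrip {

    // Decode raw protocol bytes without depending on the platform default.
    public static String decode(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // Encode back to bytes with the same explicit charset.
    public static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Build an array holding every possible byte value.
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }
        byte[] back = encode(decode(all));
        System.out.println(Arrays.equals(all, back)); // true
    }
}
```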

Sean Owen
+2  A: 

The behaviour is not really that strange. Looking into the implementation of the classes, it is caused by:

  • Charset.defaultCharset() does not cache the character set it determines in Java 5.
  • Setting the system property "file.encoding" and invoking Charset.defaultCharset() again causes the system property to be evaluated a second time. No character set named "Latin-1" is found, so Charset.defaultCharset() falls back to "UTF-8".
  • The OutputStreamWriter, however, caches the default character set and is probably already used during VM initialization, so its default character set diverges from Charset.defaultCharset() if the system property "file.encoding" has been changed at runtime.
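The three points above can be imitated with a toy class. This is only a sketch of the caching difference, not the JDK code: the property name demo.encoding, the class CachedVsUncached and the invalid name x-not-a-real-charset are all made up, so the demo never touches the real file.encoding setting.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;

public class CachedVsUncached {

    // Hypothetical property, used in place of file.encoding.
    static final String PROP = "demo.encoding";

    // Set once and then reused, like the cache on the I/O side.
    private static Charset cached;

    // Re-reads the property on every call, like defaultCharset() in Java 5.
    public static Charset uncachedDefault() {
        String name = System.getProperty(PROP);
        try {
            if (name != null && Charset.isSupported(name)) {
                return Charset.forName(name);
            }
        } catch (IllegalCharsetNameException e) {
            // Illegal syntax: fall through to the UTF-8 fallback.
        }
        return StandardCharsets.UTF_8;
    }

    // Resolves the property on first use and keeps the result afterwards.
    public static synchronized Charset cachedDefault() {
        if (cached == null) {
            cached = uncachedDefault();
        }
        return cached;
    }

    public static void main(String[] args) {
        System.setProperty(PROP, "ISO-8859-1");
        System.out.println("cached   = " + cachedDefault());
        System.out.println("uncached = " + uncachedDefault());

        // Changing the property later only affects the uncached lookup.
        System.setProperty(PROP, "x-not-a-real-charset");
        System.out.println("cached   = " + cachedDefault());
        System.out.println("uncached = " + uncachedDefault());
    }
}
```

After the second setProperty call the two methods disagree, which is the same shape as the Java 5 output in the question.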

As already pointed out, it is not documented how the VM must behave in such a situation. The Charset.defaultCharset() API documentation is not very precise on how the default character set is determined, only mentioning that it is usually done on VM startup, based on factors like the OS default character set or default locale.

jarnbjo