ansaurus

Question

How to Find Default Charset/Encoding in Java?

Answer 1

+5 A:

This is really strange... Once set, the default Charset is cached and it isn't changed while the class is in memory. Setting the "file.encoding" property with System.setProperty("file.encoding", "Latin-1"); does nothing. Every time Charset.defaultCharset() is called it returns the cached charset.

Here are my results:

Default Charset=ISO-8859-1
file.encoding=Latin-1
Default Charset=ISO-8859-1
Default Charset in Use=ISO8859_1

I'm using JVM 1.6 though.

(update)

Ok. I did reproduce your bug with JVM 1.5.

Looking at the source code of 1.5, the cached default charset isn't being set. I don't know if this is a bug or not but 1.6 changes this implementation and uses the cached charset:

JVM 1.5:

public static Charset defaultCharset() {
synchronized (Charset.class) {
    if (defaultCharset == null) {
 java.security.PrivilegedAction pa =
     new GetPropertyAction("file.encoding");
 String csn = (String)AccessController.doPrivileged(pa);
 Charset cs = lookup(csn);
 if (cs != null)
     return cs;
 return forName("UTF-8");
    }
    return defaultCharset;
}
}

JVM 1.6:

public static Charset defaultCharset() {
    if (defaultCharset == null) {
    synchronized (Charset.class) {
 java.security.PrivilegedAction pa =
     new GetPropertyAction("file.encoding");
 String csn = (String)AccessController.doPrivileged(pa);
 Charset cs = lookup(csn);
 if (cs != null)
     defaultCharset = cs;
            else 
     defaultCharset = forName("UTF-8");
        }
}
return defaultCharset;
}

When you set the file encoding to file.encoding=Latin-1 the next time you call Charset.defaultCharset(), what happens is, because the cached default charset isn't set, it will try to find the appropriate charset for the name Latin-1. This name isn't found, because it's incorrect, and returns the default UTF-8.

As for why the IO classes such as OutputStreamWriter return an unexpected result,
the implementation of sun.nio.cs.StreamEncoder (witch is used by these IO classes) is different as well for JVM 1.5 and JVM 1.6. The JVM 1.6 implementation is based in the Charset.defaultCharset() method to get the default encoding, if one is not provided to IO classes. The JVM 1.5 implementation uses a different method Converters.getDefaultEncodingName(); to get the default charset. This method uses it's own cache of the default charset that is set upon JVM initialization:

JVM 1.6:

   public static StreamEncoder forOutputStreamWriter(OutputStream out,
                                                     Object lock,
                                                     String charsetName)
       throws UnsupportedEncodingException
   {
       String csn = charsetName;
       if (csn == null)
           csn = Charset.defaultCharset().name();
       try {
           if (Charset.isSupported(csn))
               return new StreamEncoder(out, lock, Charset.forName(csn));
       } catch (IllegalCharsetNameException x) { }
       throw new UnsupportedEncodingException (csn);
   }

JVM 1.5:

public static StreamEncoder forOutputStreamWriter(OutputStream out,
           Object lock,
           String charsetName)
throws UnsupportedEncodingException
{
String csn = charsetName;
if (csn == null)
    csn = Converters.getDefaultEncodingName();
if (!Converters.isCached(Converters.CHAR_TO_BYTE, csn)) {
    try {
 if (Charset.isSupported(csn))
     return new CharsetSE(out, lock, Charset.forName(csn));
    } catch (IllegalCharsetNameException x) { }
}
return new ConverterSE(out, lock, csn);
}

But I agree with the comments. You shouldn't rely on this property. It's an implementation detail.

bruno conde 2009-11-17 14:25:00

To reproduce this error, you must be on Java 5 and your JRE default encoding must be UTF-8.

ZZ Coder 2009-11-17 14:27:30

This is writing to the implementation, not the abstraction. If you rely on undocumented stuff, don't be surprised if your code breaks when you upgrade to a newer version of the platform.

McDowell 2009-11-17 14:35:27

Answer 2

+3 A:

Is this a bug or feature?

Looks like undefined behaviour. I know that, in practice, you can change the default encoding using a command-line property, but I don't think what happens when you do this is defined.

Bug ID: 4153515 on problems setting this property:

This is not a bug. The "file.encoding" property is not required by the J2SE platform specification; it's an internal detail of Sun's implementations and should not be examined or modified by user code. It's also intended to be read-only; it's technically impossible to support the setting of this property to arbitrary values on the command line or at any other time during program execution.

The preferred way to change the default encoding used by the VM and the runtime system is to change the locale of the underlying platform before starting your Java program.

I cringe when I see people setting the encoding on the command line - you don't know what code that is going to affect.

If you do not want to use the default encoding, set the encoding you do want explicitly via the appropriate method/constructor.

McDowell 2009-11-17 14:30:01

Answer 3

+2 A:

First, Latin-1 is the same as ISO-8859-1, so, the default was already OK for you. Right?

You successfully set the encoding to ISO-8859-1 with your command line parameter. You also set it programmatically to "Latin-1", but, that's not a recognized value of a file encoding for Java. See http://java.sun.com/javase/6/docs/technotes/guides/intl/encoding.doc.html

When you do that, looks like Charset resets to UTF-8, from looking at the source. That at least explains most of the behavior.

I don't know why OutputStreamWriter shows ISO8859_1. It delegates to closed-source sun.misc.* classes. I'm guessing it isn't quite dealing with encoding via the same mechanism, which is weird.

But of course you should always be specifying what encoding you mean in this code. I'd never rely on the platform default.

Sean Owen 2009-11-17 14:30:11

Answer 4

+2 A:

The behaviour is not really that strange. Looking into the implementation of the classes, it is caused by:

Charset.defaultCharset() is not caching the determined character set in Java 5.
Setting the system property "file.encoding" and invoking Charset.defaultCharset() again causes a second evaluation of the system property, no character set with the name "Latin-1" is found, so Charset.defaultCharset defaults to "UTF-8".
The OutputStreamWriter is however caching the default character set and is probably used already during VM initialization, so that its default character set diverts from Charset.defaultCharset() if the system property "file.encoding" has been changed at runtime.

As already pointed out, it is not documented how the VM must behave in such a situation. The Charset.defaultCharset() API documentation is not very precise on how the default character set is determined, only mentioning that it is usually done on VM startup, based on factors like the OS default character set or default locale.

jarnbjo 2009-11-17 15:21:22

ansaurus

tags:

views:

answers:

How to Find Default Charset/Encoding in Java?

related questions