I have a problem where I can't write files with accents in the file name on Solaris.

Given the following code:

public static void main(String[] args) {
    System.out.println("Charset = " + Charset.defaultCharset());
    System.out.println("testéörtkuoë");
    FileWriter fw = null;
    try {
        fw = new FileWriter("testéörtkuoë");
        fw.write("testéörtkuoëéörtkuoë");
        fw.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}

I get the following output:

Charset = ISO-8859-1
test??rtkuo?

and I get a file called "test??rtkuo?"

Based on info I found on Stack Overflow, I tried starting the Java app with "-Dfile.encoding=UTF-8". This produces the following output:

Charset = UTF-8
testéörtkuoë

But the filename is still "test??rtkuo?"

Any help is much appreciated.

Stef

+3  A: 

Do you get the same problem if you use unicode literals (\uXXXX) instead of having unicode in the actual source file?

Does the filesystem definitely support UTF-8 file names? Does the tool you're using to view the file on the filesystem (ls?) support them?
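To take the source-file encoding out of the picture entirely, the test can be rewritten with Unicode escapes. A minimal sketch (the class and method names here are mine, and try-with-resources needs Java 7+):

```java
import java.io.FileWriter;
import java.io.IOException;
import java.nio.charset.Charset;

public class AccentedFileName {
    // Build the name from Unicode escapes so the compiler's source
    // encoding cannot corrupt it: \u00E9 = é, \u00F6 = ö, \u00EB = ë
    static String accentedName() {
        return "test\u00E9\u00F6rtkuo\u00EB";
    }

    public static void main(String[] args) throws IOException {
        System.out.println("Charset = " + Charset.defaultCharset());
        try (FileWriter fw = new FileWriter(accentedName())) {
            fw.write(accentedName());
        }
        System.out.println("wrote: " + accentedName());
    }
}
```

If the file name on disk is still mangled with this version, the source encoding is ruled out and the problem lies between the JVM and the OS.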

sk
Yes, the problem also occurs when using Unicode literals. I view the files using "ls" from the shell. Existing file names with accents (created on a shared volume from another machine) display fine in "ls". However, in the shell I cannot type or copy/paste a character with an accent, so it may be related to some Solaris setting.
+1  A: 

If you attempt to list the filenames with the java io apis, what do you see? Are they encoded correctly? I'm curious as to whether the real problem is with encoding the filenames or with the tools that you are using to check them.
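One way to make that check independent of the terminal is to print each file name alongside its code points, so the shell's rendering is no longer a factor. A sketch (class and method names are mine):

```java
import java.io.File;

public class ListNames {
    // Render a string as its UTF-16 code units in hex, e.g. "ë" -> "00eb",
    // so a terminal that cannot display accents is taken out of the loop
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("%04x", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // List the current directory and dump each name's code points
        for (String name : new File(".").list()) {
            System.out.println(name + " : " + codePoints(name));
        }
    }
}
```

A correctly decoded "é" should show up as 00e9; a replacement character would show as fffd, and a literal "?" as 003f.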

jsight
The problem also occurs when I list the file names using the API. The tool I use to show the files is a simple "ls" from the shell. Using "ls" I can see other files that do have accents. However, in the shell I cannot type accented characters, so this may be a problem with my Solaris environment.
A: 

What happens when you do:

ls > testéörtkuoë

If that works (writes to the file correctly), then you know you can write to files with accents.

Elijah
This does not work because I cannot type accented characters in my Solaris shell. I can see other files with accents when I do an "ls", however. This may be a problem with my Solaris environment settings.
Can you paste the filename into the shell?
Elijah
no, this does not work either. The accented characters are skipped when pasting
To help you solve your problem, could you please tell us what is the name of the shell you are using and what type of filesystem you are writing to?
Elijah
+2  A: 

All these characters are present in ISO-8859-1. I suspect part of the problem is that the code editor is saving files in a different encoding to the one your operating system is using.

If the editor is using ISO-8859-1, I would expect it to encode ëéö as:

eb e9 f6

If the editor is using UTF-8, I would expect it to encode ëéö as:

c3ab c3a9 c3b6

Other encodings will produce different values.

The source file would be more portable if you used Unicode escape sequences. At least be certain your compiler is using the same encoding as the editor.

Examples:

ë    \u00EB
é    \u00E9
ö    \u00F6

You can look up these values using the Unicode charts.
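Those byte sequences are easy to verify from Java itself. A small sketch (class and method names are mine; StandardCharsets needs Java 7+):

```java
import java.nio.charset.StandardCharsets;

public class EncodingBytes {
    // Hex-dump a byte array, e.g. {0xEB, 0xE9, 0xF6} -> "eb e9 f6"
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u00EB\u00E9\u00F6"; // ëéö, via Unicode escapes
        System.out.println("ISO-8859-1: " + hex(s.getBytes(StandardCharsets.ISO_8859_1)));
        System.out.println("UTF-8:      " + hex(s.getBytes(StandardCharsets.UTF_8)));
    }
}
```

Running this prints "eb e9 f6" for ISO-8859-1 and "c3 ab c3 a9 c3 b6" for UTF-8, matching the tables above.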

Changing the default file encoding using -Dfile.encoding=UTF-8 might have unintended consequences for how the JVM interacts with the system.

There are parallels here with problems you might see on Windows.

I'm unable to reproduce the problem directly - my version of OpenSolaris uses UTF-8 as the default encoding.

McDowell
A: 

I had a similar problem. Unlike that example, my program was unable to list the files correctly using System.out.println, even though "ls" showed the correct values.

As described in the documentation, file.encoding is a system property, not an environment variable; it should not be used to define the charset, and in this case the JVM ignores it.

The symptoms:

  1. I could not type accents in the shell.
  2. "ls" showed the correct values.
  3. File.list() printed incorrect values.
  4. The system property file.encoding did not affect the output.
  5. The system properties user.language and user.country did not affect the output.

The solution:

Although the LC_* environment variables were set in the shell with values inherited from /etc/default/init, as listed by the set command, locale showed different values.

$ set | grep LC
LC_ALL=pt_BR.ISO8859-1
LC_COLLATE=pt_BR.ISO8859-1
LC_CTYPE=pt_BR.ISO8859-1
LC_MESSAGES=C
LC_MONETARY=pt_BR.ISO8859-1
LC_NUMERIC=pt_BR.ISO8859-1
LC_TIME=pt_BR.ISO8859-1

$ locale
LANG=
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_ALL=

The solution was simply to export LANG. This environment variable really does affect the JVM:

LANG=pt_BR.ISO8859-1
export LANG
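To confirm which encodings the JVM actually picked up from the environment, it can help to print the relevant system properties. Note that sun.jnu.encoding (the property Sun/Oracle JVMs use for file names) is internal and undocumented, so treat this as a diagnostic sketch only; the class and method names are mine:

```java
public class ShowEncodings {
    // Return a system property, or a placeholder if it is not set
    static String prop(String name) {
        return System.getProperty(name, "(unset)");
    }

    public static void main(String[] args) {
        // file.encoding drives character I/O defaults;
        // sun.jnu.encoding (internal, Sun/Oracle JVMs) drives file-name encoding
        for (String p : new String[] {"file.encoding", "sun.jnu.encoding",
                                      "user.language", "user.country"}) {
            System.out.println(p + " = " + prop(p));
        }
    }
}
```

Run once before and once after exporting LANG to see what actually changed.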
monzeu
A: 

Java uses the operating system's default encoding when reading and writing files. You should never rely on that; it is always good practice to specify the encoding explicitly.

In Java you can use following for reading and writing:

Reading:

BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(inputPath),"UTF-8"));

Writing:

PrintWriter pw = new PrintWriter(new BufferedWriter(new OutputStreamWriter(new FileOutputStream(outputPath), "UTF-8")));
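A complete round trip with an explicit charset might look like this (class and method names are mine; StandardCharsets and try-with-resources need Java 7+):

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class ExplicitEncodingIO {
    // Write text as UTF-8 regardless of the platform default encoding
    static void writeUtf8(File file, String text) throws IOException {
        try (Writer w = new OutputStreamWriter(
                new FileOutputStream(file), StandardCharsets.UTF_8)) {
            w.write(text);
        }
    }

    // Read the file back, again decoding explicitly as UTF-8
    static String readUtf8(File file) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(
                new FileInputStream(file), StandardCharsets.UTF_8)) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("accents", ".txt");
        f.deleteOnExit();
        writeUtf8(f, "test\u00E9\u00F6rtkuo\u00EB");
        System.out.println(readUtf8(f)); // same string back, independent of the default charset
    }
}
```

Note this fixes the encoding of file *contents*; the encoding of file *names* is still governed by the JVM's interaction with the OS locale, as discussed above.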
mohitsoni