ansaurus

Question

File name charset problem in java

Answer 1

+1 A:

Try this:

http://stackoverflow.com/questions/1545625/java-cant-open-a-file-with-surrogate-unicode-values-in-the-filename

Steve Perkins 2010-09-30 16:38:25

No solution seems to have been given in that post. To answer the questions asked there: in my case the encoding is always UTF-8.

Llistes Sugra 2010-09-30 18:15:20

Answer 2

+1 A:

I am trying to track down the problem. Here is what I already have:

There is Exists.java:

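import java.io.File;

public class Exists {
  public static void main(String[] args) {
    // The return values are ignored on purpose: only the resulting
    // stat() system calls matter, and those are observed with strace.
    new File("aaa").exists();                  // plain ASCII name
    new File("aaa\u00E4").exists();            // "aaaä"
    new File("aaa\u00C3\u00A4").exists();      // "aaaÃ¤", i.e. UTF-8 bytes of ä misread as Latin-1
  }
}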

And there is java -version:

java version "1.6.0_20"
Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01, mixed mode)

Now to the interesting part:

$ strace -f -o strace.out java Exists && grep 'stat("aaa' strace.out
31942 stat("aaa", 0x41464950)           = -1 ENOENT (No such file or directory)
31942 stat("aaa\303\244", 0x41464950)   = -1 ENOENT (No such file or directory)
31942 stat("aaa\303\203\302\244", 0x41464950) = -1 ENOENT (No such file or directory)

The nice thing is that strace works at the byte level, not at the character level like Java. The octal escapes \303\244 are the UTF-8 encoding of ä, so everything is OK in this case. I have the environment variable LANG set to en_US.UTF-8, and all of the LC_* variables are unset.
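The byte-level view can also be reproduced from within Java. This is a minimal sketch, assuming UTF-8 as the target charset; the class name FileNameBytes and the helper straceStyle are made up for illustration, and the escaping only approximates what strace prints (it does not handle control characters, quotes or backslashes the way strace does):

```java
import java.nio.charset.Charset;

public class FileNameBytes {
  /**
   * Encodes the name with the given charset and renders the bytes roughly
   * like strace does: printable ASCII as-is, other bytes as octal escapes.
   */
  static String straceStyle(String name, Charset charset) {
    StringBuilder sb = new StringBuilder();
    for (byte b : name.getBytes(charset)) {
      int v = b & 0xFF; // treat the byte as unsigned
      if (v >= 0x20 && v < 0x7F) {
        sb.append((char) v);
      } else {
        sb.append('\\').append(Integer.toOctalString(v));
      }
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    Charset utf8 = Charset.forName("UTF-8");
    System.out.println(straceStyle("aaa", utf8));               // aaa
    System.out.println(straceStyle("aaa\u00E4", utf8));         // aaa\303\244
    System.out.println(straceStyle("aaa\u00C3\u00A4", utf8));   // aaa\303\203\302\244
  }
}
```

The three lines printed match the three file names in the stat() calls above, which is what one would expect if the JVM encodes file names as UTF-8 under this locale.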

Now tracking down the problem to a minimal working example:

$ strace -f -o strace.out env - LC_ALL=en_US.UTF-8 /home/roland/bin/java Exists && grep 'stat("aaa' strace.out
31968 stat("aaa", 0x41a75950)           = -1 ENOENT (No such file or directory)
31968 stat("aaa\303\244", 0x41a75950)   = -1 ENOENT (No such file or directory)
31968 stat("aaa\303\203\302\244", 0x41a75950) = -1 ENOENT (No such file or directory)

That still works. So let's try another encoding:

$ strace -f -o strace.out env - LANG=en_US.ISO-8859-1 /home/roland/bin/java Exists && grep 'stat("aaa' strace.out
32070 stat("aaa", 0x407a3950)           = -1 ENOENT (No such file or directory)
32070 stat("aaa?", 0x407a3950)          = -1 ENOENT (No such file or directory)
32070 stat("aaa??", 0x407a3950)         = -1 ENOENT (No such file or directory)

So this doesn't work. One possible reason is that I selected a locale that is not in the list printed by locale -a. But that alone shouldn't be a reason for Java to convert the letters to question marks.
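For illustration, here is how question marks of exactly this shape can arise inside Java. This is only a sketch of one plausible mechanism, not something confirmed in this thread: it assumes that the JVM ends up encoding file names with an ASCII-only charset (for example after falling back to the C/POSIX locale). String.getBytes(Charset) replaces every character the charset cannot represent with the charset's replacement byte, which is '?' for US-ASCII. The class name QuestionMarks and the helper encodeAndShow are hypothetical.

```java
import java.nio.charset.Charset;

public class QuestionMarks {
  /**
   * Encodes the name with the named charset, then decodes the bytes as
   * ISO-8859-1 so that every byte shows up as exactly one character.
   */
  static String encodeAndShow(String name, String charsetName) {
    byte[] bytes = name.getBytes(Charset.forName(charsetName));
    return new String(bytes, Charset.forName("ISO-8859-1"));
  }

  public static void main(String[] args) {
    // Assumption: US-ASCII stands in for whatever charset the JVM picked.
    System.out.println(encodeAndShow("aaa", "US-ASCII"));              // aaa
    System.out.println(encodeAndShow("aaa\u00E4", "US-ASCII"));        // aaa?
    System.out.println(encodeAndShow("aaa\u00C3\u00A4", "US-ASCII"));  // aaa??
  }
}
```

One unmappable character yields one '?', which matches the "aaa?" and "aaa??" seen in the strace output. A real ISO-8859-1 encoding would instead have produced the single bytes 0xE4, 0xC3 and 0xA4.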

As soon as LANG points to a non-existent locale, setting the sun.jnu.encoding property no longer has any effect. So I'm out of ideas now.
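To see what the JVM actually picked up under a given environment, the relevant settings can be printed from inside the process. This is a minimal diagnostic sketch; the class name ShowEncodings is made up, and sun.jnu.encoding is an implementation-specific property of the Sun/Oracle JVM, so it may be null on other JVMs.

```java
import java.nio.charset.Charset;

public class ShowEncodings {
  /** Collects the locale-related settings visible to this JVM. */
  static String describe() {
    StringBuilder sb = new StringBuilder();
    sb.append("LANG=").append(System.getenv("LANG")).append('\n');
    sb.append("LC_ALL=").append(System.getenv("LC_ALL")).append('\n');
    sb.append("file.encoding=").append(System.getProperty("file.encoding")).append('\n');
    // Implementation-specific; used by the Sun JVM for native strings such as file names.
    sb.append("sun.jnu.encoding=").append(System.getProperty("sun.jnu.encoding")).append('\n');
    sb.append("defaultCharset=").append(Charset.defaultCharset().name()).append('\n');
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.print(describe());
  }
}
```

Running it the same way as Exists, for example with env - LANG=en_US.ISO-8859-1 /home/roland/bin/java ShowEncodings, shows whether the requested locale took effect or whether the JVM silently chose a different charset.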

Roland Illig 2010-09-30 22:10:48

A question mark is what you would expect to see when ISO-8859-1 data is interpreted as UTF-8. You seem to be doing the opposite, so the output should look something like "Ã¤". My guess is that this is a console issue: something that strace converted to ISO is being written out as UTF-8 again.

Llistes Sugra 2010-10-01 10:52:29

No, it isn't. Why should the UTF-8 bytes be displayed as octal escapes and the Latin-1 ones not? As I said, `strace` works at the byte level. Otherwise it would be useless for binary data.

Roland Illig 2010-10-01 23:33:52
