When I try to open a file whose name contains accented characters, Java reports that the file cannot be found, apparently because of a charset mismatch. I work in UTF-8 on a Linux system (/etc/locales sets UTF-8 as well). JBoss runs with -Dfile.encoding=UTF-8 and the environment variable JBOSS_ENCODING="UTF-8".

In a JSP I get the name of the file:

String fileName = element.getChildText("FileName");
out.println("File to be opened : " + fileName);

Displays :

File to be opened : aaaaaà.txt

But new File(fileName) does not work: file.exists() simply returns false.

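When I list the directory instead:

File[] files = dir.listFiles();
for (int i = 0; i < files.length; i++) {
      // print the name as java.io.File reports it, not the name from the XML
      out.println(files[i].getName());
}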

I get : aaaaaà .txt
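For illustration only, here is a minimal sketch of how such a garbled name can arise, assuming the bytes on disk are UTF-8 and something decodes them as ISO-8859-1. The class and method names are made up for this example, and StandardCharsets needs Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    // Encode with UTF-8, then decode the same bytes as ISO-8859-1.
    // Every non-ASCII character turns into two Latin-1 characters.
    static String utf8ReadAsLatin1(String name) {
        byte[] utf8Bytes = name.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String garbled = utf8ReadAsLatin1("aaaaa\u00E0.txt");
        // Print code points so the result does not depend on the console encoding.
        for (char c : garbled.toCharArray()) {
            System.out.printf("U+%04X ", (int) c);
        }
        System.out.println();
    }
}
```

The single character U+00E0 becomes U+00C3 followed by U+00A0 (a no-break space), so the garbled name is one character longer than the original and shows what looks like a blank before ".txt".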

Why does Java treat the file names on disk as ISO-8859-1 when it reads the directory and tries to open the file? Is it a JBoss setting? A Java setting? How can I force java.io.File to use UTF-8 as the charset for file names?

I've used other tools and the name is always read fine, using UTF-8.

(Note that I'm always talking about the name of the file, never its content; it could be an empty file.)

+1  A: 

Try this:

http://stackoverflow.com/questions/1545625/java-cant-open-a-file-with-surrogate-unicode-values-in-the-filename

Steve Perkins
No solution seems to have been given in that post. As for the questions asked there, the answer in my case is always UTF-8.
Llistes Sugra
+1  A: 

I am trying to track down the problem. Here is what I already have:

There is Exists.java:

import java.io.*;

public class Exists {
  public static void main(String[] args) {
    new File("aaa").exists();
    new File("aaa\u00E4").exists();
    new File("aaa\u00C3\u00A4").exists();
  }
}

And there is java -version:

java version "1.6.0_20"
Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01, mixed mode)

Now to the interesting part:

$ strace -f -o strace.out java Exists && grep 'stat("aaa' strace.out
31942 stat("aaa", 0x41464950)           = -1 ENOENT (No such file or directory)
31942 stat("aaa\303\244", 0x41464950)   = -1 ENOENT (No such file or directory)
31942 stat("aaa\303\203\302\244", 0x41464950) = -1 ENOENT (No such file or directory)

The nice thing is that strace works at the byte level, not at the character level like Java. So everything is OK in this case: the file names reach the kernel as UTF-8. I have the environment variable LANG set to en_US.UTF-8, and all of the LC_* variables are unset.
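To check that the octal escapes in the trace really are the UTF-8 bytes of the three names, the encoding can be reproduced in Java. This is only a rough sketch: the helper name is invented here, and the escaping merely approximates what strace prints (strace uses forms like \n for some control bytes).

```java
import java.nio.charset.StandardCharsets;

final class OctalEscapes {
    // Printable ASCII bytes are kept as-is, every other byte becomes a backslash plus its octal value.
    static String asUtf8Octal(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF; // treat the byte as unsigned
            if (v >= 0x20 && v < 0x7F) {
                sb.append((char) v);
            } else {
                sb.append('\\').append(Integer.toOctalString(v));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(asUtf8Octal("aaa"));             // aaa
        System.out.println(asUtf8Octal("aaa\u00E4"));       // aaa\303\244
        System.out.println(asUtf8Octal("aaa\u00C3\u00A4")); // aaa\303\203\302\244
    }
}
```

The three results match the three stat() arguments above, which is what one would expect if the JVM encodes path names as UTF-8 under this locale.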

Now tracking down the problem to a minimal working example:

$ strace -f -o strace.out env - LC_ALL=en_US.UTF-8 /home/roland/bin/java Exists && grep 'stat("aaa' strace.out
31968 stat("aaa", 0x41a75950)           = -1 ENOENT (No such file or directory)
31968 stat("aaa\303\244", 0x41a75950)   = -1 ENOENT (No such file or directory)
31968 stat("aaa\303\203\302\244", 0x41a75950) = -1 ENOENT (No such file or directory)

That still works. So let's try another encoding:

$ strace -f -o strace.out env - LANG=en_US.ISO-8859-1 /home/roland/bin/java Exists && grep 'stat("aaa' strace.out
32070 stat("aaa", 0x407a3950)           = -1 ENOENT (No such file or directory)
32070 stat("aaa?", 0x407a3950)          = -1 ENOENT (No such file or directory)
32070 stat("aaa??", 0x407a3950)         = -1 ENOENT (No such file or directory)

So this doesn't work. One possible reason might be that I selected a locale that is not in the list printed by locale -a. But this shouldn't be the reason for Java to convert the letters to question marks.
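One possible explanation, which is an assumption on my part and not something the trace proves: when the requested locale does not exist, the C library falls back to the "C" locale, and the JVM then encodes file names with an ASCII-like charset. Encoding to US-ASCII in Java produces exactly this pattern, because String.getBytes replaces every unmappable character with '?'. The class name below is made up for the demonstration.

```java
import java.nio.charset.StandardCharsets;

final class AsciiFallback {
    // Encode to US-ASCII and decode again, so the replacement bytes become visible.
    static String encodeAsAscii(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        System.out.println(encodeAsAscii("aaa"));             // aaa
        System.out.println(encodeAsAscii("aaa\u00E4"));       // aaa?
        System.out.println(encodeAsAscii("aaa\u00C3\u00A4")); // aaa??
    }
}
```

One question mark per non-ASCII character matches the stat() arguments in the trace: one for the second name, two for the third.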

As soon as LANG points to a non-existing locale, the setting of the sun.jnu.encoding property doesn't have any effect anymore. So I'm out of ideas now.
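To see what the JVM actually picked up, a small report like the following can be run under each of the environments above (or pasted into a JSP inside JBoss). It is only a diagnostic sketch. Note that sun.jnu.encoding is an internal property of Sun-derived JVMs, so it may be missing elsewhere, and the class name is invented for this example.

```java
import java.nio.charset.Charset;

final class EncodingReport {
    // Collect the settings that are likely to influence how file names are encoded.
    static String report() {
        StringBuilder sb = new StringBuilder();
        sb.append("file.encoding=").append(System.getProperty("file.encoding")).append('\n');
        sb.append("sun.jnu.encoding=").append(System.getProperty("sun.jnu.encoding")).append('\n');
        sb.append("defaultCharset=").append(Charset.defaultCharset().name()).append('\n');
        sb.append("LANG=").append(System.getenv("LANG")).append('\n');
        sb.append("LC_ALL=").append(System.getenv("LC_ALL")).append('\n');
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(report());
    }
}
```

Running it once with a working UTF-8 locale and once with the nonexistent locale should show whether the values differ between the two cases.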

Roland Illig
A question mark is supposed to be displayed when ISO-8859-1 data is interpreted as UTF-8. You seem to be doing the opposite, so it should write something like "÷". I guess this is a console issue: something that strace converted to ISO is being written as UTF-8 (again).
Llistes Sugra
No, it isn't. Why should the UTF-8 bytes be displayed as octal escapes and the latin1 ones not? As I said, `strace` works on byte-level. Otherwise it would be useless for binary data.
Roland Illig