I'm struggling with a strange file name encoding issue when listing directory contents in Java 6 on both OS X and Linux: the File.listFiles() and related methods seem to return file names in a different encoding than the rest of the system.

Note that it is not merely the display of these file names that is causing me problems. I'm mainly interested in doing a comparison of file names with a remote file storage system, so I care more about the content of the name strings than the character encoding used to print output.

Here is a program to demonstrate. It creates a file with a Unicode name, then prints URL-encoded versions of that name as obtained from the directly-created File object and as returned when the parent directory is listed (you should run this code in an empty directory). The results show that File.listFiles() returns the name in a different encoding.

import java.io.File;
import java.net.URLEncoder;

public class FileNameEncodingTest {
    public static void main(String[] args) throws Exception {
        String fileName = "Trîcky Nåme";
        File file = new File(fileName);
        file.createNewFile();
        System.out.println("File name: " + URLEncoder.encode(file.getName(), "UTF-8"));

        // Get the parent (current) dir and list its contents
        File parentDir = file.getAbsoluteFile().getParentFile();
        File[] children = parentDir.listFiles();
        for (File child : children) {
            System.out.println("Listed name: " + URLEncoder.encode(child.getName(), "UTF-8"));
        }
    }
}

Here's what I get when I run this test code on my systems. Note the %CC versus %C3 character representations.

OS X Snow Leopard:

File name: Tri%CC%82cky+Na%CC%8Ame
Listed name: Tr%C3%AEcky+N%C3%A5me

$ java -version
java version "1.6.0_20"
Java(TM) SE Runtime Environment (build 1.6.0_20-b02-279-10M3065)
Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01-279, mixed mode)

KUbuntu Linux (running in a VM on same OS X system):

File name: Tri%CC%82cky+Na%CC%8Ame
Listed name: Tr%C3%AEcky+N%C3%A5me

$ java -version
java version "1.6.0_18"
OpenJDK Runtime Environment (IcedTea6 1.8.1) (6b18-1.8.1-0ubuntu1)
OpenJDK Client VM (build 16.0-b13, mixed mode, sharing)

I have tried various hacks to get the strings to agree, including setting the file.encoding system property and various LC_CTYPE and LANG environment variables. Nothing helps, nor do I want to resort to such hacks.

Unlike this (somewhat related?) question, I am able to read data from the listed files despite the odd names: http://stackoverflow.com/questions/2423781/chinese-encoding-issue-while-listing-files

I'm out of ideas, and after many hours of fruitless debugging and Googling I'm about ready for some enlightenment.

Solution

Thanks to Stephen P for putting me on the right track.

The fix first, for the impatient. If you are compiling with Java 6 you can use the java.text.Normalizer class to normalize strings into a common form of your choice, e.g.

// Normalize to "Normalization Form Canonical Decomposition" (NFD)
protected String normalizeUnicode(String str) {
    Normalizer.Form form = Normalizer.Form.NFD;
    if (!Normalizer.isNormalized(str, form)) {
        return Normalizer.normalize(str, form);
    }
    return str;
}
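
For example, once both sides are brought to the same form the comparison behaves as expected. A quick self-contained sketch (the two literals below are just stand-ins for a locally-listed name and a name received from the remote store):

import java.text.Normalizer;

public class NameCompare {
    public static void main(String[] args) {
        String listedName = "Tr\u00EEcky N\u00E5me";    // precomposed (NFC), as listFiles() returned it on my systems
        String remoteName = "Tri\u0302cky Na\u030Ame";  // decomposed (NFD), as received over HTTP
        Normalizer.Form form = Normalizer.Form.NFD;
        boolean same = Normalizer.normalize(listedName, form)
                .equals(Normalizer.normalize(remoteName, form));
        System.out.println(same);   // prints true once both are in the same form
    }
}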

Since java.text.Normalizer is only available in Java 6 and later, if you need to compile with Java 5 you might have to resort to the sun.text.Normalizer implementation via something like a reflection-based hack; see also http://stackoverflow.com/questions/1279910

This alone is enough for me to decide I won't support compilation of my project with Java 5 :|

Here are other interesting things I learned in this sordid adventure.

  • The confusion is caused by the file names being in one of two normalization forms which cannot be directly compared: Normalization Form Canonical Decomposition (NFD) or Normalization Form Canonical Composition (NFC). The former tends to have plain ASCII letters followed by combining "modifier" characters that add accents etc., while the latter uses only the precomposed extended characters with no leading ASCII character. Read the wiki page Stephen P references for a better explanation; there is also a small code-point dump after this list that makes the difference concrete.

  • Unicode string literals like the one contained in the example code (and those received via HTTP in my real app) are in the NFD form, while file names returned by the File.listFiles() method are NFC. The following mini-example demonstrates the differences:

    String name = "Trîcky Nåme";
    System.out.println("Original name: " + URLEncoder.encode(name, "UTF-8"));
    System.out.println("NFC Normalized name: " + URLEncoder.encode(
        Normalizer.normalize(name, Normalizer.Form.NFC), "UTF-8"));
    System.out.println("NFD Normalized name: " + URLEncoder.encode(
        Normalizer.normalize(name, Normalizer.Form.NFD), "UTF-8"));
    

    Output:

    Original name: Tri%CC%82cky+Na%CC%8Ame
    NFC Normalized name: Tr%C3%AEcky+N%C3%A5me
    NFD Normalized name: Tri%CC%82cky+Na%CC%8Ame
    
  • If you construct a File object with a string name, the File.getName() method will return the name in whatever form you gave it originally. However, if you call File methods that discover names on their own, they seem to return names in NFC form. This is potentially a nasty gotcha; it certainly got me.

  • According to the quote below from Apple's documentation, file names are stored in decomposed (NFD) form on the HFS Plus file system:

    When working within Mac OS you will find yourself using a mixture of precomposed and decomposed Unicode. For example, HFS Plus converts all file names to decomposed Unicode, while Macintosh keyboards generally produce precomposed Unicode.

    So the File.listFiles() method helpfully (?) converts file names to the (pre)composed (NFC) form.
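
To see the difference at the code-point level, here is a quick throwaway sketch (the class and helper names are mine, purely for illustration):

import java.text.Normalizer;

public class FormInspector {
    // Render a string as a list of U+XXXX code points
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i += Character.charCount(s.codePointAt(i))) {
            sb.append(String.format("U+%04X ", s.codePointAt(i)));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String nfc = Normalizer.normalize("\u00EE", Normalizer.Form.NFC);
        String nfd = Normalizer.normalize("\u00EE", Normalizer.Form.NFD);
        System.out.println("NFC: " + codePoints(nfc));   // NFC: U+00EE
        System.out.println("NFD: " + codePoints(nfd));   // NFD: U+0069 U+0302
    }
}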

A: 

I've seen something similar before. People who uploaded files from their Mac to a webapp used filenames containing é.

a) In OS X that char is a normal e + a "sign for ´ applied to the previous char"

b) In Windows it's a special char: é

Both are valid Unicode. So... I understand you pass the (b) form to the File constructor and at some point Mac OS converts it to the (a) form. Maybe if you search for this double-representation issue you can find a way to handle both situations successfully.

Hope it helps!

helios
Well, in fact, the opposite is happening: you type the name in the Java file with your Mac keyboard as the (a) option, and the system converts it [at file creation time] to the (b) option.
helios
Thanks, that's exactly the right track. I added a discussion to my question that expands a little on the different Unicode forms and when you get each kind from `File` methods.
James Murty
A: 

On a Unix file system, a file name really is just a null-terminated byte[], so the Java runtime has to convert from java.lang.String to byte[] during the createNewFile() operation. That char-to-byte conversion is governed by the locale. I tested with LC_ALL set to en_US.UTF-8 and en_US.ISO-8859-1 and got coherent results. This is with Sun (...Oracle) Java 1.6.0_20. However, for LC_ALL=en_US.POSIX the result is:

File name:   Tr%C3%AEcky+N%C3%A5me
Listed name: Tr%3Fcky+N%3Fme

0x3F is a question mark, which tells me the conversion failed for the non-ASCII characters. Then again, that is exactly what you would expect with that locale.

But your two strings differ because of the equivalence between the \u00EE character (C3 AE in UTF-8) and the sequence i+\u0302 (69 CC 82 in UTF-8). \u0302 is a combining diacritical mark (combining circumflex accent). Some sort of normalization occurred during file creation; I'm not sure whether it's done by the Java runtime or the OS.

NOTE: It took me some time to figure this out, since the code snippet you posted does not contain a combining diacritical mark but the equivalent precomposed character î (i.e. \u00EE). You should have embedded the Unicode escape sequence in the string literal (but it's easy to say that afterward...).
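
For instance, with explicit escapes the intended form is unambiguous (these two literals are just an illustration, not part of the fix):

String decomposed  = "Tri\u0302cky Na\u030Ame";  // i + U+0302 combining circumflex, a + U+030A combining ring
String precomposed = "Tr\u00EEcky N\u00E5me";    // precomposed U+00EE and U+00E5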

gawi
Your point about the equivalence of two different Unicode forms is exactly right. Interestingly, if I had included the explicit \u code for a combined/composed diacritical in the example it would have obscured the differences I was seeing in my real app. A case where doing it wrong worked out well.
James Murty
A: 

I suspect that you just have to tell javac which encoding to use when compiling the .java file containing the special characters, since you've hardcoded them in the source file. Otherwise the platform default encoding will be used, which may not be UTF-8 at all.

You can use the javac option -encoding for this.

javac -encoding UTF-8 com/example/Foo.java

This way the resulting .class file will end up containing the correct characters and you will be able to create and list the correct filename as well.

BalusC
Although the Unicode file name is hard-coded into the sample program, my real program sources data from the file system or a web service. No hard-coded strings are involved. I did try the `-encoding` option with my example code but it didn't make a difference there either.
James Murty
+3  A: 

Using Unicode, there is more than one valid way to represent the same letter. The characters you're using in your Tricky Name are a "latin small letter i with circumflex" and a "latin small letter a with ring above".

You say "Note the %CC versus %C3 character representations", but looking closer what you see are the sequences

i\uCC82 vs. \uC3AE
a\uCC8A vs. \uC3A5

That is, the first is letter i followed by 0xCC82 the "combining circumflex accent" character while the second is "latin small letter i with circumflex". Similarly for the other pair, the first is the letter a followed by 0xCC8A the "combining ring above" character and the second is "latin small letter a with ring above". Both of these are valid UTF-8 encodings of valid Unicode character strings, but one is in "composed" and the other in "decomposed" format.

OS X HFS Plus volumes store strings (e.g. filenames) as "fully decomposed". On other Unix file systems, what gets stored depends on how the particular filesystem driver chooses to store it, so you can't make any blanket statements across different types of filesystems.

See the Wikipedia article on Unicode Equivalence for general discussion of composed vs decomposed forms, which mentions OS X specifically.

See Apple's Tech Q&A QA1235 (in Objective-C unfortunately) for information on converting forms.

A recent email thread on Apple's java-dev mailing list could be of some help to you.

Basically, you need to normalize the decomposed form into a composed form before you can compare the strings.
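
In Java 6 terms (java.text.Normalizer) that amounts to something like the following rough sketch; localName and remoteName are placeholders, not names from your code:

String composedLocal  = Normalizer.normalize(localName,  Normalizer.Form.NFC);
String composedRemote = Normalizer.normalize(remoteName, Normalizer.Form.NFC);
boolean sameName = composedLocal.equals(composedRemote);   // compares the same regardless of which form each side started in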

Stephen P
Thanks for this great answer, it put me on the right track. See the amended question for a summary of what I learned, and the specific solution (tl;dr: use java.text.Normalizer).
James Murty
I faced this kind of problem before. It's good to know the general theory behind it. Thanks!
helios