I'm struggling with a strange file name encoding issue when listing directory contents in Java 6 on both OS X and Linux: the File.listFiles() and related methods seem to return file names in a different encoding than the rest of the system.
Note that it is not merely the display of these file names that is causing me problems. I'm mainly interested in doing a comparison of file names with a remote file storage system, so I care more about the content of the name strings than the character encoding used to print output.
Here is a program to demonstrate. It creates a file with a Unicode name, then prints out URL-encoded versions of the file name obtained from the directly-created File and from the same file when listed under its parent directory (you should run this code in an empty directory). The results show the different encoding returned by the File.listFiles() method.
String fileName = "Trîcky Nåme";
File file = new File(fileName);
file.createNewFile();
System.out.println("File name: " + URLEncoder.encode(file.getName(), "UTF-8"));
// Get parent (current) dir and list file contents
File parentDir = file.getAbsoluteFile().getParentFile();
File[] children = parentDir.listFiles();
for (File child : children) {
    System.out.println("Listed name: " + URLEncoder.encode(child.getName(), "UTF-8"));
}
Here's what I get when I run this test code on my systems. Note the %CC versus %C3 character representations.
OS X Snow Leopard:
File name: Tri%CC%82cky+Na%CC%8Ame
Listed name: Tr%C3%AEcky+N%C3%A5me
$ java -version
java version "1.6.0_20"
Java(TM) SE Runtime Environment (build 1.6.0_20-b02-279-10M3065)
Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01-279, mixed mode)
KUbuntu Linux (running in a VM on same OS X system):
File name: Tri%CC%82cky+Na%CC%8Ame
Listed name: Tr%C3%AEcky+N%C3%A5me
$ java -version
java version "1.6.0_18"
OpenJDK Runtime Environment (IcedTea6 1.8.1) (6b18-1.8.1-0ubuntu1)
OpenJDK Client VM (build 16.0-b13, mixed mode, sharing)
I have tried various hacks to get the strings to agree, including setting the file.encoding system property and various LC_CTYPE and LANG environment variables. Nothing helps, nor do I want to resort to such hacks.
Unlike this (somewhat related?) question, I am able to read data from the listed files despite the odd names: http://stackoverflow.com/questions/2423781/chinese-encoding-issue-while-listing-files
I'm out of ideas, and after many hours of fruitless debugging and Googling I'm about ready for some enlightenment.
Solution
Thanks to Stephen P for putting me on the right track.
The fix first, for the impatient. If you are compiling with Java 6 you can use the java.text.Normalizer class to normalize strings into a common form of your choice, e.g.
// Normalize to "Normalization Form Canonical Decomposition" (NFD)
protected String normalizeUnicode(String str) {
    Normalizer.Form form = Normalizer.Form.NFD;
    if (!Normalizer.isNormalized(str, form)) {
        return Normalizer.normalize(str, form);
    }
    return str;
}
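For the comparison use case from the question, a minimal sketch of how normalization makes two names comparable. The class NameCompare and the method sameName are hypothetical names for illustration, and the literals use \u escapes so that the form of each string does not depend on the source file's encoding.

```java
import java.text.Normalizer;

class NameCompare {
    // Hypothetical helper: true when both names are canonically equivalent,
    // decided by normalizing each side to NFD before comparing.
    static boolean sameName(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFD);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFD);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        String composed = "Tr\u00EEcky N\u00E5me";      // NFC: precomposed characters
        String decomposed = "Tri\u0302cky Na\u030Ame";  // NFD: letter + combining mark
        System.out.println(composed.equals(decomposed));    // false
        System.out.println(sameName(composed, decomposed)); // true
    }
}
```

Which form you normalize to should not matter for equality, as long as both sides use the same one.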
Since java.text.Normalizer is only available in Java 6 and later, if you need to compile with Java 5 you might have to resort to the sun.text.Normalizer implementation and something like this reflection-based hack. See also http://stackoverflow.com/questions/1279910
This alone is enough for me to decide I won't support compilation of my project with Java 5 :|
Here are other interesting things I learned in this sordid adventure.
The confusion is caused by the file names being in one of two normalization forms which cannot be directly compared: Normalization Form Canonical Decomposition (NFD) or Normalization Form Canonical Composition (NFC). The former tends to have ASCII letters followed by "modifiers" (combining marks) that add accents etc., while the latter uses single precomposed characters with no leading ASCII letter. Read the wiki page Stephen P references for a better explanation.
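To make the difference concrete, here is a small sketch that dumps the UTF-16 code units of "î" in each form. The class FormDump and the method codeUnits are hypothetical names introduced only for this illustration.

```java
import java.text.Normalizer;

class FormDump {
    // Hypothetical helper: renders each UTF-16 code unit of s as U+XXXX.
    static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String nfc = "\u00EE"; // i with circumflex as one precomposed character
        String nfd = Normalizer.normalize(nfc, Normalizer.Form.NFD);
        System.out.println(codeUnits(nfc)); // U+00EE
        System.out.println(codeUnits(nfd)); // U+0069 U+0302
    }
}
```

U+0302 is the combining circumflex, which encodes to the bytes CC 82 in UTF-8; that is where the %CC%82 in the URL-encoded output comes from, while U+00EE encodes to C3 AE.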
Unicode string literals like the one contained in the example code (and those received via HTTP in my real app) are in the NFD form, while file names returned by the File.listFiles() method are NFC. The following mini-example demonstrates the differences:
String name = "Trîcky Nåme";
System.out.println("Original name: " + URLEncoder.encode(name, "UTF-8"));
System.out.println("NFC Normalized name: " + URLEncoder.encode(
    Normalizer.normalize(name, Normalizer.Form.NFC), "UTF-8"));
System.out.println("NFD Normalized name: " + URLEncoder.encode(
    Normalizer.normalize(name, Normalizer.Form.NFD), "UTF-8"));
Output:
Original name: Tri%CC%82cky+Na%CC%8Ame
NFC Normalized name: Tr%C3%AEcky+N%C3%A5me
NFD Normalized name: Tri%CC%82cky+Na%CC%8Ame
If you construct a File object with a string name, the File.getName() method will return the name in whatever form you gave it originally. However, if you call File methods that discover names on their own, they seem to return names in NFC form. This is potentially a nasty gotcha. It certainly got me.
According to the quote below from Apple's documentation, file names are stored in decomposed (NFD) form on the HFS Plus file system:
When working within Mac OS you will find yourself using a mixture of precomposed and decomposed Unicode. For example, HFS Plus converts all file names to decomposed Unicode, while Macintosh keyboards generally produce precomposed Unicode.
So the File.listFiles() method helpfully (?) converts file names to the (pre)composed (NFC) form.
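Given that listed names and names from other sources may arrive in different forms, one way to match them is to index the listing by a single normalized form. This is only a sketch: ListingIndex, indexByNfc and findListed are hypothetical names, and the string array stands in for the result of File.list() so the example does not touch the file system.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

class ListingIndex {
    // Hypothetical helper: maps the NFC form of each listed name to the name as listed.
    static Map<String, String> indexByNfc(String[] listedNames) {
        Map<String, String> index = new HashMap<String, String>();
        for (String name : listedNames) {
            index.put(Normalizer.normalize(name, Normalizer.Form.NFC), name);
        }
        return index;
    }

    // Looks up a name from elsewhere (e.g. a remote store) regardless of its form.
    // Returns the name as listed, or null when there is no match.
    static String findListed(Map<String, String> index, String wanted) {
        return index.get(Normalizer.normalize(wanted, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        // Stand-in for new File(".").list(); an assumption for illustration.
        String[] listed = { "Tr\u00EEcky N\u00E5me", "plain.txt" };
        Map<String, String> index = indexByNfc(listed);
        // The remote name arrives decomposed (NFD) but still matches.
        System.out.println(findListed(index, "Tri\u0302cky Na\u030Ame") != null); // true
    }
}
```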