I'm struggling with a strange file name encoding issue when listing directory contents in Java 6 on both OS X and Linux: the File.listFiles() and related methods seem to return file names in a different encoding than the rest of the system.

Note that it is not merely the display of these file names that is causing me problems. I'm mainly interested in doing a comparison of file names with a remote file storage system, so I care more about the content of the name strings than the character encoding used to print output.

Here is a program to demonstrate. It creates a file with a Unicode name, then prints URL-encoded versions of that name as obtained from the directly-created File object and as returned when the parent directory is listed (you should run this code in an empty directory). The results show that File.listFiles() returns the name in a different encoding.

import java.io.File;
import java.net.URLEncoder;

public class FileNameEncodingTest {
    public static void main(String[] args) throws Exception {
        String fileName = "Trîcky Nåme";
        File file = new File(fileName);
        file.createNewFile();
        System.out.println("File name: " + URLEncoder.encode(file.getName(), "UTF-8"));

        // Get the parent (current) dir and list its contents
        File parentDir = file.getAbsoluteFile().getParentFile();
        File[] children = parentDir.listFiles();
        for (File child : children) {
            System.out.println("Listed name: " + URLEncoder.encode(child.getName(), "UTF-8"));
        }
    }
}

Here's what I get when I run this test code on my systems. Note the %CC versus %C3 character representations.

OS X Snow Leopard:

File name: Tri%CC%82cky+Na%CC%8Ame
Listed name: Tr%C3%AEcky+N%C3%A5me

$ java -version
java version "1.6.0_20"
Java(TM) SE Runtime Environment (build 1.6.0_20-b02-279-10M3065)
Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01-279, mixed mode)

KUbuntu Linux (running in a VM on same OS X system):

File name: Tri%CC%82cky+Na%CC%8Ame
Listed name: Tr%C3%AEcky+N%C3%A5me

$ java -version
java version "1.6.0_18"
OpenJDK Runtime Environment (IcedTea6 1.8.1) (6b18-1.8.1-0ubuntu1)
OpenJDK Client VM (build 16.0-b13, mixed mode, sharing)

I have tried various hacks to get the strings to agree, including setting the file.encoding system property and various LC_CTYPE and LANG environment variables. Nothing helps, nor do I want to resort to such hacks.

Unlike this (somewhat related?) question, I am able to read data from the listed files despite the odd names: http://stackoverflow.com/questions/2423781/chinese-encoding-issue-while-listing-files

I'm out of ideas, and after many hours of fruitless debugging and Googling I'm about ready for some enlightenment.

Solution

Thanks to Stephen P for putting me on the right track.

The fix first, for the impatient. If you are compiling with Java 6 you can use the java.text.Normalizer class to normalize strings into a common form of your choice, e.g.

// Normalize to "Normalization Form Canonical Decomposition" (NFD)
protected String normalizeUnicode(String str) {
    Normalizer.Form form = Normalizer.Form.NFD;
    if (!Normalizer.isNormalized(str, form)) {
        return Normalizer.normalize(str, form);
    }
    return str;
}
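
For example, once both sides are brought to the same form the comparison behaves as expected. A quick self-contained sketch (the two literals below are just stand-ins for a locally-listed name and a name received from the remote store):

import java.text.Normalizer;

public class NameCompare {
    public static void main(String[] args) {
        String listedName = "Tr\u00EEcky N\u00E5me";    // precomposed (NFC), as listFiles() returned it on my systems
        String remoteName = "Tri\u0302cky Na\u030Ame";  // decomposed (NFD), as received over HTTP
        Normalizer.Form form = Normalizer.Form.NFD;
        boolean same = Normalizer.normalize(listedName, form)
                .equals(Normalizer.normalize(remoteName, form));
        System.out.println(same);   // prints true once both are in the same form
    }
}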

Since java.text.Normalizer is only available in Java 6 and later, if you need to compile with Java 5 you might have to resort to the sun.text.Normalizer implementation via something like a reflection-based hack; see also http://stackoverflow.com/questions/1279910

This alone is enough for me to decide I won't support compilation of my project with Java 5 :|

Here are other interesting things I learned in this sordid adventure.

  • The confusion is caused by the file names being in one of two normalization forms which cannot be directly compared: Normalization Form Canonical Decomposition (NFD) or Normalization Form Canonical Composition (NFC). The former tends to have plain ASCII letters followed by combining "modifier" characters that add accents etc., while the latter uses only the precomposed extended characters with no leading ASCII character. Read the wiki page Stephen P references for a better explanation; there is also a small code-point dump after this list that makes the difference concrete.

  • Unicode string literals like the one contained in the example code (and those received via HTTP in my real app) are in the NFD form, while file names returned by the File.listFiles() method are NFC. The following mini-example demonstrates the differences:

    String name = "Trîcky Nåme";
    System.out.println("Original name: " + URLEncoder.encode(name, "UTF-8"));
    System.out.println("NFC Normalized name: " + URLEncoder.encode(
        Normalizer.normalize(name, Normalizer.Form.NFC), "UTF-8"));
    System.out.println("NFD Normalized name: " + URLEncoder.encode(
        Normalizer.normalize(name, Normalizer.Form.NFD), "UTF-8"));
    

    Output:

    Original name: Tri%CC%82cky+Na%CC%8Ame
    NFC Normalized name: Tr%C3%AEcky+N%C3%A5me
    NFD Normalized name: Tri%CC%82cky+Na%CC%8Ame
    
  • If you construct a File object with a string name, the File.getName() method will return the name in whatever form you gave it originally. However, if you call File methods that discover names on their own, they seem to return names in NFC form. This is potentially a nasty gotcha; it certainly got me.

  • According to the quote below from Apple's documentation, file names are stored in decomposed (NFD) form on the HFS Plus file system:

    When working within Mac OS you will find yourself using a mixture of precomposed and decomposed Unicode. For example, HFS Plus converts all file names to decomposed Unicode, while Macintosh keyboards generally produce precomposed Unicode.

    So the File.listFiles() method helpfully (?) converts file names to the (pre)composed (NFC) form.
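
To see the difference at the code-point level, here is a quick throwaway sketch (the class and helper names are mine, purely for illustration):

import java.text.Normalizer;

public class FormInspector {
    // Render a string as a list of U+XXXX code points
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i += Character.charCount(s.codePointAt(i))) {
            sb.append(String.format("U+%04X ", s.codePointAt(i)));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String nfc = Normalizer.normalize("\u00EE", Normalizer.Form.NFC);
        String nfd = Normalizer.normalize("\u00EE", Normalizer.Form.NFD);
        System.out.println("NFC: " + codePoints(nfc));   // NFC: U+00EE
        System.out.println("NFD: " + codePoints(nfd));   // NFD: U+0069 U+0302
    }
}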

A: 

I've seen something similar before. People who uploaded files from their Mac to a webapp used filenames containing é.

a) In OS X that char is a normal e + a "sign for ´ applied to the previous char"

b) In Windows it's a special char: é

Both are valid Unicode. So... I understand you pass the (b) form to the File constructor and at some point Mac OS converts it to the (a) form. Maybe if you search for this double-representation issue you can find a way to handle both situations successfully.

Hope it helps!

helios
Well, in fact, the opposite is happening: you type the name in the Java file with your Mac keyboard as the (a) option, and the system converts it [at file creation time] to the (b) option.
helios
Thanks, that's exactly the right track. I added a discussion to my question that expands a little on the different Unicode forms and when you get each kind from `File` methods.
James Murty
A: 

On a Unix file system, a file name really is just a null-terminated byte[], so the Java runtime has to convert from java.lang.String to byte[] during the createNewFile() operation. That char-to-byte conversion is governed by the locale. I tested with LC_ALL set to en_US.UTF-8 and en_US.ISO-8859-1 and got coherent results. This is with Sun (...Oracle) Java 1.6.0_20. However, for LC_ALL=en_US.POSIX the result is:

File name:   Tr%C3%AEcky+N%C3%A5me
Listed name: Tr%3Fcky+N%3Fme

0x3F is a question mark, which tells me the conversion failed for the non-ASCII characters. Then again, that is exactly what you would expect with that locale.

But your two strings differ because of the equivalence between the \u00EE character (C3 AE in UTF-8) and the sequence i+\u0302 (69 CC 82 in UTF-8). \u0302 is a combining diacritical mark (combining circumflex accent). Some sort of normalization occurred during file creation; I'm not sure whether it's done by the Java runtime or the OS.

NOTE: It took me some time to figure this out, since the code snippet you posted does not contain a combining diacritical mark but the equivalent precomposed character î (i.e. \u00EE). You should have embedded the Unicode escape sequence in the string literal (but it's easy to say that afterward...).
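
For instance, with explicit escapes the intended form is unambiguous (these two literals are just an illustration, not part of the fix):

String decomposed  = "Tri\u0302cky Na\u030Ame";  // i + U+0302 combining circumflex, a + U+030A combining ring
String precomposed = "Tr\u00EEcky N\u00E5me";    // precomposed U+00EE and U+00E5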

gawi
Your point about the equivalence of two different Unicode forms is exactly right. Interestingly, if I had included the explicit \u code for a combined/composed diacritical in the example it would have obscured the differences I was seeing in my real app. A case where doing it wrong worked out well.
James Murty
A: 

I suspect that you just have to tell javac which encoding to use when compiling the .java file containing the special characters, since you've hardcoded them in the source file. Otherwise the platform default encoding will be used, which may not be UTF-8 at all.

You can use the javac option -encoding for this.

javac -encoding UTF-8 com/example/Foo.java

This way the resulting .class file will end up containing the correct characters and you will be able to create and list the correct filename as well.

BalusC
Although the Unicode file name is hard-coded into the sample program, my real program sources data from the file system or a web service. No hard-coded strings are involved. I did try the `-encoding` option with my example code but it didn't make a difference there either.
James Murty
+3  A: 

Using Unicode, there is more than one valid way to represent the same letter. The characters you're using in your Tricky Name are a "latin small letter i with circumflex" and a "latin small letter a with ring above".

You say "Note the %CC versus %C3 character representations", but looking closer what you see are the sequences

i\uCC82 vs. \uC3AE
a\uCC8A vs. \uC3A5

That is, the first is letter i followed by 0xCC82 the "combining circumflex accent" character while the second is "latin small letter i with circumflex". Similarly for the other pair, the first is the letter a followed by 0xCC8A the "combining ring above" character and the second is "latin small letter a with ring above". Both of these are valid UTF-8 encodings of valid Unicode character strings, but one is in "composed" and the other in "decomposed" format.

OS X HFS Plus volumes store strings (e.g. filenames) as "fully decomposed". On other Unix file systems, what gets stored depends on how the particular filesystem driver chooses to store it, so you can't make any blanket statements across different types of filesystems.

See the Wikipedia article on Unicode Equivalence for general discussion of composed vs decomposed forms, which mentions OS X specifically.

See Apple's Tech Q&A QA1235 (in Objective-C unfortunately) for information on converting forms.

A recent email thread on Apple's java-dev mailing list could be of some help to you.

Basically, you need to normalize the decomposed form into a composed form before you can compare the strings.
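
In Java 6 terms (java.text.Normalizer) that amounts to something like the following rough sketch; localName and remoteName are placeholders, not names from your code:

String composedLocal  = Normalizer.normalize(localName,  Normalizer.Form.NFC);
String composedRemote = Normalizer.normalize(remoteName, Normalizer.Form.NFC);
boolean sameName = composedLocal.equals(composedRemote);   // compares the same regardless of which form each side started in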

Stephen P
Thanks for this great answer, it put me on the right track. See the amended question for a summary of what I learned, and the specific solution (tl;dr: use java.text.Normalizer).
James Murty
I faced this kind of problem before. It's good to know the general theory behind it. Thanks!
helios