ansaurus

Question

Java Can't Open a File with Surrogate Unicode Values in the Filename?

Answer 1

+2 A:

If the character encoding of your environment's default locale cannot represent those characters, you cannot open the file.

See:

http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4733494

Edit: Alright. What you need to do is change the system locale, whatever OS you are using.

JCasso 2009-10-09 19:35:00

Is it not possible to do this without changing the system locale? The program I am building will need to run on any locale, and I should be able to input these characters and deal with these files even in a US/English locale.

Bear 2009-10-26 18:22:44

Answer 2

+4 A:

I suspect one of Java or Mac is using CESU-8 instead of proper UTF-8. Java uses “modified UTF-8” (which is a slight variation of CESU-8) for a variety of internal purposes, but I wasn't aware it could use it as a filesystem/defaultCharset. Unfortunately I have neither Mac nor Java here to test with.

“Modified” is a modified way of saying “badly bugged”. Instead of outputting a four-byte UTF-8 sequence for supplementary (non-BMP) characters like 𦿶:

\xF0\xA6\xBF\xB6

it outputs a UTF-8-encoded sequence for each of the surrogates:

\xED\xA1\x9B\xED\xBF\xB6

This isn't a valid UTF-8 sequence, but a lot of decoders will allow it anyway. The problem is that if you round-trip it through a real UTF-8 decoder and encoder you end up with a different byte string: the four-byte one above. Try to access the file under that name and boom! fail.
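To make the difference concrete, here is a minimal Python 3 sketch (not Java's actual encoder) that produces both byte sequences for U+26FF6 and shows the round-trip mismatch. The function names are made up for illustration, and the sketch ignores modified UTF-8's special two-byte encoding of NUL.

```python
# Python 3 sketch: real UTF-8 versus the surrogate-wise (CESU-8 style) encoding.
# to_cesu8 and cesu8_to_text are hypothetical helper names, not library functions.

def to_cesu8(text):
    """Encode every UTF-16 code unit on its own, the way CESU-8 does."""
    utf16 = text.encode("utf-16-be")
    units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]
    # "surrogatepass" lets the UTF-8 codec emit bytes for lone surrogates.
    return b"".join(chr(unit).encode("utf-8", "surrogatepass") for unit in units)

def cesu8_to_text(data):
    """Decode CESU-8 style bytes and re-join surrogate pairs into one character."""
    halves = data.decode("utf-8", "surrogatepass")
    return halves.encode("utf-16-be", "surrogatepass").decode("utf-16-be")

ch = "\U00026FF6"
proper = ch.encode("utf-8")        # four bytes
broken = to_cesu8(ch)              # six bytes, one triple per surrogate
round_tripped = cesu8_to_text(broken).encode("utf-8")

print(proper, broken, round_tripped == proper)
```

The last comparison is the whole problem in miniature: the six-byte name and the four-byte name decode to the same text but are different filenames at the byte level.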

So first let's just check how filenames are actually stored under your current filesystem, using a platform that uses bytes for filenames such as Python 2.x:

$ python
Python 2.x.something (blah blah)
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.listdir('.')

On my filesystem (Linux, ext4, UTF-8), the filename “草𦿶鷗外.gif” comes out as:

['\xe8\x8d\x89\xf0\xa6\xbf\xb6\xe9\xb7\x97\xe5\xa4\x96.gif']

which is what you want. If that's what you get, it's probably Java doing it wrong. If you get the longer six-byte-character version:

['\xe8\x8d\x89\xed\xa1\x9b\xed\xbf\xb6\xe9\xb7\x97\xe5\xa4\x96.gif']

it's probably OS X doing it wrong... does it always store filenames like this? (Or did the files come from somewhere else originally?) What happens if you rename the file to the ‘proper’ version:

os.rename('\xe8\x8d\x89\xed\xa1\x9b\xed\xbf\xb6\xe9\xb7\x97\xe5\xa4\x96.gif', '\xe8\x8d\x89\xf0\xa6\xbf\xb6\xe9\xb7\x97\xe5\xa4\x96.gif')
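On Python 3, os.listdir only returns byte names when it is given a bytes path, so the check looks slightly different there. A rough sketch, assuming Python 3.4 or later; the two function names are hypothetical, and it only reports candidate renames rather than performing them:

```python
import os

def cesu8_name_to_utf8(name):
    """Convert a byte filename with UTF-8-encoded surrogates to proper UTF-8.

    Names that are already proper UTF-8 come back unchanged.
    Raises UnicodeDecodeError for invalid bytes or unpaired surrogates.
    """
    halves = name.decode("utf-8", "surrogatepass")
    # Going through UTF-16 joins each surrogate pair into one code point.
    joined = halves.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
    return joined.encode("utf-8")

def find_cesu8_names(directory=b"."):
    """Return (old, new) byte-name pairs for entries whose name would change."""
    pairs = []
    for name in os.listdir(directory):  # bytes in, bytes out
        try:
            fixed = cesu8_name_to_utf8(name)
        except UnicodeDecodeError:
            continue  # not something this sketch knows how to repair
        if fixed != name:
            pairs.append((name, fixed))
    return pairs
```

An empty list from find_cesu8_names(b".") means the names on disk are already proper UTF-8, which points back at Java; otherwise each pair can be passed to os.rename once you have checked it by eye.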

bobince 2009-10-09 20:31:24

Not really a bug, as it's part of the spec (even if it is often confusing).

finnw 2009-10-10 00:20:56

The result of the python commands was the proper filename you listed first, so it must be Java not playing nice.

Bear 2009-10-26 18:19:54

Oh, that's unfortunate. Even if you detected the broken-CESU-8 situation, I can't think of any way to work around it and get a byte-oriented filename interface. :-( You might have to explicitly disallow the surrogates until such time as Sun fix it. How poor.

bobince 2009-10-26 18:37:01
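If you do end up rejecting such names, the check itself is small. A sketch in Python 3 to match the examples above (the function name is hypothetical); in Java the equivalent test would presumably go through Character.isSupplementaryCodePoint on each code point.

```python
def has_supplementary(name):
    """True if the name contains a non-BMP character or a stray surrogate."""
    return any(
        ord(ch) > 0xFFFF or 0xD800 <= ord(ch) <= 0xDFFF
        for ch in name
    )

# Example: reject the name before handing it to the file API.
candidate = "\u8349\U00026FF6\u9dd7\u5916.gif"
if has_supplementary(candidate):
    print("filename contains supplementary characters; refusing it")
```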

Answer 3

+1 A:

This turned out to be a problem with the Mac JVM (tested on 1.5 and 1.6). Filenames containing supplementary characters / surrogate pairs cannot be accessed with the Java File class. I ended up writing a JNI library with Carbon calls for the Mac version of the project (ick). I suspect the CESU-8 issue bobince mentioned, as the JNI call to get UTF-8 characters returned a CESU-8 string. Doesn't look like it's something you can really get around.

Bear 2009-11-25 21:05:17
