ansaurus

Question

File names containing non-ASCII international language characters

Answer 1

A:

I have been playing around with Unicode and Indian languages for a while now. Here are my views on your questions:

It's easy. You will need two things: enable Unicode (UTF-8/16/32) support in your OS so that you can type those characters, and get Unicode-compatible editors and tools so that your tools understand those characters.

Also, since you are looking at a localised web application, you have to ensure, or at least inform your visitors, that they need a browser which uses the relevant encoding.

Your file extensions need not be internationalised.

Amit 2009-02-26 03:28:57

Answer 2

+1 A:

Refer to this overview of file name limitations on Wikipedia.

You will have to consider where your files will travel, and stay within the most restrictive set of rules.

cdonner 2009-02-26 03:35:04

Answer 3

+1 A:

From my experience in Japan, filenames are typically saved in Japanese with the standard English extension. Apply the same to any other language.

The only problem you will run into is that in an unsupported environment for that character set, people will usually just see a whole bunch of squares with an extension. Obviously this won't be a problem for your target users.

bojo 2009-02-26 03:46:38

Answer 4

+3 A:

Is this functionality expected from Japanese/Chinese speaking web users?

Yes.

Is doing this an easy thing to achieve, or is it fraught with danger?

There are issues. If you are serving files directly, or otherwise have the filename in the URL (e.g. http://www.example.com/files/こんにちは.txt -> http://www.example.com/files/%E3%81%93%E3%82%93%E3%81%AB%E3%81%A1%E3%81%AF.txt, where the name is sent as percent-encoded UTF-8 bytes), you're generally OK.

But if you're serving files with the filename generated by the script, you can have problems. The issue is with the header:

Content-Disposition: attachment;filename="こんにちは.txt"

How do we encode those characters into the filename parameter? Well, it would be nice if we could just send the raw UTF-8 bytes. And that will work in some browsers. But not IE, which uses the system codepage to decode characters from HTTP headers. On Windows, the system codepage might be cp1252 (the Western European codepage, a close relative of Latin-1) for Western users, or cp932 (Shift-JIS) for Japanese users, or something else completely, but it will never be UTF-8, and you can't really guess what it's going to be in advance of sending the header.
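To see why raw UTF-8 in the header goes wrong, here is a minimal Python sketch of what a client does when it decodes those bytes with a legacy codepage. The helper name misdecode is made up for illustration, and errors="replace" is only there so the script keeps running on bytes the codepage does not define; a real browser may handle such bytes differently.

```python
def misdecode(name: str, codepage: str) -> str:
    """Encode a filename as UTF-8, then decode it the way a client
    assuming a legacy system codepage would (hypothetical helper)."""
    raw = name.encode("utf-8")
    # errors="replace" substitutes U+FFFD for bytes the codepage
    # cannot map, instead of raising UnicodeDecodeError.
    return raw.decode(codepage, errors="replace")


filename = "こんにちは.txt"

for codepage in ("cp1252", "cp932"):
    # Each codepage yields a different garbled name for the same bytes.
    print(codepage, repr(misdecode(filename, codepage)))

# A plain ASCII name survives, which is why the problem is easy to miss.
print(repr(misdecode("hello.txt", "cp932")))
```

The same bytes come out as different garbage under each codepage, so there is no single encoding the server could pick that suits every client.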

Tedious aside: what does the standard say should happen? Well, it doesn't really. The HTTP standard, RFC2616, says that bytes in HTTP headers are ISO-8859-1, which wouldn't allow us to use Japanese. It goes on to say that non-Latin-1 characters can be embedded in a header by the rules of RFC2047, but RFC2047 explicitly denies that its encoded-words can fit in a quoted-string. Normally in RFC822-family headers you would use RFC2231 rules to embed Unicode characters in a parameter of a Content-Disposition (RFC2183) header, and RFC2616 does defer to RFC2183 for definition of that header. But HTTP is not actually an RFC822-family protocol and its header syntax is not completely compatible with the 822 family anyway. In summary, the standard is a bloody mess and no-one knows what to do, certainly not the browser manufacturers who pay no attention to it whatsoever. Hell, they can't even get the ‘quoted-string’ format of ‘filename="..."’ right, never mind character encodings.

So if you want to serve a file dynamically with non-ASCII characters in the name, the trick is to avoid sending the ‘filename’ parameter and instead dump the filename you want in a trailing part of the URL.
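As a rough sketch of that trick in Python: build the download link so that the desired filename is the last path segment, percent-encoded as UTF-8. The base URL, the numeric file id, and the function name download_url are all assumptions for illustration; the server-side script would have to accept (and may simply ignore) the trailing segment.

```python
from urllib.parse import quote, unquote

# Hypothetical endpoint of the script that streams the file.
BASE_URL = "http://www.example.com/download"


def download_url(file_id: int, filename: str) -> str:
    """Return a link whose trailing path segment is the filename.

    quote() percent-encodes the UTF-8 bytes of the name; safe="" makes
    sure a "/" inside the name cannot introduce an extra path segment.
    """
    return f"{BASE_URL}/{file_id}/{quote(filename, safe='')}"


url = download_url(42, "こんにちは.txt")
print(url)

# The browser takes the suggested name from the last path segment.
print(unquote(url.rsplit("/", 1)[1]))
```

The response would then carry a bare "Content-Disposition: attachment" with no filename parameter, leaving the browser to fall back on the URL.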

Should file extensions also be international language characters?

In principle yes, file extensions are just part of the filename and can contain any character.

In practice on Windows I know of no application that has ever used a non-ASCII file extension.

One final thing to look out for on systems for East Asian users: you will find them typing weird, non-ASCII versions of Latin characters sometimes. These are known as the full-width and half-width forms, and are designed to let users type Latin characters that line up with the square grid used by their ideographic (Han etc.) characters.

That's all very well in free text, but for fields you expect to parse as Latin text or numbers, receiving an unexpected ‘４２’ integer or ‘．ｔｘｔ’ file extension can trip you up. To convert these ‘compatibility characters’ down to plain Latin, normalise your strings to Unicode Normalization Form KC (NFKC) before doing anything with them.
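In Python, for instance, that normalisation is a single call in the standard library. This is a minimal sketch; the wrapper name to_plain is made up, and whether NFKC's other foldings (ligatures, superscripts and so on) are acceptable for a given field is something to check case by case.

```python
import unicodedata


def to_plain(text: str) -> str:
    """Fold compatibility characters (full-width Latin, half-width
    katakana, etc.) to their ordinary forms using NFKC."""
    return unicodedata.normalize("NFKC", text)


print(to_plain("４２"))          # full-width digits
print(to_plain("．ｔｘｔ"))      # full-width dot and letters
print(to_plain("こんにちは"))    # ordinary hiragana is left alone
```

Normalise at the point where input arrives, before comparing extensions or parsing numbers, so the rest of the code only ever sees one form.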

bobince 2009-02-26 11:29:56
