I can't use mkdir to create folders with UTF-8 characters.

<?php

$dir_name = "Depósito";
mkdir($dir_name);

?>

But when I browse this folder in Windows Explorer, the folder name looks like this:

DepÃ³sito

What should I do?

+3  A: 

The problem is that Windows uses UTF-16 for filesystem strings, whereas Linux and others use different character sets, often UTF-8. You provided a UTF-8 string, but Windows interprets it as some other 8-bit encoding, maybe Latin-1, so the non-ASCII character, which takes 2 bytes in UTF-8, is handled as if it were 2 separate characters.
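To see the effect described above without touching the filesystem, here is a minimal sketch. It assumes the mbstring extension is loaded and that the 8-bit encoding in question is ISO-8859-1 (Latin-1), which is only a guess in the answer. The variable names are illustrative.

```php
<?php
// "Depósito" written as explicit UTF-8 bytes, so the example does not
// depend on how this file itself is saved.
$utf8 = "Dep\xC3\xB3sito";

// Treat every byte as one ISO-8859-1 character, then re-express the
// result as UTF-8 so it can be printed on a UTF-8 terminal.
$misread = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo strlen($utf8), " bytes, ", mb_strlen($utf8, 'UTF-8'), " characters\n";
// 9 bytes, 8 characters
echo $misread, "\n";
// DepÃ³sito
```

The two bytes of "ó" (0xC3 0xB3) come out as the two Latin-1 characters "Ã" and "³".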

A common solution is to keep your source code 100% ASCII and to store such strings somewhere else. PHP6 was planned to introduce Unicode functions etc., but it was never released, so you cannot rely on those.

Lars D
Very clear answer, Thanks!
Acacio Nerull
I haven't tried it, but can't you use mb_convert_encoding to convert the string to UTF-16?
R. Bemrose
+4  A: 

Just urlencode the string you want to use as a filename. All characters returned by urlencode are valid in filenames (NTFS/HFS/UNIX), and you can urldecode the filenames to get back UTF-8 (or whatever encoding they were in).
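A minimal sketch of that round trip, assuming a writable temp directory. The helper names encode_dir_name and decode_dir_name and the demo directory name are made up for illustration; only urlencode, urldecode, mkdir, scandir and rmdir are PHP built-ins.

```php
<?php
// Hypothetical helpers wrapping the built-in urlencode/urldecode.
function encode_dir_name(string $utf8Name): string
{
    return urlencode($utf8Name);
}

function decode_dir_name(string $storedName): string
{
    return urldecode($storedName);
}

$name = "Dep\xC3\xB3sito"; // "Depósito" as explicit UTF-8 bytes

// Work inside a throwaway directory (assumed writable).
$base = sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'urlencode_demo_' . uniqid();
mkdir($base);

// The name on disk is plain ASCII: Dep%C3%B3sito
$stored = encode_dir_name($name);
mkdir($base . DIRECTORY_SEPARATOR . $stored);

// Read the directory back and recover the original UTF-8 names.
$listed  = array_values(array_diff(scandir($base), ['.', '..']));
$decoded = array_map('decode_dir_name', $listed);

// Clean up the demo directories.
rmdir($base . DIRECTORY_SEPARATOR . $stored);
rmdir($base);
```

The folder shows up in Explorer as Dep%C3%B3sito, which is not pretty but is unambiguous on every filesystem.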

Caveats (all apply to the solutions below as well):

  • After url-encoding, the filename must be less than 255 characters (probably bytes).
  • Unicode has multiple representations for many characters (a precomposed character versus a base letter plus combining characters). If you don't normalize your UTF-8, you may have trouble searching with glob or reopening an individual file.
  • You can't rely on scandir or similar functions for alpha-sorting. You must urldecode the filenames then use a sorting algorithm aware of UTF-8 (and collations).
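The three caveats above can be handled roughly as follows. This is a sketch under assumptions: the function names are hypothetical, the 255 bound is the figure quoted above rather than a checked limit for any particular filesystem, the 'en_US' collation locale is arbitrary, and Normalizer and Collator are only available when the intl extension is installed (the code falls back to doing nothing special when it is not).

```php
<?php
// Caveat 1: check the encoded length in bytes against the quoted bound.
function fits_name_limit(string $encodedName, int $limit = 255): bool
{
    return strlen($encodedName) < $limit;
}

// Caveat 2: normalize to NFC before encoding, if intl is available.
function normalize_name(string $utf8Name): string
{
    if (class_exists('Normalizer')) {
        $normalized = Normalizer::normalize($utf8Name, Normalizer::FORM_C);
        if ($normalized !== false) {
            return $normalized;
        }
    }
    return $utf8Name; // fallback: left as-is, NOT normalized
}

// Caveat 3: decode first, then sort with a collation-aware comparer.
function sort_decoded(array $storedNames, string $locale = 'en_US'): array
{
    $names = array_map('urldecode', $storedNames);
    if (class_exists('Collator')) {
        $collator = new Collator($locale); // assumed locale
        $collator->sort($names);
    } else {
        sort($names, SORT_STRING); // byte order only, not collation aware
    }
    return $names;
}

$stored = urlencode(normalize_name("Dep\xC3\xB3sito"));
var_dump(fits_name_limit($stored));
```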

Worse Solutions

The following are less attractive solutions, more complicated and with more caveats.

On Windows, the PHP filesystem wrapper expects and returns ISO-8859-1 strings for file/directory names. This gives you two choices:

  1. Use UTF-8 freely in your filenames, but understand that non-ASCII characters will appear incorrect outside PHP. A non-ASCII UTF-8 char will be stored as multiple single ISO-8859-1 characters. E.g. ó will appear as Ã³ in Windows Explorer.

  2. Limit your file/directory names to characters representable in ISO-8859-1. In practice, you'll pass your UTF-8 strings through utf8_decode before using them in filesystem functions, and pass the entries scandir gives you through utf8_encode to get the original filenames in UTF-8.
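A sketch of option 2. The answer names utf8_decode and utf8_encode; those are deprecated as of PHP 8.2, so this example swaps in mb_convert_encoding (requires mbstring), which does the same conversion for ISO-8859-1. The helper names and the $fsEncoding parameter are hypothetical, and whether ISO-8859-1 is really what your Windows build of PHP expects is an assumption to verify.

```php
<?php
// Convert a UTF-8 name to the encoding the filesystem wrapper is
// assumed to expect (ISO-8859-1 per the answer).
function to_fs_name(string $utf8Name, string $fsEncoding = 'ISO-8859-1'): string
{
    return mb_convert_encoding($utf8Name, $fsEncoding, 'UTF-8');
}

// Convert a name returned by scandir() and friends back to UTF-8.
function from_fs_name(string $fsName, string $fsEncoding = 'ISO-8859-1'): string
{
    return mb_convert_encoding($fsName, 'UTF-8', $fsEncoding);
}

// True only if every character survives the round trip, i.e. the
// name is representable in the target encoding.
function is_representable(string $utf8Name, string $fsEncoding = 'ISO-8859-1'): bool
{
    return from_fs_name(to_fs_name($utf8Name, $fsEncoding), $fsEncoding) === $utf8Name;
}

$name   = "Dep\xC3\xB3sito";   // "Depósito" in UTF-8
$fsName = to_fs_name($name);   // "ó" becomes the single byte 0xF3

var_dump(is_representable($name));          // bool(true)
var_dump(is_representable("\xE6\x97\xA5")); // bool(false), a CJK character
```

You would pass $fsName to mkdir and run scandir entries through from_fs_name. Checking is_representable first avoids silently creating a folder with "?" in its name.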

Caveats galore!

  • If any byte passed to a filesystem function matches an invalid Windows filesystem character in ISO-8859-1, you're out of luck.
  • Windows may use an encoding other than ISO-8859-1 in non-English locales. I'd guess it will usually be one of ISO-8859-#, but this means you'll need to use mb_convert_encoding instead of utf8_decode.
  • PHP6 with unicode_semantics = On may change everything...

This nightmare is why you should probably just transliterate to create filenames.
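A minimal sketch of transliteration. The function name and the tiny hand-written map are purely illustrative and assume this file is saved as UTF-8. In practice you would more likely use the intl Transliterator or iconv with //TRANSLIT, whose output depends on the installed libraries and locale.

```php
<?php
// Hypothetical helper: map a few accented Latin letters to ASCII and
// replace anything else outside printable ASCII with an underscore.
function ascii_dir_name(string $utf8Name): string
{
    // Illustrative map only; it covers a handful of letters.
    $map = [
        'á' => 'a', 'à' => 'a', 'â' => 'a', 'ã' => 'a',
        'é' => 'e', 'ê' => 'e', 'í' => 'i',
        'ó' => 'o', 'ô' => 'o', 'õ' => 'o',
        'ú' => 'u', 'ç' => 'c',
        'Á' => 'A', 'É' => 'E', 'Í' => 'I', 'Ó' => 'O', 'Ú' => 'U', 'Ç' => 'C',
    ];
    $ascii = strtr($utf8Name, $map);

    // Collapse each run of remaining non-ASCII or control bytes into "_".
    return preg_replace('/[^\x20-\x7E]+/', '_', $ascii);
}

echo ascii_dir_name("Dep\xC3\xB3sito"), "\n"; // Deposito
```

The conversion is lossy, so keep the original UTF-8 name elsewhere (a database column, for example) if you need to display it later.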

mrclay
ISO-8859-1 is no more useful on Windows than ISO-8859-2 or ISO-8859-3. If you want to be safe, go with 7-bit ASCII.
Lars D
+1  A: 

It is possible to interact with the filesystem on Windows using a combination of 8.3 ShortPath names and a COM Scripting.FileSystemObject:

http://github.com/nicolas-grekas/Patchwork/blob/lab/windows/class/WIN.php

It is not bulletproof, since, for example, ShortPath support can be disabled on NTFS, but it should work quite well, at least for experimenting.
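The idea can be sketched roughly as follows. This is not the code from the linked class, just an illustration under assumptions: it needs Windows with the com_dotnet extension enabled, the function name is hypothetical, and it returns null everywhere else. The returned short path is only pure ASCII when 8.3 names are enabled on the volume.

```php
<?php
// Hypothetical helper: create a folder with a UTF-8 name through COM
// and return its 8.3 short path, which PHP's own filesystem functions
// can then use. Returns null when COM is not available.
function create_unicode_dir(string $parentDir, string $utf8Name): ?string
{
    if (PHP_OS_FAMILY !== 'Windows' || !class_exists('COM')) {
        return null;
    }

    // CP_UTF8 tells the COM layer that PHP strings are UTF-8.
    $fso    = new COM('Scripting.FileSystemObject', null, CP_UTF8);
    $folder = $fso->CreateFolder($parentDir . '\\' . $utf8Name);

    return (string) $folder->ShortPath;
}

$short = create_unicode_dir(sys_get_temp_dir(), "Dep\xC3\xB3sito");
var_dump($short);
```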

Nicolas Grekas