I would like to access a PHP file whose name has UTF-8 characters in it.

The file does not have a BOM in it. It just contains an echo statement that displays a few Unicode characters.

Accessing the PHP page from the browser (Firefox 3.0.8, IE7) results in HTTP error 500.

There are two entries in the Apache log (the file is /க.php; the letter க is encoded in UTF-8 as the three bytes \xe0\xae\x95, which is how it appears in the log below):

[Sat Apr 04 09:30:25 2009] [error] [client 127.0.0.1] PHP Warning: Unknown: failed to open stream: No such file or directory in Unknown on line 0

[Sat Apr 04 09:30:25 2009] [error] [client 127.0.0.1] PHP Fatal error: Unknown: Failed opening required 'D:/va/ROOT/\xe0\xae\x95.php' (include_path='.;C:\php5\pear') in Unknown on line 0

The same page works when file and dir names are in English. In the same setup, there is no problem using SSI for these pages.

EDIT

Removed info on url rewriting since it does not seem to be a factor.

With mod_rewrite removed, the PHP file still does not work. It works if the file is renamed to an ASCII-only name. However, .shtml files work even with UTF-8 characters in the file and/or path name.

+1  A: 

Just because the character set is UTF-8 doesn't mean it supports all the higher characters of Unicode.

Unicode support is one of the major additions coming in PHP 6, and PHP 5 is notorious for lacking Unicode support.

If your PHP script is generating the link, it may be a different issue than if Apache is interpreting the URL directly and redirecting it.

Fire Crow
+4  A: 
  • I know for a fact PHP itself can work with Unicode URLs, because I have tried using Unicode page names in MediaWiki (PHP-based, also runs Wikipedia) and it does work, for example with URLs such as /index.php/Page_name©. So PHP can handle it. But there may be a problem with Apache finding a file where the source file has a UTF-8 name.

  • The php.ini setting for character encoding should not affect this; it is the job of the web server to find a specific resource and then call PHP once it has determined that the resource is a PHP file. This means that the web server, and the underlying file system itself, have to be able to deal with UTF-8 filenames.

  • Does it work without the mod_rewrite rule? That is, if you disable the rewrite engine with RewriteEngine off and then request va.in/utf_dir/utf_file.php? If so, then it may be a mod_rewrite config issue or a problem with the rule.

  • Unicode in URLs may not be properly supported in some browsers when you just type an address in, such as older browsers. Older browsers may skip the UTF-8 encoding step. This should not prevent it from working if you are following a link on a page, where that page is UTF-8 encoded, though.

thomasrutter
+1  A: 

I came across the same problem, did some research, and concluded the following. This is for PHP 5 on Windows; it is probably true on other platforms, but I haven't checked.

  1. ALL PHP file system functions (dir, is_dir, is_file, file, filemtime, filesize, file_exists etc.) only accept and return file names in ISO-8859-1, irrespective of the default_charset set in the program or ini files.

  2. Where a filename contains a Unicode character, dir->read will return it as the corresponding ISO-8859-1 character if there is one; otherwise it will substitute a question mark.

  3. When referencing a file, e.g. in is_file or file, if you pass in a UTF-8 file name, the file will not be found when the name contains any character encoded as two or more bytes. However, is_file(utf8_decode($filename)) etc. will work, provided every character in the name is representable in ISO-8859-1.

In other words, PHP 5 is not capable of addressing files whose names contain characters outside ISO-8859-1 at all.

If a UTF-8 URL with multibyte characters is requested and this corresponds directly to a file, PHP won't be able to open the file because it cannot address it.
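Point 3 above hinges on whether a name survives conversion to ISO-8859-1. As a rough sketch (the helper name isLatin1Representable is my own invention, not part of PHP), the check can be written by testing that every character falls in the range U+0000 to U+00FF, which is exactly what ISO-8859-1 covers:

```php
<?php
// Hypothetical helper (not from the thread): returns true when every
// character of a UTF-8 string lies in U+0000..U+00FF, i.e. the range
// that ISO-8859-1 can represent.
function isLatin1Representable($utf8Name)
{
    // The /u modifier makes PCRE read the subject as UTF-8. preg_match()
    // returns false on invalid UTF-8, which is treated as "no" here.
    return preg_match('/^[\x{00}-\x{FF}]*$/u', $utf8Name) === 1;
}
```

Under this assumption, a name such as "caf\xC3\xA9.php" (é is U+00E9) passes the check and could be tried with utf8_decode, while "\xE0\xAE\x95.php" (the க from the question, U+0B95) fails it, so no ISO-8859-1 form of the name exists.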

If you simply want pretty URLs in your language the suggestion of using mod_rewrite seems like a good one.

But if you are storing and retrieving files uploaded and downloaded by users, this problem has to be resolved. One way is to use an arbitrary ASCII file name, such as an incrementing number, on the server and index the files in a database, an XML file, or similar. Another way is to store the files in the database itself as BLOBs. A third way (which makes it easier to see what is going on, and is not subject to problems if your index gets corrupted) is to encode the filenames yourself: a good technique is to urlencode all your incoming filenames when storing them on the server disk and urldecode them before setting the filename in the MIME header for the download. Every character other than letters, digits, -, _ and . is then encoded as %nn (urlencode writes a space as +, and % itself becomes %25), so problems with spaces in file names, cross-platform support and pattern matching are largely avoided.
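A minimal sketch of the urlencode approach, assuming the incoming name is already UTF-8. The helper names encodeStoredName and decodeStoredName are made up for illustration; only urlencode and urldecode are real PHP functions:

```php
<?php
// Hypothetical helpers (names are illustrative, not from this answer).

// Turn a user-supplied UTF-8 file name into an ASCII-only name for disk.
function encodeStoredName($originalName)
{
    return urlencode($originalName);
}

// Recover the original name, e.g. before sending the download header.
function decodeStoredName($storedName)
{
    return urldecode($storedName);
}

// The letter from the question, written as its UTF-8 bytes, plus a space.
$original = "\xE0\xAE\x95 notes.php";
$stored = encodeStoredName($original);
echo $stored, "\n"; // prints %E0%AE%95+notes.php
```

The stored name contains only ASCII, so the file system functions can address it, and decodeStoredName($stored) gives back the original bytes. Note that names such as "." and ".." pass through urlencode unchanged, so a real upload handler would still need to reject those separately.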

David Earl