I have a directory that contains several files, many of which have non-English names. I am using PHP on Windows 7.

I want to list the file names and their contents using PHP.

Currently I am using DirectoryIterator and file_get_contents. This works for English file names but not for non-English (e.g. Chinese) file names.

For example, I have filenames like "एक और प्रोब्लेम.eml", "hello 鶨鶖鵨鶣鎹鎣.eml".

  1. DirectoryIterator is not able to get the filename using ->getFilename()
  2. file_get_contents is also not able to open even if I hard code the filename in its parameter.

How can I do it?

A: 

To discover the files I have this script:

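$content = scandir($directory); // $directory is the path to list
$list = "<select size=\"5\" name=\"file\" id=\"file\">\n";
foreach ($content as $name) {
    // escape the name so special characters cannot break the HTML
    $list .= "<option>" . htmlspecialchars($name) . "</option>\n";
}
$list .= "</select>\n";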

This successfully finds the file 鶨鶖鵨鶣鎹鎣, though I only tried it on a Linux distro.

To read it line by line, use:

$lines = file('file.txt');
// Loop through the array; show HTML source as text, with line numbers too.
foreach ($lines as $line_num => $line) {
    print "Line #<b>{$line_num}</b> : " . htmlspecialchars($line) . "<br />\n"; // or try it without htmlspecialchars()
}
Robijntje007
Yes, the problem is Windows.
Artefacto
+1  A: 

You give little detail about how it fails but, in my experience, the main problem with internationalized file names in PHP comes from using different charsets in your code and in your file system. I believe that NTFS uses UTF-16. If your script is encoded in, e.g., UTF-8, then a hard-coded non-English name is a UTF-8 byte sequence, so the file will not be found.

You can use iconv() to translate the names.
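
As a minimal sketch of that idea, assuming the script is saved as UTF-8 and the system codepage is Windows-1252 (both are assumptions, and the helper name `to_codepage` is made up for illustration), a name could be converted before it is handed to a filesystem function. It can only help for characters that exist in the target codepage:

```php
<?php
// Hypothetical helper: convert a UTF-8 file name to a single-byte codepage.
// The default 'CP1252' is an assumption; substitute your system's codepage.
function to_codepage($utf8Name, $codepage = 'CP1252')
{
    // iconv() normally returns false (and raises a notice, silenced here)
    // when a character has no equivalent in the target charset.
    return @iconv('UTF-8', $codepage, $utf8Name);
}

// "café.txt" written with byte escapes, so the example does not depend on
// how this script file itself is saved.
$name = to_codepage("caf\xC3\xA9.txt");
if ($name !== false) {
    // Show the converted bytes as hex; "é" is now the single byte e9.
    echo bin2hex($name), "\n";
}
```

The converted string is what you would pass to functions such as file_get_contents(). If the conversion fails, the name contains characters the codepage cannot express and this approach does not apply.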

Edit

Unicode can be hard to test due to limited support by most apps, including text editors. Browsers do it quite well so I wrote this test script and tested with Firefox:

<?php /* Save as UTF-8 without BOM (€ÁÑ) */

header('Content-Type: text/html; charset=utf-8');

if( isset($_POST['filename']) ){
    touch($_POST['filename']);
}

?><!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head><title></title>
</head>
<body>

<form action="" method="post">
<input type="text" name="filename" size="50">
<input type="submit" value="Create file">
</form>

<?php

echo '<ul>';
foreach(glob('*') as $i){
    echo '<li>' . htmlspecialchars($i) . '</li>';
}
echo '</ul>';

?>

</body>
</html>

Then, you can use http://www.lorem-ipsum.info/ to fetch some strings in exotic languages. My system (Windows XP) is using codepage Win-1252 (Western Europe), but that fact doesn't prevent PHP from creating and reading files like "知是指.txt". Of course, Windows Explorer displays garbage.

Álvaro G. Vicario
You won't get anything you can translate. `FindFirstFile` will return question marks in place of the characters that cannot be represented in the current codepage.
Artefacto
That's a Windows API function, isn't it? Does it really replace unknown chars with question marks?
Álvaro G. Vicario
I mean ACTUAL QUESTION MARKS; the actual "?" character – I'm not confused about character encodings.
Artefacto
@Álvaro G. Vicario Wait a second, I'll confirm it and post the actual debugger results.
Artefacto
If you're actually getting those results, I'll check if there's some compile switch that activates the correct behavior.
Artefacto
OK, you're cheating. If Windows Explorer displays garbage, you're not giving the files the correct name. You're naming the files `çŸ¥æ˜¯æŒ‡.txt` or something like that, which happens to translate to `知是指.txt` when each individual character is interpreted as a byte of a UTF-8 string, not actually `知是指.txt`.
Artefacto
It might be pure chance: if a double byte Chinese char happens to be equal to two single byte chars, you'll get the sum of two incorrect operations that happens to be correct. And my script tests in UTF-8 which I don't think is what Windows uses.
Álvaro G. Vicario
The fact that Windows stores the filenames in UTF-16 or UTF-32 or UTF-8 or whatever is irrelevant. What matters is that there is a WINAPI function that returns the filename in the current codepage (and the characters it can't represent will be converted to question marks) and there's another function that returns the filename encoded in UTF-16, which PHP does not use.
Artefacto
On the other hand, the encoding used in the filenames would be relevant for the EXT family in Linux et al. because there's no notion of "character". You can use the character encoding you wish, but you must be consistent. See http://www.mail-archive.com/[email protected]/msg10289.html
Artefacto
+2  A: 

This is not possible. It's a limitation of PHP. PHP uses the multibyte versions of Windows APIs; you're limited to the characters your codepage can represent.

See this answer.

Directory contents:

D:\Users\Cataphract\Desktop\teste2>dir
 Volume in drive D is GRANDEDISCO
 Volume Serial Number is 945F-DB89

 Directory of D:\Users\Cataphract\Desktop\teste2

01-06-2010  17:16    <DIR>          .
01-06-2010  17:16    <DIR>          ..
01-06-2010  17:15                 0 coptic small letter shima follows ϭ.txt
01-06-2010  17:18                86 teste.php
               2 File(s)             86 bytes
               2 Dir(s)  12.178.505.728 bytes free

Test file contents:

<?php
exec('pause');
foreach (new DirectoryIterator(".") as $v) {
    echo $v."\n";
}

Test file results:

.
..
coptic small letter shima follows ?.txt
teste.php

Debugger output:

Call stack (PHP 5.3.0):

>   php5ts_debug.dll!readdir_r(DIR * dp=0x02f94068, dirent * entry=0x00a7e7cc, dirent * * result=0x00a7e7c0)  Line 80   C
    php5ts_debug.dll!php_plain_files_dirstream_read(_php_stream * stream=0x02b94280, char * buf=0x02b9437c, unsigned int count=260, void * * * tsrm_ls=0x028a15c0)  Line 820 + 0x17 bytes   C
    php5ts_debug.dll!_php_stream_read(_php_stream * stream=0x02b94280, char * buf=0x02b9437c, unsigned int size=260, void * * * tsrm_ls=0x028a15c0)  Line 603 + 0x1c bytes  C
    php5ts_debug.dll!_php_stream_readdir(_php_stream * dirstream=0x02b94280, _php_stream_dirent * ent=0x02b9437c, void * * * tsrm_ls=0x028a15c0)  Line 1806 + 0x16 bytes    C
    php5ts_debug.dll!spl_filesystem_dir_read(_spl_filesystem_object * intern=0x02b94340, void * * * tsrm_ls=0x028a15c0)  Line 199 + 0x20 bytes  C
    php5ts_debug.dll!spl_filesystem_dir_open(_spl_filesystem_object * intern=0x02b94340, char * path=0x02b957f0, void * * * tsrm_ls=0x028a15c0)  Line 238 + 0xd bytes   C
    php5ts_debug.dll!spl_filesystem_object_construct(int ht=1, _zval_struct * return_value=0x02b91f88, _zval_struct * * return_value_ptr=0x00000000, _zval_struct * this_ptr=0x02b92028, int return_value_used=0, void * * * tsrm_ls=0x028a15c0, long ctor_flags=0)  Line 645 + 0x11 bytes  C
    php5ts_debug.dll!zim_spl_DirectoryIterator___construct(int ht=1, _zval_struct * return_value=0x02b91f88, _zval_struct * * return_value_ptr=0x00000000, _zval_struct * this_ptr=0x02b92028, int return_value_used=0, void * * * tsrm_ls=0x028a15c0)  Line 658 + 0x1f bytes   C
    php5ts_debug.dll!zend_do_fcall_common_helper_SPEC(_zend_execute_data * execute_data=0x02bc0098, void * * * tsrm_ls=0x028a15c0)  Line 313 + 0x78 bytes   C
    php5ts_debug.dll!ZEND_DO_FCALL_BY_NAME_SPEC_HANDLER(_zend_execute_data * execute_data=0x02bc0098, void * * * tsrm_ls=0x028a15c0)  Line 423  C
    php5ts_debug.dll!execute(_zend_op_array * op_array=0x02b93888, void * * * tsrm_ls=0x028a15c0)  Line 104 + 0x11 bytes    C
    php5ts_debug.dll!zend_execute_scripts(int type=8, void * * * tsrm_ls=0x028a15c0, _zval_struct * * retval=0x00000000, int file_count=3, ...)  Line 1188 + 0x21 bytes C
    php5ts_debug.dll!php_execute_script(_zend_file_handle * primary_file=0x00a7fad4, void * * * tsrm_ls=0x028a15c0)  Line 2196 + 0x1b bytes C
    php.exe!main(int argc=2, char * * argv=0x028a14c0)  Line 1188 + 0x13 bytes  C
    php.exe!__tmainCRTStartup()  Line 555 + 0x19 bytes  C
    php.exe!mainCRTStartup()  Line 371  C

Is it really a question mark?

dp->fileinfo
{dwFileAttributes=32 ftCreationTime={...} ftLastAccessTime={...} ...}
    dwFileAttributes: 32
    ftCreationTime: {dwLowDateTime=2784934701 dwHighDateTime=30081445 }
    ftLastAccessTime: {dwLowDateTime=2784934701 dwHighDateTime=30081445 }
    ftLastWriteTime: {dwLowDateTime=2784934701 dwHighDateTime=30081445 }
    nFileSizeHigh: 0
    nFileSizeLow: 0
    dwReserved0: 3435973836
    dwReserved1: 3435973836
    cFileName: 0x02f9409c "coptic small letter shima follows ?.txt"
    cAlternateFileName: 0x02f941a0 "COPTIC~1.TXT"
dp->fileinfo.cFileName[34]
63 '?'

Yes! It's character #63.
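
The substitution seen above can be imitated in plain PHP. This is only a sketch of the effect, not of what PHP does internally: it assumes the mbstring extension is loaded and that the codepage is Windows-1252, and `simulate_ansi_name` is a made-up helper name.

```php
<?php
// Hypothetical helper that imitates the lossy conversion: every character
// missing from the target codepage is replaced by "?" (character 63).
// 'Windows-1252' is an assumed codepage; requires the mbstring extension.
function simulate_ansi_name($utf8Name, $codepage = 'Windows-1252')
{
    $previous = mb_substitute_character();   // remember the current setting
    mb_substitute_character(0x3F);           // substitute with "?"
    $converted = mb_convert_encoding($utf8Name, $codepage, 'UTF-8');
    mb_substitute_character($previous);      // restore the setting
    return $converted;
}

// U+03ED (coptic small letter shima) is the byte pair CF AD in UTF-8.
echo simulate_ansi_name("coptic small letter shima follows \xCF\xAD.txt"), "\n";
```

Once the letter has been replaced, the original name cannot be recovered from the result, which is why no later charset conversion can fix it.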

Artefacto
Can't he just read and write names as single bytes?
Álvaro G. Vicario
@Álvaro G. Vicario He could, but he wouldn't have proper names. NTFS supports proper UCS-2 file names, what you're describing is a hack.
Artefacto
Your explanation could not be better. I've learnt a lot today :)
Álvaro G. Vicario