I just set up PortablePython on my system so I can run Python scripts from PHP. I have some very basic code (below) that lists all the files in a directory, but it doesn't work with Japanese filenames. It works fine with English filenames, but it throws errors (below) as soon as the directory contains any file with Japanese characters in its name.

import os, glob

path = r'G:\path'
for infile in glob.glob(os.path.join(path, '*')):
    print("current file is: ", infile)

It works fine in 'PyScripter-Portable.exe'. However, when I run 'PortablePython\App\python.exe "test.py"' from the command prompt or from PHP, it produces the following errors:

current file is:  Traceback (most recent call last):
  File "test.py", line 5, in <module>
    print("current file is: ", infile)
  File "PortablePython\App\lib\io.py", line 1494, in write
    b = encoder.encode(s)
  File "PortablePython\App\lib\encodings\cp437.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 37-40: character maps to <undefined>



I'm very new to Python and am only using it to work around PHP not being able to read Unicode filenames on Windows, so I really need this to work. Any help you can give me would be great.

+1  A: 

The problem is probably that the output destination you're printing to doesn't use the same encoding as the file system. The general rule is that you should get text into Unicode as soon as possible, and then convert it to whatever byte encoding you need on output (e.g. UTF-8).

Since you're dealing with filenames, they should be in the filesystem encoding:

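import glob, os, sys

# Python 2 only: glob returns byte strings for a byte-string pattern,
# so decode each name with the filesystem encoding ('path' is from the question).
fse = sys.getfilesystemencoding()
filenames = [unicode(x, fse) for x in glob.glob(os.path.join(path, '*'))]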

Now all your filenames are Unicode, and you need to figure out the correct encoding for output to the command prompt or wherever else it is going. (You can start the command prompt with the /u flag, "cmd /u", which makes the output of its internal commands to a pipe or file Unicode.)
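
For Python 3, where strings are already Unicode, the "encode on output" half of this rule might look like the sketch below. It is only an illustration under assumptions: the helper names `format_line` and `write_listing` are made up for this example, and UTF-8 is assumed to be what the consumer (for instance the PHP script) expects.

```python
import glob
import os


def format_line(name, encoding="utf-8"):
    """Return one listing line for `name`, already encoded to bytes."""
    return ("current file is: " + name + "\n").encode(encoding)


def write_listing(path, stream, encoding="utf-8"):
    """Write one encoded line per entry in `path` to a binary stream."""
    # In Python 3, glob returns str for a str pattern, so no decoding is needed.
    for name in sorted(glob.glob(os.path.join(path, "*"))):
        stream.write(format_line(name, encoding))


# Hypothetical usage from a script (sys.stdout.buffer is the raw byte stream
# behind print, so the console code page is not asked to encode anything):
#     import sys
#     write_listing(r"G:\path", sys.stdout.buffer)
```

Writing bytes directly sidesteps the cp437 encoder that appears in the traceback; the console may still display the bytes incorrectly, but a program reading the output receives valid UTF-8.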

Ryan Ginstrom
That doesn't work for me (I'm using 3.0 btw). Tips? Also, cmd /u still spits out the same errors.
Jon
+1  A: 

Assuming you're using Python 2.x, try changing your strings to unicode, like this:

import glob, os

path = u'G:\\path'
for infile in glob.glob(os.path.join(path, u'*')):
    # Python 2 print statement; print(a, b) would print a tuple here
    print u"current file is: " + infile

That should let Python's filesystem-related functions know that you want to work with Unicode filenames.

Forest
I'm using 3.0, will it still work?
Jon
Hm... maybe not. Python 3.0 already uses unicode for its strings.
Forest
Out of curiosity, what happens when you replace your print statement with this? `print( infile.encode('utf8'))`
Forest
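In Python 3, `encode` returns a `bytes` object, and `print` shows a `bytes` object as its repr, so every non-ASCII byte appears as a `\x` escape. A small sketch of that behaviour, using a made-up example filename rather than one from the question:

```python
name = "\u65e5\u672c\u8a9e.txt"   # "日本語.txt", an example name only
encoded = name.encode("utf-8")     # a bytes object, not a str

# print(encoded) displays the same text as repr(encoded):
shown = repr(encoded)
print(shown)
# b'\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e.txt'

# Decoding with the same codec recovers the original text unchanged.
roundtrip = encoded.decode("utf-8")
print(roundtrip == name)
# True
```

The escapes are only how the bytes are displayed; the underlying data is ordinary UTF-8.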
It actually works, lol, but it returns escaped characters or whatever (\x8f\xe3\x81\x97, etc.). This is a dumb question because I swear I have worked with these before, but how do I decode them, mainly in PHP? I know there is a function for it, but I can't think of what to search for to find it again. Thanks, by the way; this helped a lot.
Jon
Update: I tried utf8_decode and it didn't work. Then I tried pasting the string into a print statement in double quotes, and that decodes it. That doesn't help with decoding everything, but it sure does make me happy. Do you know of a function that does this?
Jon
Last update: Thanks to Johan K in the comment section for utf8_decode on php.net, I've fixed it. In case anyone is wondering, I just needed the following code: $return = preg_replace("#(\\\x[0-9A-Fa-f]{2})#e", "chr(hexdec('\\1'))", $return);
Jon
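
For comparison, the same unescaping step can be sketched in Python: each literal `\xNN` sequence in the text is replaced by the single byte it names. The function name `unescape_hex` is hypothetical, and the sketch assumes the input contains no escaped backslashes (`\\`), which it would misread.

```python
import re

# Matches a literal backslash, an "x", and exactly two hex digits.
_HEX_ESCAPE = re.compile(rb"\\x([0-9A-Fa-f]{2})")


def unescape_hex(data):
    """Turn literal \\xNN sequences in `data` (bytes) into the bytes they name."""
    return _HEX_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), data)


# rb"..." keeps the backslashes literal, as they arrive in the printed output.
escaped = rb"\xe6\x97\xa5.txt"
restored = unescape_hex(escaped)
print(restored == "\u65e5.txt".encode("utf-8"))
# True
```

Writing the UTF-8 bytes straight to the output stream on the Python side would avoid the escapes altogether, so no unescaping would be needed on the PHP side.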