I've been working on a Python script that opens a file with a Unicode name (mostly Japanese) and saves it under a randomly generated non-Unicode filename on 64-bit Windows Vista, and I'm having issues. It works fine with non-Unicode filenames (even if the content is Unicode), but as soon as I pass in a Unicode filename, it fails.
Here's the code:

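import sys, os

try:
    # Input and output paths arrive as command-line arguments.
    inpath = sys.argv[1]
    outpath = sys.argv[2]

    # Copy the input file to the output path byte for byte.
    filein = open(inpath, "rb")
    contents = filein.read()
    filein.close()
    fileSave = open(outpath, "wb")
    fileSave.write(contents)
    fileSave.close()

    # Record the input path exactly as received, for debugging.
    testfile = open(outpath + '.test', 'wb')
    testfile.write(inpath)
    testfile.close()

except:
    # Log whatever went wrong; sys is imported above so it is always defined here.
    errlog = open('G:\\log.txt', 'w')
    errlog.write(str(sys.exc_info()))
    errlog.close()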



And the error:

(<type 'exceptions.IOError'>, IOError(2, 'No such file or directory'), <traceback object at 0x01092A30>)
+1  A: 

My guess is that sys.argv[1] and sys.argv[2] are just byte strings and don't natively support Unicode. You could confirm this by printing them and checking whether they contain the characters you expect. You should also print type(sys.argv[1]) to see which type Python thinks they are.
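
A minimal sketch of that check, using only the standard library. The helper name `describe_arg` is hypothetical, not something from the question. It reports the type name and the escaped repr of each argument, so the output stays plain ASCII and can be written to a log file whatever bytes arrived.

```python
import sys


def describe_arg(value):
    """Return the type name and an escaped repr of one command-line argument."""
    return "%s %r" % (type(value).__name__, value)


if __name__ == "__main__":
    # Print one line per argument, skipping the script name itself.
    for index, arg in enumerate(sys.argv[1:], 1):
        print("argv[%d]: %s" % (index, describe_arg(arg)))
```

On Python 2 a non-ASCII argument should show up as a `str` with `\x..` escapes, which tells you the bytes still need decoding and lets you guess which encoding produced them.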

Where do the command-line parameters come from? Do they come from another program or are you typing them on the command-line? If they come from another program, you could have the other program encode them to UTF-8 and then have your Python program decode them from UTF-8.

Which version of Python are you using?

Edit: here's a robust solution: http://code.activestate.com/recipes/572200/

Daniel Stutzbach
Actually I did something similar: I took out the first open and replaced the content with both of the args that are passed in. They're passed in fine, and the file is saved with the exact filenames that I set, even if the first one is Unicode. And it's coming from PHP, but that's not an issue; I've also tried hard-coding the filenames.
Jon
@Jon: It saves the exact filenames that you set... but in what encoding? You should also save str(type(sys.argv[1])) to see if Python thinks that sys.argv is a str or unicode type.
Daniel Stutzbach
I'm not sure, what's the best way to check? Also, I put str(type(sys.argv[1])) into the error log and it spits out: <type 'str'>
Jon
It's definitely a str type. In Python 2 it's always a str unless you convert it to unicode.
jcao219
@jcao219 this all seems like a massive pita... What do I need to do to get this working in 3? When I tried to use Python 3.0.1's python.exe to run the script, it spits out some error messages.
Jon
Take out the unicode() function; it does not exist in Python 3 because all strings are Unicode. Also change `except IOError, e:` to `except IOError as e:`.
jcao219
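
A small sketch of the `except IOError as e:` form, which is accepted by Python 2.6 and later as well as Python 3. The function name `read_or_error` is made up for illustration; it returns the error text instead of writing it to a log, which is an assumption about how you might want to use it.

```python
def read_or_error(path):
    """Return (contents, None) on success, or (None, message) if opening fails."""
    try:
        with open(path, "rb") as handle:
            return handle.read(), None
    except IOError as e:  # this spelling works on Python 2.6+ and Python 3
        # str(e) includes the errno text and the offending filename.
        return None, str(e)
```

Writing `str(e)` to the log is more readable than `str(sys.exc_info())` because it includes the path Python actually tried to open.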
@Jon: I edited my answer with a link to a solution that should work (in Python 2)
Daniel Stutzbach
@Daniel: Wow! That's a cool solution with ctypes. It would make this program Windows-specific, but I think that is okay for the OP's purpose.
jcao219
Thanks for the help, the other answer turned out to be what I needed but I'm definitely saving that link in case I need it later.
Jon
@jcao219: True, but only Windows uses Unicode natively for filenames anyway. On Unixy systems you can just pass bytes around.
Daniel Stutzbach
+2  A: 

You have to convert your inpath to unicode, like this:

inpath = sys.argv[1]
inpath = inpath.decode("UTF-8")
filein = open(inpath, "rb")

I'm guessing you are using Python 2.6, because in Python 3, all strings are unicode by default, so this problem wouldn't happen.
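
Put together, the whole copy step might look like the sketch below. It assumes the arguments really are UTF-8 encoded, which depends on the calling program; `to_text` and `copy_file` are hypothetical names, not part of the original script.

```python
def to_text(arg, encoding="utf-8"):
    """Decode a byte-string argument; pass already-decoded text through unchanged."""
    if isinstance(arg, bytes):  # on Python 2, bytes is an alias for str
        return arg.decode(encoding)
    return arg


def copy_file(inpath, outpath, encoding="utf-8"):
    """Copy inpath to outpath byte for byte, decoding both paths first."""
    src = to_text(inpath, encoding)
    dst = to_text(outpath, encoding)
    with open(src, "rb") as filein:
        contents = filein.read()
    with open(dst, "wb") as fileout:
        fileout.write(contents)
    return src
```

In the script this would be called as `copy_file(sys.argv[1], sys.argv[2])`. If the caller uses a different encoding, pass it as the third argument.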

jcao219
Nope, still the same error... And yup, I'm using 2.6.
Jon
What makes you think that sys.argv[1] is encoded with UTF-8?
Daniel Stutzbach
@Daniel, that's right, it's probably encoded in something else, such as shift_jis, euc-jp, or iso-2022-jp perhaps.
jcao219
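
If the encoding is unknown, one rough approach is to try a list of candidates in order. This is only a heuristic sketch: the function name `decode_with_fallback` and the candidate order are assumptions, and a decode that succeeds does not prove the guess was right, since bytes written in one Japanese encoding can sometimes decode without error in another.

```python
def decode_with_fallback(raw, encodings=("utf-8", "shift_jis", "euc_jp", "iso2022_jp")):
    """Return (text, encoding_name) for the first candidate that decodes raw."""
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            # This candidate rejected the bytes; move on to the next one.
            continue
    raise ValueError("none of the candidate encodings could decode the input")
```

UTF-8 goes first because it is strict and rarely accepts text written in another encoding. Knowing what the calling program actually emits is still far more reliable than guessing.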
Can you post the entire exception? Instead of using `sys.exc_info()`, you can do `except IOError, e:` and then write `str(e)` into the error log file.
jcao219
No such file or directory: u'G:\\misc\\2010-06-22 \u690d\u8349\u3042\u304b\u306d.txt'
Jon
What is the actual file's name in Japanese?
jcao219
Actually, there was a bit of a blunder on my part... I fixed it, and it turns out inpath.decode("UTF-8") worked. Thanks :)
Jon
No problem. UTF-8 is so common these days partly because it is backward compatible with ASCII: any pure-ASCII text is already valid UTF-8.
jcao219