The following code raises "UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 2: ordinal not in range(128)":

filename = 'Spywaj.ttf'
print repr(filename)
>> 'Sp\xc2\x88ywaj.ttf'
filepath = os.path.join('/dirname', filename)

But the file is valid and exists on disk. The filename was extracted from the output of the "unzip -l" command. How can I join filenames like this?

OS and filesystem

Filesystem: ext3    relatime,errors=remount-ro 0       0
Locale: en_US.UTF-8

With Alex's suggestion, os.path.join works now, but I still cannot access the file on disk using the joined filename.

filename = filename.decode('utf-8')
filepath = os.path.join('/dirname', filename)
print filepath
>> /dirname/u'Sp\xc2\x88ywaj.ttf'
print os.path.isfile(filepath)
>> False

new_filepath = filepath.encode('Latin-1').encode('utf-8')
print new_filepath
>> /dirname/u'Sp\xc2\x88ywaj.ttf'
print type(filepath)
>> <type 'unicode'>
print os.path.isfile(new_filepath)
>> False

valid_filepath = glob.glob('/dirname/*.ttf')[0]
print valid_filepath
>> /dirname/Spywaj.ttf (SO cannot display the chars in filename)
print type(valid_filepath)
>> <type 'str'>
print os.path.isfile(valid_filepath)
>> True
+2  A: 

In both Latin-1 (ISO-8859-1) and Windows-1252, 0xc2 would be a capital A with a circumflex accent... which doesn't seem to be anywhere in the code you show! Can you please add a

print repr(filename)

before the os.path.join call (and also put the '/dirname' in a variable and print its repr, for completeness)? I'm thinking that maybe that stray character is there but you're not seeing it for some reason -- the repr will reveal it.

If you do have a Latin-1 (or Win-1252) non-Ascii character in your filename, you have to use Unicode -- and/or, depending on your OS and filesystem, some specific encoding thereof.

Edit: the OP confirms, thanks to repr, that there are actually two bytes that can't possibly be ASCII -- 0xc2 then 0x88, corresponding to what the OP thinks is one lowercase L. Well, that sequence would be a Unicode uppercase A with caret (codepoint 0x88) in the justly popular UTF-8 encoding - how that could look like a lowercase L to the OP beggars explanation, but I imagine some fonts could be graphically crazy enough to afford such confusion.

So I would first try filename = filename.decode('utf-8') -- that should allow the os.path.join to work. If open then balks at the resulting Unicode string (it might work, depending on the filesystem and OS), next attempt is to try using that Unicode object's .encode('Latin-1') and .encode('utf-8'). If none of the encodings work, information on the OS and filesystem in use, which the OP, I believe, hasn't given yet, becomes crucial.

Alex Martelli
@Alex, the char is not showing because SO's editor dropped it after the question was submitted. I just added the repr(filename) to the code above. I'm sure the '/dirname' part contains only ASCII chars.
jack
+1 for divination of the problem.
Adam Bernier
@Alex, thanks for updates, I tried your method but still cannot access the file on disk with the filename it joined.
jack
@Alex, U+0088 is a C1 control character, it's not uppercase A with anything. Caret??? U+00C2 is LATIN CAPITAL LETTER A WITH CIRCUMFLEX ... is that what you meant?
John Machin
@Jack, if you can't get it to work, why have you accepted Alex's answer???
John Machin
@John, I finally found the reason: actually '/dirname' is not a constant but a variable of unicode type. I got it to work by using os.path.join(str(dir_variable), filename.encode('raw_unicode_escape')). Because Alex was the first one to answer, and because of his detailed explanation, I accepted his answer. I also voted up S Mark's answer. Thanks to all.
jack
@jack: Has `filename` suddenly become a unicode object?? Alex's explanations are always detailed :-) but in this case not relevant to your real problem, which Ignacio was on to right from the beginning but you said that you tried '/dirname' and it didn't work -- this can not have been true (unless you made some other (unpublished) change to your code). In future please show the actual code, actual output, and actual traceback -- don't make it up.
John Machin
@John, the 2-byte sequence 0xC2, 0x88, is utf-8 encoding for Unicode codepoint U+00C2, a capital A with a circumflex aka caret over it; not sure what you mean by "C1" since that would be odd and c2 and 88 are both even.
Alex Martelli
@Alex: Consider testing your statements in the interactive interpreter before publishing: `'\xc2\x88'.decode('utf8')` produces `u'\x88'` (as you said yourself); `u'\u00c2'.encode('utf8')` produces `'\xc3\x82'`. U+0088 is (as I said) a C1 control character i.e. one of those in the interval U+0080 to U+009F inclusive; C0 control characters are U+0000 to U+001F. Caret != circumflex; consult a reputable dictionary, not urbanfictionary.com :-)
John Machin
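For anyone who wants to check the code points discussed in the comments above, here is a small sketch using only the standard unicodedata module. It uses explicit b'' and u'' literals, so it should behave the same on Python 2.7 and Python 3; the variable names are just for illustration.

```python
import unicodedata

raw = b'\xc2\x88'               # the two non-ASCII bytes shown by repr(filename)
decoded = raw.decode('utf-8')   # UTF-8 decodes them to one code point, U+0088

# General category 'Cc' marks a control character.
category = unicodedata.category(decoded)

# U+00C2 is a different character and has a different UTF-8 encoding.
a_circumflex_name = unicodedata.name(u'\u00c2')
a_circumflex_utf8 = u'\u00c2'.encode('utf-8')

print(hex(ord(decoded)))
print(category)
print(a_circumflex_name)
print(repr(a_circumflex_utf8))
```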
+1  A: 
filename = filename.decode('utf-8').encode("latin-1")

works for me with the file from Splywaj.zip

>>> os.path.isfile(filename.decode("utf8").encode("latin-1"))
True
>>>
S.Mark
A: 

=== Evidence problem 1 ===

"""It throws out "UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 2: ordinal not in range(128)" when executing following code:"""

filename = 'Spywaj.ttf'
print repr(filename)
>> 'Sp\xc2\x88ywaj.ttf'
filepath = os.path.join('/dirname', filename)

I can't see how it is possible to get that exception -- both args of os.path.join are str objects. There is no reason to try converting anything to unicode. Are you sure that the above code is exactly what you ran?
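A rough sketch of the distinction, assuming the only way to hit that exception is for one argument to be a unicode object (as the OP later reports about the directory variable). In Python 2, mixing unicode and str makes Python decode the str as ASCII; the explicit decode below stands in for that implicit step so the snippet also runs on Python 3. posixpath is used instead of os.path only to keep the separator fixed.

```python
import posixpath

dirname = b'/dirname'
filename = b'Sp\xc2\x88ywaj.ttf'

# Two byte strings: joining is plain concatenation, no decoding involved.
joined = posixpath.join(dirname, filename)

# The ASCII decode that Python 2 applies implicitly when the other
# argument is unicode; it fails on the first non-ASCII byte.
try:
    filename.decode('ascii')
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(repr(joined))
print(error)
```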

=== Evidence problem 2 ===

"""Alex's suggestion os.path.join works now but I still cannot access the file on disk with the filename it joined."""

filename = filename.decode('utf-8')
filepath = os.path.join('/dirname', filename)
print filepath
>> /dirname/u'Sp\xc2\x88ywaj.ttf'

Sorry, assuming that filename has not changed from the previous snippet, that's definitely impossible. It looks like the result of os.path.join('/dirname', repr(filename)) ... please ensure that you publish the code that you actually ran, together with actual output (and actual traceback, if any).

=== Confusion ===

new_filepath = filepath.encode('Latin-1').encode('utf-8')

Alex meant to try twice, each time with one of those encodings -- not try once with both encodings! As all the characters in filepath were in the ASCII range (see evidence problem 2) the effect was simply filepath.encode('ascii')
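Trying each encoding separately could look something like the sketch below. candidate_paths is a hypothetical helper name, and the unicode path is only an example of what a correctly decoded name would be; the point is one encode call per attempt, never two chained together.

```python
import os


def candidate_paths(unicode_path, encodings=('utf-8', 'latin-1')):
    """Encode unicode_path once per encoding, skipping encodings that fail."""
    candidates = []
    for encoding in encodings:
        try:
            candidates.append(unicode_path.encode(encoding))
        except UnicodeEncodeError:
            # This encoding cannot represent the name; try the next one.
            continue
    return candidates


# Example path only: the decoded form of the name from the question.
for path in candidate_paths(u'/dirname/Sp\x88ywaj.ttf'):
    print('%r -> %s' % (path, os.path.isfile(path)))
```

Whichever candidate makes os.path.isfile return True is the byte sequence the filesystem actually stores.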

=== Simple solution ===

You know how to find the name of the file that you are interested in:

valid_filepath = glob.glob('/dirname/*.ttf')[0]

If you must hard-code that name in your script, you can use the repr() function to get the representation that you can type into your script without worrying about utf8, unicode, encode, decode and all that noise:

print repr(valid_filepath)

Let's suppose that it prints '/dirname/Sp\xc2\x88ywaj.ttf' ... then all you need to do is carefully copy that and paste it into your script:

file_path = '/dirname/Sp\xc2\x88ywaj.ttf'
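If more than one .ttf file may be present, the same idea can be wrapped up as below. This is only a sketch: ttf_paths is a made-up helper name, and '/dirname' is the placeholder directory from the question. Passing a byte-string pattern keeps the results as byte strings, so repr() shows the exact bytes stored on disk.

```python
import glob
import os


def ttf_paths(dirname):
    """Return the .ttf paths found in dirname (a byte string), sorted."""
    return sorted(glob.glob(os.path.join(dirname, b'*.ttf')))


for path in ttf_paths(b'/dirname'):
    # repr() shows non-ASCII bytes as escapes that can be pasted into a script.
    print(repr(path))
```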
John Machin