I get an encoding error on this line:

s =  "%s:%s: %s: %s\n" % (filename, lineno, category.__name__, message)

UnicodeEncodeError: 'ascii' codec can't encode character u'\xc4' in position 44: ordinal not in range(128)

I tried to reproduce this error by passing all combinations of parameters to the string format, but the closest I got was an "ascii decode" error (by passing a unicode string and a high-ASCII byte string simultaneously, which forced conversion of the byte string to unicode using the ASCII decoder).

However, I did not manage to get an "ascii encode" error. Does anybody have an idea?

+1  A: 

One of the operands you are passing is not suitable for ASCII encoding - perhaps it contains either Unicode or Latin-1 characters. Change the format string to Unicode and see what happens.

Vinay Sajip
This should produce a _decode_ error, e.g. s = "%s %s" % (unichr(2000), chr(200)). The error here seems to be something else.
Joakim Lundborg
@cortex: Sometimes Python decides not to coerce to unicode, but coerce to string. I'm not sure exactly how that decision is made.
Lennart Regebro
+2  A: 

This happens when Python tries to coerce an argument:

s = u"\u00fc"
print str(s)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 0: ordinal not in range(128)

This happens because one of your arguments is an object (not a string of any kind) and Python calls str() on it. There are two solutions: Use a unicode string for the format (s = u"%s...") or wrap each argument with repr().
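A minimal sketch of that coercion, under some assumptions: FakeWarning and coerce_like_py2_str are hypothetical names (nothing here comes from the warnings module), and the ASCII encode that Python 2 performs implicitly inside str() is written out explicitly so the snippet behaves the same on Python 2 and Python 3.

```python
class FakeWarning(object):
    """Hypothetical stand-in for a warning object that stores unicode text."""

    def __init__(self, text):
        self.text = text

    def __str__(self):
        return self.text


def coerce_like_py2_str(obj):
    # Assumption: this spells out what Python 2's str(obj) does when
    # __str__ returns unicode, namely encode it with the ASCII codec.
    return obj.__str__().encode("ascii")


try:
    coerce_like_py2_str(FakeWarning(u"\xc4rger"))
    outcome = "ok"
except UnicodeEncodeError:
    outcome = "UnicodeEncodeError"

print(outcome)
```

With a pure-ASCII message the same call goes through, which is probably why the problem only shows up for some warnings.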

Aaron Digulla
The format is in the warnings module, so I'd rather not change it, but adding repr() around passed parameters sounds really good. Thank you!
Deda Mraz
Then you'll get extra quotes and an extra u. Works as a hack, but not very pretty.
Lennart Regebro
The error happens when a warning thrown by the database is captured for logging. As the logging failed, I'm completely in the dark about the original problem, which is the worst place to be. I like my logs readable as much as the next person, so I've decided to wrap the formatting in a try: except: block, doing it "nicely" first and using repr() only in the case of an encoding error, including throwing an extra warning about the encoding issue. IMHO, that is not a hack, it's better, safer logging.
Deda Mraz
Logging ought to be unicode safe but that's probably only true for Python 3.
Aaron Digulla
Logging is, I believe, Unicode safe under Python 2.x, as long as you know what you're doing (that's generally the case for Unicode and Python 2.x, not just logging). Any messages that are in bytes (i.e. str objects rather than unicode) need to be converted to unicode using the appropriate encoding, otherwise you'll get these kinds of problems, caused by mixing str and unicode incorrectly. File-based logging handlers allow you to specify an encoding, and stream-based handlers can take a stream which has an encoding wrapper around it.
Vinay Sajip
+1  A: 

You are mixing unicode and str objects.

Explanation: In Python 2.x, there are two kinds of objects that can contain text: str and unicode. str is a string of bytes, so each element can only be a value between 0 and 255. unicode is a string of Unicode characters.

You can convert between str and unicode with the "encode" and "decode" methods:

>>> "thisisastring".decode('ascii')
u'thisisastring'

>>> u"This is ä string".encode('utf8')    
'This is \xc3\xa4 string'

Note the encodings. Encodings are ways of representing unicode text as strings of bytes.

If you try to add str and unicode together, Python will try to convert one to the other. But by default it will use ASCII as the encoding, which covers a-z, A-Z, 0-9, and punctuation such as !"#$%&/()=?'{[]} etc. Anything else will fail.

You will at that point get either an encoding error or a decoding error, depending on whether Python tries to convert the unicode to str or the str to unicode. Usually it tries to decode, that is, convert to unicode. But sometimes it decides to coerce to str instead. I'm not entirely sure why.
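To make the two directions concrete, here is a small sketch that triggers each error with an explicit codec call instead of relying on Python 2's implicit coercion, so it should run unchanged on Python 2 and 3. The helper name classify is made up for this example.

```python
def classify(action):
    """Run action() and report which kind of Unicode error it raised, if any."""
    try:
        action()
    except UnicodeEncodeError:
        return "encode error"
    except UnicodeDecodeError:
        return "decode error"
    return "no error"


text = u"\xc4"   # unicode text containing one non-ASCII character
raw = b"\xc4"    # byte string containing one byte above 127

# unicode -> str direction: the character has no ASCII representation
print(classify(lambda: text.encode("ascii")))

# str -> unicode direction: the byte is not valid ASCII
print(classify(lambda: raw.decode("ascii")))
```

Which error you see tells you which direction the conversion went, and that is the useful clue when hunting for the place where the two types got mixed.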

Update: The reason you get an encode error and not a decode error above is that message in the above code is neither str nor unicode. It's another object that has a __str__ method. Python therefore does str(message) before passing it in, and that fails, since the internally stored message is a unicode object that can't be encoded as ASCII.

Or, more simply answered: It fails because warnings.warn() doesn't accept unicode messages.

Now, the solution:

Don't mix str and unicode. If you need to use unicode, and you apparently do, try to make sure all strings are unicode all the time. That's the only way to be sure you avoid this. This means that whenever you read in a string from disk, or call a function that may return anything other than pure ASCII str, decode it to unicode as soon as possible. And when you need to save it to disk, send it over a network, or pass it to a method that does not understand unicode, encode it to str as late as possible.

In this specific case, the problem is that you pass unicode to warnings.warn() and you can't do that. Pass a string. If you don't know what it is (as seems to be the case here) because it comes from somewhere else, your try/except solution with repr() works fine, although doing an encode would be a possibility too.
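A rough sketch of "decode early, encode late", assuming the data on disk is UTF-8 (the file name, the sample text and the choice of UTF-8 are all made up for illustration). Files are opened in binary mode so that every conversion is explicit.

```python
import io
import os
import tempfile

# Set up a small UTF-8 encoded file to read from (illustrative data only).
path = os.path.join(tempfile.mkdtemp(), "names.txt")
with io.open(path, "wb") as f:
    f.write(u"J\xe4rvi\n".encode("utf-8"))

# Decode as early as possible: bytes become unicode right after reading.
with io.open(path, "rb") as f:
    text = f.read().decode("utf-8")

# All processing happens on unicode objects only.
shouted = text.strip().upper()

# Encode as late as possible: convert back to bytes only when writing.
with io.open(path, "wb") as f:
    f.write(shouted.encode("utf-8"))

# Read the raw bytes back to see what actually landed on disk.
with io.open(path, "rb") as f:
    stored = f.read()
```

Everything between the read and the write only ever sees unicode, so no implicit ASCII conversion can sneak in.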
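For what it's worth, the try/except fallback could look roughly like this. It is only a sketch: safe_format and Unprintable are hypothetical names, and Unprintable exists solely to simulate an object whose text conversion fails the way a unicode-carrying warning does under Python 2.

```python
def safe_format(filename, lineno, category_name, message):
    """Format a warning line, falling back to repr() on encoding trouble."""
    try:
        # The "nice" attempt, using the same format as in the question.
        return "%s:%s: %s: %s\n" % (filename, lineno, category_name, message)
    except UnicodeError:
        # Fallback: %r uses repr(), which escapes the awkward characters.
        return "%s:%s: %s: %r (could not be formatted as text)\n" % (
            filename, lineno, category_name, message)


class Unprintable(object):
    """Hypothetical object whose __str__ fails with an encode error."""

    def __str__(self):
        return u"\xc4".encode("ascii")  # raises UnicodeEncodeError

    def __repr__(self):
        return "Unprintable()"


print(safe_format("db.py", 12, "Warning", "all fine"))
print(safe_format("db.py", 12, "Warning", Unprintable()))
```

Catching UnicodeError covers both the encode and the decode variant, so the log line gets written whichever way the coercion went.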

Lennart Regebro
I think the questioner is perfectly aware of the fact that the problem is that unicode and str are mixed somehow; the question is why this error is triggered on an operation that normally should coerce output to unicode.
Joakim Lundborg
Possible, but I went for an exhaustive answer. And the problem is still the mixing of unicode and str. Why it gets one error instead of the other in this specific case I don't know, I can't reproduce it. But I have seen it happen myself.
Lennart Regebro