The following code examines the behaviour of the float() function when fed a non-ASCII symbol:

import sys

try:
  float(u'\xbd')
except ValueError as e:
  print sys.getdefaultencoding() # in my system, this is 'ascii'
  print e[0].decode('latin-1') # u'invalid literal for float(): ' followed by the 1/2 (one half) character
  print unicode(e[0]) # raises "UnicodeDecodeError: 'ascii' codec can't decode byte 0xbd in position 29: ordinal not in range(128)"

My question: why is the error message e[0] encoded in Latin-1? The default encoding is ASCII, and this seems to be what unicode() expects.

Platform is Ubuntu 9.04, Python 2.6.2

+2  A: 

The ASCII encoding only includes the bytes with values <= 127. The range of characters represented by these bytes is identical in most encodings; in other words, "A" is chr(65) in ASCII, in latin-1, in UTF-8, and so on.

The one half symbol, however, is not part of the ASCII character set, so when Python tries to encode this symbol into ASCII, it can do nothing but fail.
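To make that concrete, here is a minimal sketch (the variable names are mine, and it should run unchanged on Python 2.6+ and Python 3). It checks that "A" is the same byte in all three encodings, and that the one half character only encodes once you leave ASCII:

```python
# "A" encodes to the same single byte in all three encodings.
encodings = ['ascii', 'latin-1', 'utf-8']
encoded_a = [u'A'.encode(name) for name in encodings]

# The one half character (U+00BD) has no ASCII representation.
half = u'\xbd'
try:
    half.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

latin1_bytes = half.encode('latin-1')   # a single byte, 0xBD
utf8_bytes = half.encode('utf-8')       # two bytes, 0xC2 0xBD
```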

Update: Here's what happens (I assume we're talking CPython):

float(u'\xbd') leads to PyFloat_FromString in floatobject.c being called. This function, given a unicode object, in turn calls PyUnicode_EncodeDecimal in unicodeobject.c. From skimming over the code, I gather that this function turns the unicode object into a string by replacing every character whose Unicode code point is < 256 with the byte of that value, i.e. the one half character, having the code point 189, is turned into chr(189).

Then, PyFloat_FromString does its work as usual. At this moment, it's working with a regular string, which happens to contain a byte outside the ASCII range. It doesn't care about this; it just finds a byte that's not a digit, a period or the like, so it raises the ValueError.

The argument to this exception is a string

"invalid literal for float(): " + evil_string

That's fine; an exception message is, after all, a string. It's only when you try to decode this string, using the default encoding ASCII, that this turns into a problem.
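Here is a small sketch of that last step, assuming the message is the byte string described above (I build it by hand rather than calling float(), so it behaves the same on Python 2 and 3). In Python 2, unicode(e[0]) amounts to decoding with the default encoding, which is what the explicit .decode('ascii') stands in for here:

```python
# The exception message as a byte string: ASCII text followed by the raw byte 0xBD.
message = b"invalid literal for float(): " + b"\xbd"

try:
    message.decode('ascii')
    failure_position = None
except UnicodeDecodeError as exc:
    failure_position = exc.start   # index of the first undecodable byte

# Decoding as latin-1 cannot fail, because every byte value maps to a code point.
as_latin1 = message.decode('latin-1')
```

The failing index matches the "position 29" reported in the question: the prefix is 29 bytes long, so the offending byte sits right after it.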

balpha
That doesn't answer my question, I'm afraid. I have slightly edited the original text to be clearer (I put the question in bold). I understand ASCII cannot offer a representation for the strange character. My issue is that e[0] seems to be encoded in Latin-1, even though ASCII is the default encoding. My reasoning is that float() should have raised an ASCII-encoded exception (or Unicode). However, it's in Latin-1 or something similar instead. It should have tried to encode the error message in ASCII and failed, raising a UnicodeDecodeError in the first place, not a ValueError.
pablobm
Okay, I have updated the answer.
balpha
A: 

From experimenting with your code snippet, it would seem I have the same behavior on my platform (Py2.6 on OS X 10.5).

Since you established that e[0] is encoded with latin-1, the correct way to convert it to unicode is to do .decode('latin-1'), and not unicode(e[0]).

Update: So it sounds like e[0] does not have a valid encoding. Definitely not latin-1. Because of that, as mentioned elsewhere in the comments, you'll have to call repr(e[0]) if you need to display this error message w/o causing a cascading exception.
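A quick sketch of why repr() is safe here, assuming a hand-built byte string in place of the real e[0] (so it runs on Python 2 and 3 alike; the names are mine):

```python
# Stand-in for e[0]: ASCII text followed by the raw byte 0xBD.
message = b"invalid literal for float(): \xbd"

# repr() escapes the non-ASCII byte instead of trying to decode it.
safe_text = repr(message)
is_pure_ascii = all(ord(ch) < 128 for ch in safe_text)
```

The result contains the four characters \xbd rather than the byte itself, so it can be printed or logged without any codec getting involved.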

Pavel Repin
It's not encoding as latin-1, it's just converting unicode ordinal values to bytes.
John Millikin
Updated, thanks.
Pavel Repin
+4  A: 

e[0] isn't encoded with latin-1; it just so happens that the byte \xbd, when decoded as latin-1, is the character U+00BD.

The conversion occurs in Objects/floatobject.c.

First, the unicode string must be converted to a byte string. This is performed using PyUnicode_EncodeDecimal():

if (PyUnicode_EncodeDecimal(PyUnicode_AS_UNICODE(v),
                            PyUnicode_GET_SIZE(v),
                            s_buffer,
                            NULL))
        return NULL;

which is implemented in unicodeobject.c. It doesn't perform any sort of character set conversion; it just writes bytes with values equal to the unicode ordinals of the string. In this case, U+00BD -> 0xBD.
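As a rough illustration of "bytes equal to the ordinals", here is a sketch with a hypothetical helper, ordinals_to_bytes (my name, not anything in CPython). It only models the pass-through case for code points below 256 and ignores whatever else the C function does; it should run on Python 2.6+ and 3:

```python
try:
    unichr          # Python 2
except NameError:
    unichr = chr    # Python 3

def ordinals_to_bytes(text):
    """Write one byte per character, equal to that character's code point."""
    return bytes(bytearray(ord(ch) for ch in text))

half_as_bytes = ordinals_to_bytes(u'\xbd')

# For code points 1..255 the result coincides with latin-1, byte for byte.
coincides_with_latin1 = all(
    ordinals_to_bytes(unichr(cp)) == unichr(cp).encode('latin-1')
    for cp in range(1, 256)
)
```

That coincidence is why decoding the message as latin-1 appears to work, even though no latin-1 codec was ever invoked.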

The statement formatting the error is:

PyOS_snprintf(buffer, sizeof(buffer),
              "invalid literal for float(): %.200s", s);

where s contains the byte string created earlier. PyOS_snprintf() writes a byte string, and s is a byte string, so it just includes it directly.

John Millikin
Should this be considered a bug in Python? My reasoning: if float() received a Unicode string, it should throw a Unicode-described exception if the message is going to include the input. Otherwise exceptions cannot be handled safely, as the example shows.
pablobm
I think calling it a bug is fair -- the error message should probably contain `repr(v)` instead of `str(s)`, as knowing the original input value is more useful than the decimal-encoded version.
John Millikin
+3  A: 

Very good question!

I took the liberty of digging into Python's source code, which is a mere command away on properly set up Linux distributions (apt-get source python2.5).

Damn, John Millikin beat me to it. That's right, PyUnicode_EncodeDecimal is the answer; it does this here:

/* (Loop ch in the unicode string) */
    if (Py_UNICODE_ISSPACE(ch)) {
        *output++ = ' ';
        ++p;
        continue;
    }
    decimal = Py_UNICODE_TODECIMAL(ch);
    if (decimal >= 0) {
        *output++ = '0' + decimal;
        ++p;
        continue;
    }
    if (0 < ch && ch < 256) {
        *output++ = (char)ch;
        ++p;
        continue;
    }
    /* All other characters are considered unencodable */
    collstart = p;
    collend = p+1;
    while (collend < end) {
        if ((0 < *collend && *collend < 256) ||
            !Py_UNICODE_ISSPACE(*collend) ||
            Py_UNICODE_TODECIMAL(*collend))
            break;
    }

See, it leaves every code point between 1 and 255 in place (unless it is whitespace or a decimal digit). Those code points are the latin-1 characters, because the first 256 Unicode code points were chosen to match latin-1.
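For anyone who would rather poke at this from Python, here is a rough pure-Python approximation of the branches quoted above. It is a sketch, not the real thing: encode_decimal_sketch is a hypothetical name, str.isspace() and unicodedata.decimal() stand in for the C macros, and the unencodable case simply raises instead of going through the codec error-handler machinery. It should run on Python 2.6+ and 3:

```python
import unicodedata

def encode_decimal_sketch(text):
    """Approximate the per-character branches of the C loop quoted above."""
    out = bytearray()
    for index, ch in enumerate(text):
        if ch.isspace():                        # stands in for Py_UNICODE_ISSPACE
            out.append(ord(' '))
            continue
        decimal = unicodedata.decimal(ch, -1)   # stands in for Py_UNICODE_TODECIMAL
        if decimal >= 0:
            out.append(ord('0') + decimal)
            continue
        if 0 < ord(ch) < 256:                   # passed through as a raw byte
            out.append(ord(ch))
            continue
        # Everything else is treated as unencodable.
        raise UnicodeEncodeError('decimal', text, index, index + 1,
                                 'invalid decimal Unicode string')
    return bytes(out)

half_bytes = encode_decimal_sketch(u'\xbd')                 # raw byte 0xBD
arabic_digits = encode_decimal_sketch(u'\u0661\u0662.5')    # digits become ASCII
```

So the one half character slips through as the byte 0xBD, while Unicode decimal digits from other scripts are rewritten as ASCII digits before the float parser ever sees them.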


Addendum

With this in place, you can verify it by trying a character outside latin-1, which throws a different exception:

>>> float(u"ħ")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'decimal' codec can't encode character u'\u0127' in position 0: invalid decimal Unicode string
kaizer.se