I have embedded a Python interpreter in a C program. Suppose the C program reads some bytes from a file into a char array and learns (somehow) that the bytes represent text with a certain encoding (e.g., ISO 8859-1, Windows-1252, or UTF-8). How do I decode the contents of this char array into a Python string?

The Python string should in general be of type unicode; for instance, a 0x93 byte in Windows-1252 encoded input becomes u'\u201c'.

I have attempted to use PyString_Decode, but it always fails when there are non-ASCII characters in the string. Here is an example that fails:

#include <Python.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
     char c_string[] = { (char)0x93, 0 };
     PyObject *py_string;

     Py_Initialize();

     py_string = PyString_Decode(c_string, 1, "windows_1252", "replace");
     if (!py_string) {
          PyErr_Print();
          return 1;
     }
     return 0;
}

The error message is UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 0: ordinal not in range(128), which indicates that the ascii encoding is used even though we specify windows_1252 in the call to PyString_Decode.

The following code works around the problem by using PyString_FromString to create a Python string of the undecoded bytes, then calling its decode method:

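#include <Python.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
     char c_string[] = { (char)0x93, 0 };
     PyObject *raw, *decoded;

     Py_Initialize();

     /* Wrap the undecoded bytes in a Python str object. */
     raw = PyString_FromString(c_string);
     if (!raw) {
          PyErr_Print();
          return 1;
     }
     printf("Undecoded: ");
     PyObject_Print(raw, stdout, 0);
     printf("\n");

     /* Equivalent to raw.decode("windows_1252") in Python. */
     decoded = PyObject_CallMethod(raw, "decode", "s", "windows_1252");
     Py_DECREF(raw);
     if (!decoded) {
          PyErr_Print();
          return 1;
     }
     printf("Decoded: ");
     PyObject_Print(decoded, stdout, 0);
     printf("\n");
     Py_DECREF(decoded);
     return 0;
}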
+3  A: 

You don't want to decode the string into a Unicode representation, you just want to treat it as an array of bytes, right?

Just use PyString_FromString:

const char *cstring = "raw bytes";  /* any NUL-terminated byte buffer */
PyObject *pystring = PyString_FromString(cstring);

That's all. Now you have a Python str() object. See docs here: http://www.python.org/doc/2.5.2/api/stringObjects.html

I'm a little bit confused about how you say you want a "str or unicode." They're very different if you have non-ASCII characters. If you want to decode a C string and you know exactly what character set it's in, then yes, PyString_Decode is a good place to start.

Dan
I want to actually decode it, so whatever Python code ends up using the string does not need to know how it was originally encoded (in the input to the C program). Thanks for pointing out that I was being unclear; I have edited my question.
Vebjorn Ljosa
+2  A: 

Try calling PyErr_Print() in the "if (!py_string)" clause. Perhaps the Python exception will give you some more information.

fivebells
Thanks, I did and incorporated the information into the question.
Vebjorn Ljosa
No problem. If the advice was helpful, I'd appreciate an upvote. :-)
fivebells
+4  A: 

PyString_Decode does this:

PyObject *PyString_Decode(const char *s,
        Py_ssize_t size,
        const char *encoding,
        const char *errors)
{
    PyObject *v, *str;

    str = PyString_FromStringAndSize(s, size);
    if (str == NULL)
        return NULL;
    v = PyString_AsDecodedString(str, encoding, errors);
    Py_DECREF(str);
    return v;
}

In other words, it does basically what your second example does: it converts the bytes to a string object, then decodes that string. The problem arises because it calls PyString_AsDecodedString rather than PyString_AsDecodedObject. PyString_AsDecodedString calls PyString_AsDecodedObject, but then tries to convert the resulting unicode object back into a string object using the default encoding (for you, that looks like ASCII). That's where it fails.

I believe you'll need to make two calls, but you can use PyString_AsDecodedObject rather than calling the Python "decode" method. Something like:

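#include <Python.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
     char c_string[] = { (char)0x93, 0 };
     PyObject *py_string, *py_unicode;

     Py_Initialize();

     /* Step 1: wrap the raw bytes in a str object. */
     py_string = PyString_FromStringAndSize(c_string, 1);
     if (!py_string) {
          PyErr_Print();
          return 1;
     }
     /* Step 2: decode to a unicode object; nothing is re-encoded afterwards. */
     py_unicode = PyString_AsDecodedObject(py_string, "windows_1252", "replace");
     Py_DECREF(py_string);
     if (!py_unicode) {
          PyErr_Print();
          return 1;
     }

     Py_DECREF(py_unicode);
     return 0;
}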

I'm not entirely sure what the reasoning behind PyString_Decode working this way is. A very old thread on python-dev seems to indicate that it has something to do with chaining the output, but since the Python methods don't do the same, I'm not sure if that's still relevant.

Tony Meyer
Oops! Thanks Ljosa; fixed.
Tony Meyer