ansaurus

Question

What is the internal representation of a string in Python 3.x?

Answer 1

A:

It depends: see here. This is still true for Python 3 as far as internal representation goes.

Ned Deily 2009-12-03 07:12:22

Answer 2

+1 A:

I think it's hard to compare UTF-16, which is just a sequence of 16-bit words, with Python's string object.

And if Python is compiled with the Unicode=UCS4 option, the comparison would be between UTF-32 and a Python string.

So it's better to consider them as belonging to different categories, although you can convert from one to the other.

S.Mark 2009-12-03 07:18:44

Answer 3

+2 A:

There has been NO CHANGE in Unicode internal representation between Python 2.X and 3.X.

It's definitely NOT UTF-16. UTF-anything is a byte-oriented EXTERNAL representation.

Each code point (character, surrogate, etc.) has been assigned a number from range(0, 2 ** 21). This is called its "ordinal".
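As a rough illustration of the "ordinal" idea, here is a minimal Python 3 sketch. The helper name ordinals is made up for this example; ord() and chr() are the built-ins that map between a character and its number.

```python
def ordinals(text):
    """Return the ordinal (code point number) of each character in text."""
    return [ord(ch) for ch in text]

# Hypothetical sample: Latin A, e-acute, euro sign (all inside the BMP).
sample = "A\u00e9\u20ac"
for ch, number in zip(sample, ordinals(sample)):
    print(repr(ch), number, hex(number))

# chr() is the inverse of ord().
assert chr(0x20AC) == "\u20ac"

# The highest code point Unicode defines, 0x10FFFF, fits in 21 bits.
assert 0x10FFFF < 2 ** 21
```

The sample sticks to BMP characters on purpose, so the result should not depend on whether the interpreter is a 16-bit or a 32-bit build.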

Really, the documentation you quoted says it all. Most Python binaries use 16-bit ordinals which restricts you to the Basic Multilingual Plane ("BMP") unless you want to muck about with surrogates (handy if you can't find your hair shirt and your bed of nails is off being de-rusted). For working with the full Unicode repertoire, you'd prefer a "wide build" (32 bits wide).

Briefly, the internal representation in a unicode object is an array of 16-bit unsigned integers, or an array of 32-bit unsigned integers (using only 21 bits).
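To make the two layouts concrete, the sketch below approximates what each kind of array would hold for the same text. It does not inspect the interpreter's memory; it encodes to UTF-16 and UTF-32 and unpacks the bytes into integers, on the assumption that this mirrors what a 16-bit or 32-bit build stores. The function names are invented for the example. Note that CPython 3.3 and later use a different, flexible internal layout, so this describes the builds discussed here only.

```python
import struct

def units_16(text):
    """Approximate the array of 16-bit unsigned integers for text."""
    raw = text.encode("utf-16-le")  # little-endian, no byte order mark
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))

def units_32(text):
    """Approximate the array of 32-bit unsigned integers for text."""
    raw = text.encode("utf-32-le")
    return list(struct.unpack("<%dI" % (len(raw) // 4), raw))

# Hypothetical sample: two BMP characters plus U+1F600, which lies outside the BMP.
sample = "A\u00e9\U0001F600"

print([hex(u) for u in units_16(sample)])  # U+1F600 needs a surrogate pair here
print([hex(u) for u in units_32(sample)])  # one integer per code point
```

With 16-bit units the non-BMP character takes two slots (a surrogate pair), which is the "mucking about with surrogates" mentioned above; with 32-bit units every code point takes exactly one slot.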

John Machin 2009-12-03 07:37:52

"Storing the Unicode code points in 16-bit integers" is called "UCS-2". Doing the same thing with 32-bit integers is UCS-4.

Joachim Sauer 2009-12-03 09:36:16

I'm not sure how saying that the process is called "UCS2" or "garbelfratzing" or whatever is helping the OP's understanding.

John Machin 2009-12-03 19:55:28

Calling something by its right name gives you a label for your new understanding, so that you can keep it until you encounter the idea again. We can't talk without words.

kaizer.se 2009-12-03 22:54:19

Answer 4

+1 A:

Looking at the source code:

unicodeobject.h:

/* --- Unicode Type ------------------------------------------------------- */

typedef struct {
    PyObject_HEAD
    Py_ssize_t length;   /* Length of raw Unicode data in buffer */
    Py_UNICODE *str;     /* Raw Unicode buffer */
    long hash;           /* Hash value; -1 if not set */
    int state;           /* != 0 if interned. In this case the two
                          * references from the dictionary to this object
                          * are *not* counted in ob_refcnt. */
    PyObject *defenc;    /* (Default) Encoded version as Python
                            string, or NULL; this is used for
                            implementing the buffer protocol */
} PyUnicodeObject;

The characters are stored as an array of Py_UNICODE. On most platforms, I believe Py_UNICODE is #defined as wchar_t.
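One way to check this from Python itself, sketched under some assumptions: sys.maxunicode reports the largest ordinal the build supports (0xFFFF on a 16-bit build, 0x10FFFF on a 32-bit one), and ctypes.sizeof(ctypes.c_wchar) reports the platform's wchar_t width. The function name unicode_build is made up for this example. On CPython 3.3 and later sys.maxunicode is always 0x10FFFF, so the check is only informative on older interpreters.

```python
import ctypes
import sys

def unicode_build():
    """Guess the build type from the largest supported ordinal."""
    if sys.maxunicode == 0xFFFF:
        return "narrow"  # 16-bit storage, BMP only without surrogates
    return "wide"        # full range up to 0x10FFFF

# Width of the platform's wchar_t in bytes (commonly 2 on Windows, 4 on Linux).
wchar_bytes = ctypes.sizeof(ctypes.c_wchar)

print("build:", unicode_build())
print("max ordinal:", hex(sys.maxunicode))
print("wchar_t bytes:", wchar_bytes)
```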

codeape 2009-12-03 09:25:36
