views:

2606

answers:

6

I stumbled over this passage in the Django tutorial:

Django models have a default str() method that calls unicode() and converts the result to a UTF-8 bytestring. This means that unicode(p) will return a Unicode string, and str(p) will return a normal string, with characters encoded as UTF-8.

Now, I'm confused because afaik Unicode is not any particular representation, so what is a "Unicode string" in Python? Does that mean UCS-2? Googling turned up this "Python Unicode Tutorial" which boldly states

Unicode is a two-byte encoding which covers all of the world's common writing systems.

which is plain wrong, or is it? I have been confused many times by character set and encoding issues, but here I'm quite sure that the documentation I'm reading is confused. Does anybody know what's going on in Python when it gives me a "Unicode string"?

+2  A: 

Python stores Unicode as UTF-16. str() will return the UTF-8 representation of the UTF-16 string.

Jon Works
Python stores Unicode strings as UTF-16 or UTF-32, depending on the platform and compile options.
ΤΖΩΤΖΙΟΥ
On what platform does str(unicode_string) return UTF-8? Did you try it? e.g. str(u"\u0369")
ΤΖΩΤΖΙΟΥ
A: 

From Wikipedia on UTF-8:

UTF-8 (8-bit UCS/Unicode Transformation Format) is a variable-length character encoding for Unicode. It is able to represent any character in the Unicode standard, yet the initial encoding of byte codes and character assignments for UTF-8 is backwards compatible with ASCII. For these reasons, it is steadily becoming the preferred encoding for e-mail, web pages[1], and other places where characters are stored or streamed.

So, it's anywhere between one and four bytes depending on which character you wish to represent within the realm of Unicode.

From Wikipedia on Unicode:

In computing, Unicode is an industry standard allowing computers to consistently represent and manipulate text expressed in most of the world's writing systems.

So it's able to represent most (but not all) of the world's writing systems.

I hope this helps :)

Andy
A: 

so what is a "Unicode string" in Python?

Python 'knows' that your string is Unicode. Hence if you do regex on it, it will know which is character and which is not etc, which is really helpful. If you did a strlen it will also give the correct result. As an example if you did string count on Hello, you will get 5 (even if it's Unicode). But if you did a string count of a foreign word and that string was not a Unicode string than you will have much larger result. Pythong uses the information form the Unicode Character Database to identify each character in the Unicode String. Hope that helps.

Ravi Chhabra
+6  A: 

Meanwhile, I did a refined research to verify what the internal representation in Python is, and also what its limits are. "The Truth About Unicode In Python" is a very good article which cites directly from the Python developers. Apparently, internal representation is either UCS-2 or UCS-4 depending on a compile-time switch. So Jon, it's not UTF-16, but your answer put me on the right track anyway, thanks.

Hanno Fietz
+4  A: 

This is a nice slideshow about how to use unicode with python.

cnu
+19  A: 

what is a "Unicode string" in Python? Does that mean UCS-2?

Unicode strings in Python are stored internally either as UCS-2 (fixed-length 16-bit representation, almost the same as UTF-16) or UCS-4/UTF-32 (fixed-length 32-bit representation). It's a compile-time option; on Windows it's always UTF-16 whilst many Linux distributions set UTF-32 (‘wide mode’) for their versions of Python.

You are generally not supposed to care: you will see Unicode code-points as single elements in your strings and you won't know whether they're stored as two or four bytes. If you're in a UTF-16 build and you need to handle characters outside the Basic Multilingual Plane you'll be Doing It Wrong, but that's still very rare, and users who really need the extra characters should be compiling wide builds.

plain wrong, or is it?

Yes, it's quite wrong. To be fair I think that tutorial is rather old; it probably pre-dates wide Unicode strings, if not Unicode 3.1 (the version that introduced characters outside the Basic Multilingual Plane).

There is an additional source of confusion stemming from Windows's habit of using the term “Unicode” to mean, specifically, the UTF-16LE encoding that NT uses internally. People from Microsoftland may often copy this somewhat misleading habit.

bobince
Please, people, vote this answer up, even if the other chosen "answer" remains chosen.
ΤΖΩΤΖΙΟΥ
[shrug] both are correct; it's the implications of ‘len('ΤΖΩΤΖΙΟΥ')==8’ that really define what a Unicode string *is*, I suppose.
bobince
I disagree; I read the question, and it says "What is a Unicode string in Python". The chosen answer seems like a mesh of random sentences, while your answer seems much more to the point; however, this is an issue I won't further pursue. Cheers :)
ΤΖΩΤΖΙΟΥ
Hadn't revisited this post in a while, changing this to be the accepted answer. (finally...)
Hanno Fietz
I think the difference between UCS-2 and UTF-16 is at least noteworthy, since one is fixed-length and the other isn't. If I care about the internal representation at all, I want to know that.
Hanno Fietz
It really seems to be UCS-2, I'll edit that in.
Hanno Fietz
Is it really UCS-2? Since Python may handle characters > `sys.maxunicode`, only that you may happen to slice characters in the middle. With UCS-2, how would it be possible to display/store/encode/decode characters above `sys.maxunicode`? (Tested with Python 3.1)
kaizer.se
It must be UTF-16, since UCS-2 does not support surrogate pairs. Demontration on narrow build of Python 3.1, breaking a character up in surrogates: `list(chr(sys.maxunicode + 1))`. The result is `['\ud800', '\udc00']`. Can someone confirm that on (narrow) Python 2 as well?
kaizer.se
Yes, Python2 also allows a single non-BMP character to be created as two surrogate code units via `unichr` or `\U00nnnnnn` string literal escape. So technically it is using UTF-16 with UCS-2 semantics. I hate having to use the term ‘UTF-16’ though, since it can mean either a series of 16-bit code units, or the big-or-little-endian byte-based encoding of the same, which causes a whole load of confusion. In practice all ‘UCS-2’ is really ‘UTF-16’ since the latter is a more commonly used superset of the former.
bobince