views: 213

answers: 6

I know that Django uses unicode strings all over the framework instead of normal Python strings. What encoding do normal Python strings use? And why don't they use unicode?

+1  A: 

From Python 3.0 on, all strings are unicode by default; there is also the bytes datatype (Python documentation).

So the Python developers think that using unicode is a good idea; that it is not used universally in Python 2 is mostly due to backwards compatibility. It also has performance implications.
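
For instance, a minimal Python 3 sketch of the two types (UTF-8 chosen only as an example encoding):

    text = "caf\u00e9"           # str: a unicode string by default
    data = text.encode("utf-8")  # bytes: the encoded representation

    print(type(text))   # <class 'str'>
    print(type(data))   # <class 'bytes'>
    print(data)         # b'caf\xc3\xa9'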

Fabian
Python 2 and Python 3 have the exact same level of unicode support, and both have two string types, each with the same semantics as its counterpart in the other.
Mike Graham
But the default changed to unicode strings. That was the only thing I wanted to say there (somewhat ambiguously, I confess).
Fabian
The syntax changed so that unicode literals didn't have a character in front of their quote marks and bytestrings did, right.
Mike Graham
3.x `bytes` doesn't quite have the same semantics as 2.x `str`.
dan04
@dan04 Please could you give an example of how the semantics differ?
blokeley
@blokeley, the biggest things are that iterating over or indexing Py3 `bytes` gives you ints instead of length-1 `bytes` instances, autoconversion is gone (thank god), and `bytes` and `str` no longer share a base class.
Mike Graham
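
For instance, the indexing difference in a quick interpreter session:

    # Python 3: indexing bytes yields ints
    >>> b"abc"[0]
    97
    >>> list(b"abc")
    [97, 98, 99]

    # Python 2: indexing str yields length-1 strings
    >>> "abc"[0]
    'a'
    >>> list("abc")
    ['a', 'b', 'c']
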
A: 

Before Python 3.0, the default string encoding was ASCII, but it could be changed. Unicode string literals were written u"...". This was silly.
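
A short Python 2 interpreter sketch of what that looked like:

    >>> type("abc")              # plain literal: byte string
    <type 'str'>
    >>> type(u"abc")             # u-prefixed literal: unicode string
    <type 'unicode'>
    >>> u"abc" + "def"           # implicit conversion uses the ASCII default
    u'abcdef'
    >>> u"abc" + "caf\xc3\xa9"   # non-ASCII bytes break that implicit conversion
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)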

katrielalex
Maybe silly, but necessary as an interim step between all strings being ASCII and all strings being UNICODE.
S.Lott
Oh, I don't doubt that this was the best way to do it. It just made for the rather odd situation where you had to `u` all your strings (which slightly begged the question: what were the non-`u`-ed ones?)! =p
katrielalex
@S.Lott, all strings being ASCII doesn't properly describe the situation Python's ever been in.
Mike Graham
@Mike Graham: That's a weird statement. Rather than say what's wrong, could you provide a correction? The original str data type was formally restricted to ASCII. That's what it said in the docs. What are you claiming?
S.Lott
@S.Lott, The `str` data type represents a sequence of bytes. These bytes are not restricted to being 0-127 (the original values of ASCII) or to semantically referring to ASCII text, nor were they ever on both counts.
Mike Graham
So, instead of ASCII, I should write the words "ISO/IEC 8859"? Is that what you're claiming? Would that be "better" in some way, to introduce this subtlety?
S.Lott
No; they're arbitrary bytes, which *might* semantically map to a sequence of code points through some encoding, but that doesn't have to be the case. They're just bytes. Calling them ISO/IEC 8859 isn't more correct than calling them ASCII. Bytestrings can just as well contain stuff encoded in UTF-8 or Shift-JIS, or they could be the binary representation of a float, an integer... They're just bytes. What is correct is that the default *source* encoding for 2.x is ASCII.
lvh
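
For example, in Python 2 the very same str type holds UTF-8 encoded text and raw binary data alike (a small interpreter sketch):

    >>> import struct
    >>> s1 = u"caf\xe9".encode("utf-8")   # UTF-8 encoded text
    >>> s2 = struct.pack("!f", 1.5)       # big-endian float: not text at all
    >>> type(s1), type(s2)
    (<type 'str'>, <type 'str'>)
    >>> s1, s2
    ('caf\xc3\xa9', '?\xc0\x00\x00')
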
+2  A: 

Python 2.x strings are 8-bit, nothing more. The encoding may vary (though ASCII is assumed). I guess the reasons are historical: few languages, especially languages that date back to the last century, used unicode right from the start.

In Python 3, all strings are unicode.

delnan
Quite right: `str` does not have an encoding, it's just bytes which can be used for data that is text of any encoding. (Incidentally, though, both Python 2 and 3 have unicode and byte strings. In Python 3 they are `str` and `bytes` respectively and in Python 2 they are `unicode` and `str` respectively.)
Mike Graham
@delnan: FWIW Tcl uses unicode internally for all strings, and has done so for over a decade (since version 8.1, circa 1999). There is no unicode string type and non-unicode string type; everything is unicode.
Bryan Oakley
@Bryan, Indeed, and the issue of encoding is pushed off to channels. This is sort of good and conceivably a better design, but also can be less flexible.
Mike Graham
+11  A: 

Normal Python strings (Python 2.x str) don't have an encoding: they are raw data. In Python 3 these are called "bytes", which is an accurate description: they are simply sequences of bytes, which can be text encoded in any encoding (several are common!) or non-textual data altogether.

For representing text, you want unicode strings, not byte strings. unicode instances are sequences of unicode codepoints represented abstractly without an encoding; this is well-suited for representing text.

Bytestrings are important because to represent data for transmission over a network, or for writing to a file, or whatever, you cannot have an abstract representation of unicode; you need a concrete representation of bytes. Though they are often used to store and represent text, this is at least a little naughty.

This whole situation is complicated by the fact that, while you should turn unicode into bytes by calling encode and turn bytes into unicode using decode, Python will try to do this automagically for you, using a global encoding you can set, which defaults to ASCII (the safest choice). Never depend on this in your code, and never ever change it to a more flexible encoding: explicitly decode when you get a bytestring, and encode if you need to send a string somewhere external.
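
A minimal sketch of that explicit round trip (Python 2; UTF-8 is only an example, use whatever encoding the external data actually is):

    raw = 'caf\xc3\xa9'                  # a bytestring arriving from a file, socket, etc.
    text = raw.decode('utf-8')           # unicode: abstract code points, safe to treat as text
    assert text == u'caf\xe9'

    out = text.upper().encode('utf-8')   # concrete bytes again before writing them out
    assert out == 'CAF\xc3\x89'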

Mike Graham
+9  A: 

Hey! I'd like to add some stuff to the other answers; unfortunately I don't have enough rep yet to do that properly :-(

FWIW, Mike Graham's post is pretty good and that's probably what you should be reading first.

Here are a few comments:

  1. The need to prefix unicode literals with "u" in 2.x is pretty easily removed in recent (2.6+) 2.x Pythons: `from __future__ import unicode_literals`.
  2. Similarly, ASCII is only the default source encoding. Python understands a variety of coding hints, including the emacs-style `# -*- coding: utf-8 -*-`. For more information, see PEP 0263. Changing the source encoding affects how Unicode literals (regardless of their prefix or lack of prefix, as affected by point 1) are interpreted. In Py3k, the default source encoding is UTF-8.
  3. Python of course does use an encoding internally for Unicode strings (str in py3k, unicode in 2.x), because at some point in time stuff's going to have to be written to memory. Ideally, this would never be evident to the end user. Unfortunately nothing's perfect, and you can occasionally run into problems with this: specifically if you use funky squiggles outside of the Unicode Basic Multilingual Plane. Since Python 2.2, we've had what are called wide builds and narrow builds; these names refer to the type used internally to store Unicode code points. Wide builds use UCS-4, which uses 4 bytes to store a Unicode code point. (This means UCS-4's code unit size is 4 bytes, or 32 bits.) Narrow builds use UCS-2. UCS-2 only has 16 bits, and therefore cannot encode all Unicode code points accurately (it's like UTF-16, except without the surrogate pairs). To check, test the value of sys.maxunicode (see the snippet just after this list). If it's 1114111, you've got a wide build (which can correctly represent all of Unicode). If it's less, well, don't fret too much. The BMP (code points 0x0000 to 0xFFFF) covers most people's needs. For more information, see PEP 0261.
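
A quick check of which build you are running (point 3 above):

    import sys

    if sys.maxunicode > 0xFFFF:          # 1114111 == 0x10FFFF
        print("wide build: all code points are represented directly")
    else:                                # 65535 == 0xFFFF
        print("narrow build: code points above U+FFFF need surrogate pairs")
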
lvh
*Narrow* builds use UTF-16 (also note that UCS-2 and UTF-16 are considered synonyms on Wikipedia; I used to think they were different, just like you do), with surrogate pairs and all. See here: http://codepad.org/RjuAeWFK . So please edit your answer.
ΤΖΩΤΖΙΟΥ
Huh? The Wikipedia page says they are *not* equivalent. In fact it specifically states that the difference is that UCS-2 is fixed width and does not support surrogate pairs (that's really saying the same thing twice). Quoted from there: "The older UCS-2 (2-byte Universal Character Set) standard is a similar character encoding that was superseded by UTF-16 in Unicode version 2.0, though it still remains in use. UCS-2 is fixed length and always encodes characters into a single 16-bit code unit. It does not support surrogate pairs and can only encode characters in the BMP range U+0000 through U+FFFF."
lvh
Second comment because I couldn't fit it in one. Although UCS-2 and UTF-16 *are* distinct things, it's not entirely clear-cut what Python uses internally on narrow builds. Quote from Thomas Wouters: "01:57 <Yhg1s> well, it's called UCS-2 because it doesn't treat the surrogates as a single character... but it's also UTF-16 because it *has* surrogates :)" The behavior you are seeing is a consequence of the latter.
lvh
Finally, from the Wikipedia page: "Because of the technical similarities and upwards compatibility from UCS-2 to UTF-16, the two are often erroneously conflated and used as if interchangeable, so that strings encoded in UTF-16 are sometimes misidentified as being encoded in UCS-2." I think that makes it obvious that they are quite the opposite of "considered synonyms".
lvh
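
The behaviour under discussion, as it shows up in a 2.x interpreter (U+1D11E, MUSICAL SYMBOL G CLEF, lies outside the BMP):

    >>> s = u"\U0001D11E"
    >>> len(s)          # 2 on a narrow build (surrogate pair); 1 on a wide build
    2
    >>> import sys; sys.maxunicode
    65535               # narrow build; a wide build reports 1114111
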
A: 

What encoding do normal Python strings use?

In Python 3.x

str is Unicode. Internally this may be either UTF-16 or UTF-32, depending on whether your Python interpreter was built with "narrow" or "wide" Unicode characters.

The Windows version of CPython uses UTF-16. On Unix-like systems, UTF-32 tends to be preferred.

In Python 2.x

str is a byte string type like C char. The encoding isn't defined by the language, but is whatever your locale's default encoding is. Or whatever the MIME charset of the document you got off the Internet is. Or, if you get a string from a function like struct.pack, it's binary data, and doesn't meaningfully have a character encoding at all.

unicode strings in 2.x are equivalent to str in 3.x.
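
For example, the same decode step under both sets of names:

    # Python 2: str (bytes) -> unicode
    >>> 'caf\xc3\xa9'.decode('utf-8')
    u'caf\xe9'

    # Python 3: bytes -> str, same operation, renamed types
    >>> b'caf\xc3\xa9'.decode('utf-8')
    'café'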

And why don't they use unicode?

Because Python (slightly) predates Unicode. And because Guido wanted to save all the major backwards-incompatible changes for 3.0. Strings in 3.x do use Unicode by default.

dan04
What's the downvote for?
dan04
-1 "On Windows, strings are always UTF-16" is utter codswallop. You mean something like: Windows CPython binaries are usually provided as a "narrow" (16-bit) Unicode implementation, with minimal support via surrogates for code points outside the BMP. One may compile a "wide" (32-bit) exe if required. Python 2.6: your rant refers to `str` objects and completely ignores `unicode` objects.
John Machin
Well, the OP *did* specifically ask about "normal" strings.
dan04
@dan04 For people writing internationally-usable systems on 2.x, `unicode` is normal.
John Machin
Minor technical nitpick. Narrow builds use UCS-2, not UTF-16. The critical difference is that they cannot accurately represent (i.e., as a single code point) code points that would be encoded using a surrogate pair in UTF-16.
lvh