Guido van Rossum's presentation about Python 3000 mentions several things that should eventually make the transition from Python 2 to Python 3 easier. He talks specifically about text handling, since the move to Unicode as the only representation of strings in Python 3 is one of the major changes.

As far as text handling goes, one slide (#14) says:

  • In 2.6:
    • Use bytes and b'…' for all data (Knowing these are just aliases for str and '…')
    • Use unicode and u'...' for all text
  • In 2.5:
    • '...' for data, u'...' for text

I am using Python 2.6.4. What exactly does this mean for me?

In Python's world, what is the difference between data and text?

+9  A: 

In a nutshell, the way text and data are handled in Py3k is arguably the most "breaking" change in the language. By knowing and avoiding, when possible, the situations where Python 2.6 logic works differently than in 3.x, we can ease the migration when it happens. Still, we should expect that some parts of the 2.6 logic will need special attention and modification, for example to deal with distinct encodings.

The idea behind the BDFL's suggestion on slide 14 is probably to start "using" the same types that Py3k supports (and only these), namely unicode strings for text (the str type) and sequences of 8-bit bytes for "data" (the bytes type).

The term "using" in the previous sentence is used rather loosely, since the semantics and associated storage/encoding for these types differ between the 2.6 and 3.x versions. In Python 2.6, the bytes type and the associated literal syntax (b'xyz') simply map to the str type. Therefore:

# in Py2.6
>>> 'mykey' == b'mykey'
True
>>> b'mykey'.__class__
<type 'str'>

# in Py3k
>>> 'mykey' == b'mykey'
False
>>> b'mykey'.__class__
<class 'bytes'>

To answer your question [in the remarks below]: in 2.6, whether you use b'xyz' or 'xyz', Python treats it as one and the same thing: a str. What is important is that you understand these as [potentially/in-the-future] two distinct types with distinct purposes:

  • str for text-like info, and
  • bytes for sequences of octets storing whatever data at hand.

For example, again staying close to your example/question, in Py3k you will be able to have a dictionary with two elements that have similar keys, one with b'mykey' and the other with 'mykey'. Under 2.6 this is not possible, since these two keys are really the same. What matters is that you know about this kind of thing and avoid (or explicitly mark in some special fashion in the code) the situations where the 2.6 code will not work in 3.x.
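A minimal sketch of that dictionary case, assuming a Python 3 interpreter (the variable name my_dict and the stored values are only illustrative):

```python
# Python 3: a text key and a bytes key are different keys, so both can coexist.
my_dict = {}
my_dict['mykey'] = 'stored under the text key'
my_dict[b'mykey'] = 'stored under the bytes key'

print(len(my_dict))           # two separate entries
print('mykey' == b'mykey')    # False in Python 3
```

Run under 2.6, the second assignment would overwrite the first, leaving a single entry.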

In Py3k, str is an abstract unicode string, a sequence of unicode code points (characters), and Python deals with converting it to/from its encoded form, whatever the encoding might be (as a programmer you do have a say about the encoding, but while you deal with string operations and such you do not need to worry about these details). In contrast, bytes is a sequence of 8-bit "things" whose semantics and encoding are left entirely to the programmer.
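To make the str/bytes split concrete, here is a small sketch, assuming Python 3 and using "Mañana" as an arbitrary sample word. It shows that the same text turns into different byte sequences depending on the encoding the programmer picks:

```python
# Python 3: str holds code points; bytes holds whatever encoding you chose.
text = "Ma\u00f1ana"                 # "Mañana", with the ñ written as an escape

as_utf8 = text.encode("utf-8")       # ñ becomes two bytes: 0xC3 0xB1
as_latin1 = text.encode("latin-1")   # ñ becomes one byte: 0xF1

print(as_utf8)      # b'Ma\xc3\xb1ana'
print(as_latin1)    # b'Ma\xf1ana'

# Decoding with the matching codec restores the same text either way.
assert as_utf8.decode("utf-8") == text
assert as_latin1.decode("latin-1") == text
```

The bytes objects carry no record of which encoding produced them; keeping track of that is the programmer's job.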

So, even though Python 2.6 doesn't see a difference between bytes and str, by explicitly using bytes() / b'...' for data and unicode() / u'...' for text, you...

  • ... prepare yourself and your program for the upcoming types and semantics of Py3k
  • ... make automatic conversion of the source code (with the 2to3 tool or another) easier, whereby the b in b'...' will remain and the u of u'...' will be removed (since the only string type will be unicode).

For more info:
Python 2.6 What's new (see PEP 3112 Bytes Literals)
Python 3.0 What's New (see Text Vs. Data Instead Of Unicode Vs. 8-bit near the top)

mjv
It's also good practice to make you really think about whether you mean a string of characters or a string of bytes! Only problem is of course you lose compatibility with Python 2.5 and earlier.
bobince
I understand *why* it is a good idea to do this, but I am still not clear as to *how* exactly. Unless a function explicitly expects a string of bytes, when do I use a string of bytes vs a string of characters? my_dict[b'mykey'] or my_dict[u'mykey']? When is it considered data, when is it considered text?
cschol
Thank you for the detailed update.
cschol
@cschol: I would suggest it's always a (unicode) string unless it's not. That is, always use unicode strings. When you find that that doesn't work for a specific problem, use the string of bytes.
Bryan Oakley
+2  A: 

The answer to your first question is simple: in Python 2.6 you can do as you used to. But, if you like, you can switch to Py3k standards by typing:

from __future__ import unicode_literals

Your second question needs more clarification:

Strings are data that print as human-readable characters. Not only Python but every language (that I know of) has its own way of dealing with strings.

However, the common grounds are encodings. Encodings are the way to map byte sequences to glyphs (ie. mostly printable symbols).

Python offers a simple way to overcome the complexities of managing encodings (when you put string literals in your code).

Let's see a very simple example:

>>> len("Mañana")
7

I only see 6 symbols, so I would expect len to return 6. Where is this extra "symbol" coming from? Well, in UTF-8 the symbol ñ is represented with 2 bytes. Before Py3k, string literals are just sequences of bytes. So Python sees that string as bytes and counts them all: Ma\xc3\xb1ana.

However, if I execute the following:

>>> len(u"Mañana")
6

So Python "knows" that the 2-byte sequence for "ñ" is to be considered a single letter.
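For comparison, a sketch of the same experiment under Python 3 (an assumption, since the session above is Python 2), where the two lengths come from two different types rather than from a u prefix:

```python
# Python 3: the plain literal is text; bytes appear only after encoding.
text = "Ma\u00f1ana"          # "Mañana", with the ñ written as an escape
data = text.encode("utf-8")   # the UTF-8 bytes a Python 2 str literal would hold

print(len(text))    # 6 characters
print(len(data))    # 7 bytes
print(data[2:4])    # b'\xc3\xb1', the two bytes that encode ñ
```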

This is by no means exclusive to Python. The following PHP script shows the same behavior:

manu@pavla:~$ php <<EOF
<?php
echo strlen("Mañana")."\n";
?>
EOF
7

The PHP solution happens to be more elaborate:

manu@pavla:~$ php <<EOF
<?php
echo mb_strlen("Mañana", "utf-8")."\n";
?>
EOF
6

Notice I have to substitute mb_strlen for strlen and I have to pass utf-8 (the encoding) as a second argument.

A word of warning: user-provided strings usually arrive as bytes, not unicode strings, so you need to take care of that. See more at http://mail.python.org/pipermail/python-list/2008-July/139193.html
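One common way to take care of it is to decode as soon as the input enters the program. Below is a minimal sketch; to_text is a hypothetical helper, not a standard function, and the UTF-8 default is an assumption about how the input was encoded:

```python
def to_text(value, encoding="utf-8"):
    """Return value as text, decoding it first if it arrived as bytes.

    Hypothetical helper; the UTF-8 default is an assumption about the input.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


raw = b"Ma\xc3\xb1ana"        # e.g. bytes read from a socket or a file
name = to_text(raw)

print(len(raw), len(name))    # 7 6
```

Since bytes is an alias for str in 2.6, the same isinstance check should also work there, returning a unicode object.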

manu
"Encodings are the way to map byte sequences to glyphs (i.e mostly printable symbols)". That isn't quite right. They are mappings of byte sequences to *characters*, whether or not those characters are ever printed. Mapping characters to glyphs is another process entirely, dependent on the font system.
quark
@quark. I agree. I was trying to simplify the discourse of the response which is quite large.
manu