I see some frameworks like Django using Unicode all over the place, so it seems like it might be a good idea.

On the other hand, it seems like a big pain to have all these extra 'u's floating around everywhere.

What will be a problem if I don't do this?

Are there any issues that will come up if I do do this?

I'm using Pylons right now as my framework.

+10  A: 

In Python 3, all strings are Unicode. You can prepare for this by using u'' strings wherever you need to; when you eventually upgrade to Python 3 with the 2to3 tool, all the u prefixes will disappear. You'll also be in a better position because you will already have tested your code with Unicode strings.

See Text Vs. Data Instead Of Unicode Vs. 8-bit for more information.

Greg Hewgill
+14  A: 

You can avoid the u'' prefix in Python 2.6 and later by doing:

from __future__ import unicode_literals

That makes plain 'string literals' unicode objects, just as they are in Python 3.

nosklo
Awesome. Super useful tip.
docgnome
+1 It's too bad this can't be combined with the selected answer. They both are the 'best' answer to address this issue.
Evan Plaice
+2  A: 

What will be a problem if I don't do this?

I'm a westerner living in Japan, so I've seen first-hand what is needed to work with non-ASCII characters. The problem if you don't use Unicode strings is that your code will be a frustration to the parts of the world that use anything other than A-Z. Our company has had a great deal of frustration getting certain web software to handle Japanese characters without making a total mess of them.

It takes a little effort for English speakers to appreciate how great Unicode is, but it really is a terrific bit of work to make computers accessible to all cultures and languages.

"Gotchas":

  1. Make sure your output web pages state the encoding in use properly (e.g. using the charset parameter of the Content-Type header), and then encode all Unicode strings properly at output. Python 3's Unicode strings are a great improvement for getting this right.

  2. Do everything with Unicode strings, and only convert to a specific encoding at the last moment, when doing output. Other languages, such as PHP, are prone to bugs when manipulating Unicode in e.g. UTF-8 form. Say you have to truncate a Unicode string. If it's in UTF-8 form internally, there's a risk you could chop off a multi-byte character half-way through, resulting in rubbish output. Python's use of Unicode strings internally makes it harder to make these mistakes.
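
To make the truncation gotcha concrete, here is a minimal sketch. The sample text and variable names are only for illustration, and the snippet is written to run on Python 2.6+ as well as Python 3.3+:

```python
# Three Japanese characters (日本語), written with escapes so the
# source file itself stays pure ASCII.
text = u"\u65e5\u672c\u8a9e"

# In UTF-8 each of these characters takes three bytes.
encoded = text.encode("utf-8")

# Slicing the Unicode string always cuts on a character boundary.
safe = text[:2]

# Slicing the encoded bytes can cut through the middle of a character:
# the first four bytes hold one full character plus a stray lead byte.
broken = encoded[:4]
try:
    broken.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
```

If you truncate before encoding, as in `safe`, the result is always valid text; truncating after encoding, as in `broken`, leaves bytes that no longer decode.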

Craig McQueen
Yep. If you plan to do *any* kind of text manipulation (e.g. changing capitalization, chopping words into letters), use Python's unicode objects or you'll feel pain.
Marius Gedminas
A: 

Using Unicode internally is a good way to avoid problems with non-ASCII characters. Convert at the boundaries of your application (incoming data to unicode, outgoing data to UTF-8 or whatever). Pylons can do the conversion for you in many cases: e.g. controllers can safely return unicode strings; SQLAlchemy models may declare Unicode columns.
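
A rough sketch of the "convert at the boundaries" idea, independent of Pylons. The function name `shout` is made up for this example, and it assumes the incoming bytes are UTF-8, which in a real application you would have to know or detect:

```python
def shout(raw_bytes, encoding="utf-8"):
    """Hypothetical handler: bytes in, bytes out, Unicode in between."""
    # Boundary in: decode incoming bytes to a Unicode string once.
    text = raw_bytes.decode(encoding)

    # Inside the application: work only with Unicode strings.
    result = text.upper()

    # Boundary out: encode at the last moment, when producing output.
    return result.encode(encoding)


incoming = u"caf\u00e9".encode("utf-8")   # stands in for request data
outgoing = shout(incoming)
```

Because the uppercasing happens on the decoded text, the accented character is handled as one letter rather than as two unrelated bytes.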

Regarding string literals in your source code: the u prefix is usually not necessary. You can safely mix str objects containing ASCII with unicode objects. Just make sure all your string literals are either pure ASCII or are u"unicode".
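
The reason the "pure ASCII" rule matters: when Python 2 mixes a str with a unicode object, it implicitly decodes the str using the ASCII codec. The sketch below spells that step out explicitly (the helper name `mix` is made up), so it behaves the same on Python 2 and Python 3:

```python
def mix(byte_string, text):
    # Python 2 performs this ASCII decode implicitly for
    # `byte_string + text`; here it is written out by hand.
    return byte_string.decode("ascii") + text


# A pure-ASCII byte string mixes without trouble.
fine = mix(b"cafe", u"!")

# A byte string holding UTF-8 encoded non-ASCII text does not.
try:
    mix(b"caf\xc3\xa9", u"!")
    mixed_ok = True
except UnicodeDecodeError:
    mixed_ok = False
```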

Marius Gedminas