To follow best practices for Unicode in python, you should prefix all string literals of characters with 'u'. Is there any tool available (preferably PyDev compatible) that warns if you forget it?
you should prefix all string literals with 'u'
No, not really.
You should prefix literals for strings of characters with u
. But not all strings are strings of characters. When you are talking to components that are byte based, like network services, or binary files, you need to be using byte strings.
eg. Want to try to write a Unicode string into a PNG file? Not sensible. Want to base64-decode the string Y2Fm6Q==
? You can't reasonably use a Unicode string here, base64 is explicitly bytes.
Sure, Python will often let you get away with passing a unicode string where a byte string is expected, but only by automatically encoding to ASCII. If the string contains non-ASCII characters you going to get UnicodeError
just as surely as if you'd used bytes where unicode was expected. “Unicode is right, bytes are wrong” is a damaging myth. Manipulation for both kinds of strings are required.
If you are concerned about the transition to Python 3, you should certainly mark up your character strings as u''
, but you should then also mark up your explicitly-bytes strings as b''
. Strings where it doesn't matter you can leave as ''
and let them get converted from byte strings to unicode strings on Python 3. There are lots of cases where Python 2 used to use bytes and Python 3 uses Unicode where it is appropriate to do this. But there are still plenty of cases where you do really need to be talking bytes, and having that converted to Python 3 as unicode will cause problems.
(The only problem with this is that b''
syntax requires Python 2.6 or later, so using it will make you incompatible with earlier versions.)
KennyTM's comment should be posted as an answer:
from __future__ import unicode_literals
This future declaration can be used in Python 2.6 and 2.7 and enables Python 3's string syntax so that unprefixed string literals are Unicode strings and byte arrays require a b
prefix.