Is it considered as a good practice to pick Unicode string over regular string when coding in Python? I mainly work on the Windows platform, where most of the string types are Unicode these days (i.e. .NET String, '_UNICODE' turned on by default on a new c++ project, etc ). Therefore, I tend to think that the case where non-Unicode string objects are used is a sort of rare case. Anyway, I'm curious about what Python practitioners do in real-world projects.
From my practice -- use unicode.
At beginning of one project we used usuall strings, however our project was growing, we were implementing new features and using new third-party libraries. In that mess with non-unicode/unicode string some functions started failing. We started spending time localizing this problems and fixing them. However, some third-party modules doesn't supported unicode and started failing after we switched to it (but this is rather exclusion than a rule).
Also I have some experience when we needed to rewrite some third party modules(e.g. SendKeys) cause they were not supporting unicode. If it was done in unicode from beginning it will be better :)
So I think today we should use unicode.
P.S. All that mess upwards is only my hamble opinion :)
If you are dealing with severely constrained memory or disk space, use ASCII strings. In this case, you should additionally write your software in C or something even more compact :)
As you ask this question, I suppose you are using Python 2.x.
Python 3.0 changed quite a lot in string representation, and all text now is unicode.
I would go for unicode in any new project - in a way compatible with the switch to Python 3.0 (see details).
Additional to Mihails comment I would say: Use Unicode, since it is the future. In Python 3.0, Non-Unicode will be gone, and as much I know, all the "U"-Prefixes will make trouble, since they are also gone.
It can be tricky to consistently use unicode strings in Python 2.x - be it because somebody inadvertently uses the more natural str(blah)
where they meant unicode(blah)
, forgetting the u
prefix on string literals, third-party module incompatibilities - whatever. So in Python 2.x, use unicode only if you have to, and are prepared to provide good unit test coverage.
If you have the option of using Python 3.x however, you don't need to care - strings will be unicode with no extra effort.
Yes, use unicode.
Some hints:
When doing input output in any sort of binary format, decode directly after reading and encode directly before writing, so that you never need to mix strings and unicode. Because mixing that tends to lead to UnicodeEncodeDecodeErrors sooner or later.
Common Python newbie errors with Unicode (not saying you are a newbie, but this may be read by newbies): Don't confuse encode and decode. Remember, UTF-8 is an ENcoding, so you ENcode Unicode to UTF-8 and DEcode from it.
Do not fall into the temptation of setting the default encoding in Python to whatever you use most. That is just going to give you problems if you reinstall or move to another computer or suddenly need to use another encoding.
Remember, not all of Python 2s standard library accepts unicode. If you feed a method unicode and it doesn't work, but it should, try feeding it ascii and see. Examples: urllib.urlopen(), which fails with unhelpful errors if you give it a unicode object instead of a string.
Hm. That's all I can think of now!