Should I use Unicode string by default?

+10 Q:

Is it considered good practice to prefer Unicode strings over regular strings when coding in Python? I mainly work on the Windows platform, where most string types are Unicode these days (e.g. the .NET String, '_UNICODE' turned on by default in a new C++ project, etc.). I therefore tend to think that cases where non-Unicode string objects are used are fairly rare. Anyway, I'm curious about what Python practitioners do in real-world projects.

+11 A:

From my practice -- use unicode.

At the beginning of one project we used ordinary strings. However, as the project grew, we implemented new features and started using new third-party libraries. In that mix of non-unicode and unicode strings, some functions started failing, and we began spending time tracking these problems down and fixing them. Some third-party modules did not support unicode, though, and started failing after we switched to it (but this is the exception rather than the rule).

I have also had to rewrite some third-party modules (e.g. SendKeys) because they did not support unicode. If they had been written with unicode from the beginning, it would have been better :)

So I think today we should use unicode.

P.S. All of the above is only my humble opinion :)

Mihail 2009-07-12 17:31:49

+1: always use unicode when you are handling text. Whenever the need arises to treat the text data as bytes (for instance when sending it over the network or writing it to disk), convert the unicode to a sequence of bytes (represented as a str in Python 2). Convert to bytes by calling encode, and back to unicode by calling decode or unicode().
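A minimal sketch of that round trip, not taken from the original comment. It assumes UTF-8 as the byte encoding and an arbitrary sample string; the literal is written with escapes so the snippet does not depend on the source file's encoding.

```python
# Text as the program should handle it internally
# (a unicode object in Python 2, a str in Python 3).
text = u"na\u00efve caf\u00e9"

# Text -> bytes: do this just before sending over the network or writing to disk.
data = text.encode("utf-8")

# Bytes -> text: do this right after receiving or reading.
restored = data.decode("utf-8")

# 10 characters become 12 bytes, because the two accented letters
# each take two bytes in UTF-8.
print(len(text), len(data))
```

The byte count differs from the character count, which is one reason to keep the two types apart and convert only at the boundary.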

codeape 2009-07-22 13:31:38

+1 A:

If you are dealing with severely constrained memory or disk space, use ASCII strings. In this case, you should additionally write your software in C or something even more compact :)

Jeff Ober 2009-07-12 17:38:16

+9 A:

As you ask this question, I suppose you are using Python 2.x.

Python 3.0 changed quite a lot in string representation, and all text now is unicode.
I would go for unicode in any new project - in a way compatible with the switch to Python 3.0 (see details).

Roberto Liffredo 2009-07-12 17:59:19

Yeah, future compatibility is quite important!

Mihail 2009-07-12 18:17:38

It's good to know what's coming in Python3, which I have not investigated yet. Thanks!

Kei 2009-07-12 19:25:30

+4 A:

In addition to Mihail's comment I would say: use Unicode, since it is the future. In Python 3.0, non-Unicode text strings are gone, and as far as I know all the "u" prefixes will cause trouble, since they are gone as well.
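For readers on a later interpreter, a small check of how the prefix behaves. This is a sketch that assumes Python 3.3 or newer, where the u prefix is accepted again as a no-op (PEP 414); on Python 3.0 to 3.2 the first assignment is a syntax error, as the answer expects.

```python
import sys

prefixed = u"caf\u00e9"   # literal with the Python 2 style prefix
plain = "caf\u00e9"       # literal without it

# On Python 3.3+ both literals produce the same text type.
same_type = type(prefixed) is type(plain)

print(sys.version_info[:2], same_type)
```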

Juergen 2009-07-12 17:59:48

+4 A:

It can be tricky to use unicode strings consistently in Python 2.x, whether because somebody inadvertently uses the more natural str(blah) where they meant unicode(blah), forgets the u prefix on string literals, or runs into third-party module incompatibilities. So in Python 2.x, use unicode only if you have to and are prepared to provide good unit test coverage.

If you have the option of using Python 3.x however, you don't need to care - strings will be unicode with no extra effort.
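As an illustration of the Python 3.x behaviour, here is a sketch that assumes a Python 3 interpreter. Text and bytes are separate types there, and mixing them fails immediately with a TypeError instead of going through the implicit ASCII conversion that Python 2 attempts.

```python
text = "caf\u00e9"            # str is unicode text in Python 3
data = text.encode("utf-8")   # bytes, a distinct type

try:
    text + data               # mixing the two types is refused outright
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(type(text).__name__, type(data).__name__, mixed_ok)
```

The early failure makes the accidental str/unicode mix-ups described above much easier to spot.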

romkyns 2009-07-12 18:15:26

+7 A:

Yes, use unicode.

Some hints:

1. When doing input/output in any sort of binary format, decode directly after reading and encode directly before writing, so that you never need to mix strings and unicode. Mixing them tends to lead to a UnicodeEncodeError or UnicodeDecodeError sooner or later.
2. A common Python newbie error with Unicode (not saying you are a newbie, but this may be read by newbies): don't confuse encode and decode. Remember, UTF-8 is an ENcoding, so you ENcode Unicode to UTF-8 and DEcode from it.
3. Do not give in to the temptation of setting Python's default encoding to whatever you use most. That is just going to give you problems if you reinstall, move to another computer, or suddenly need to use another encoding.
4. Remember, not all of Python 2's standard library accepts unicode. If you feed a method unicode and it doesn't work when it should, try feeding it an ASCII str instead and see. One example is urllib.urlopen(), which fails with unhelpful errors if you give it a unicode object instead of a string.
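The first hint above can be sketched roughly as follows. The helper names read_text and write_text, the file name demo.txt, and the choice of UTF-8 are assumptions made for illustration, not part of the answer. The encoding is passed explicitly rather than relying on any interpreter default, in the spirit of the third hint.

```python
import io
import os
import tempfile


def read_text(path, encoding="utf-8"):
    """Hypothetical helper: read raw bytes and decode them right away."""
    with io.open(path, "rb") as f:
        return f.read().decode(encoding)


def write_text(path, text, encoding="utf-8"):
    """Hypothetical helper: encode text only at the moment of writing."""
    with io.open(path, "wb") as f:
        f.write(text.encode(encoding))


# Everything between the two calls works with text only, never raw bytes.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
original = u"Gr\u00fc\u00dfe"
write_text(path, original)
roundtrip = read_text(path)
```

Keeping the conversions inside two small functions like these means the rest of the program never has to ask whether a value is bytes or text.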

Hm. That's all I can think of now!

Lennart Regebro 2009-07-12 18:54:32

Point 3 is so true - everybody I know (including me) made this error, and not only once!

Roberto Liffredo 2009-07-12 19:05:52

Re: "encode directly after writing" -- can you clarify? I think that should be "before" instead of "after", but I might be missing your point.

ars 2009-07-12 20:54:14

@Lennart: "Note that even if you after encode unicode into a string full of non-ascii text, this is still text, according to Python." ... In 3.x, str.encode() returns type bytes, and the ascii or not distinction seems irrelevant; what is the point that you are trying to make?

John Machin 2009-07-13 00:03:54

@ars: Correct, fixed that. @John: Ehhh, I think I only made things more confusing. It's not an important point, so I removed it.

Lennart Regebro 2009-07-13 08:33:13
