When profiling our code I was surprised to find millions of calls to
C:\Python26\lib\encodings\utf_8.py:15(decode)

I started debugging and found that across our code base there are many small bugs, usually comparing a string to a unicode or adding a string and a unicode. Python graciously decodes the byte strings and then performs the operation in unicode.

How kind. But expensive!

I am fluent in unicode, having read Joel Spolsky and Dive Into Python...

I try to keep our code internals in unicode only.

My question - can I turn off this pythonic nice-guy behavior? At least until I find all these bugs and fix them (usually by adding a u prefix to a string literal)?

Some of them are extremely hard to find (a variable that is sometimes a string...).

Python 2.6.5 (and I can't switch to 3.x).

+3  A: 

The following should work:

>>> import sys
>>> reload(sys)
<module 'sys' (built-in)>
>>> sys.setdefaultencoding('undefined')
>>> u"abc" + u"xyz"
u'abcxyz'
>>> u"abc" + "xyz"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/encodings/undefined.py", line 22, in decode
    raise UnicodeError("undefined encoding")
UnicodeError: undefined encoding

The reload(sys) in the snippet above is needed only because site.py removes sys.setdefaultencoding at startup. Normally the sys.setdefaultencoding call is supposed to go in a sitecustomize.py file in your Python site-packages directory, where the function is still available (it's advisable to do that).
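A minimal sketch of what such a sitecustomize.py could look like. The helper name disable_implicit_decoding is made up for illustration. The sketch assumes site.py imports sitecustomize before it deletes sys.setdefaultencoding, which is how the stock Python 2 site.py is ordered; on an interpreter where the attribute is missing, it does nothing.

```python
# sitecustomize.py (sketch; put it somewhere on sys.path, e.g. site-packages)
import sys


def disable_implicit_decoding():
    """Make implicit str <-> unicode conversion raise instead of decoding.

    Hypothetical helper, not part of the standard library.
    Returns True if the default encoding was changed, False otherwise.
    """
    # While sitecustomize is being imported on Python 2, site.py has not
    # yet deleted sys.setdefaultencoding, so no reload(sys) is needed here.
    setter = getattr(sys, "setdefaultencoding", None)
    if setter is None:
        # Attribute already removed (or never present): leave things alone.
        return False
    # 'undefined' is a codec that raises UnicodeError on every use.
    setter("undefined")
    return True


disable_implicit_decoding()
```

With this in place every process started by that interpreter raises on mixed str/unicode operations, so it is probably best kept to a development environment and removed once the bugs are fixed.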

ChristopheD
Oh wow. I love it. Can you explain a little more how the `reload()` does its magic? How and why does it nullify the `sitecustomize.py` setting?
jcdyer
On my Apple Python 2.6 build (but I've seen this elsewhere...) `site.py` (in your std python lib dir; executed once automagically at Python startup) contains (near the end): `if hasattr(sys, "setdefaultencoding"): del sys.setdefaultencoding`. This makes the attribute unavailable on `sys` unless you explicitly choose to `reload(sys)` (or comment out the deletion). It used to be available directly in earlier Pythons iirc.
ChristopheD
Very cool - thank you! Pydev and Pylint hate you, but it works! ...and I found a truckload of "bugs" in a few minutes, some of them in the Python source code! (They are not exactly bugs because the code works, it just works a little better after I fix it.) CSV files: split(u'\t') needed the little 'u'. Dictionary keys are not exactly unicode in 2.6... - who would have thunk?!?! Thank you!
Tal Weiss