I was writing a setup.py for a Python package using setuptools and wanted to include a non-ASCII character in the long_description field:

#!/usr/bin/env python
from setuptools import setup
setup(...
      long_description=u"...", # in real code this value is read from a text file
      ...)

Unfortunately, passing a unicode object to setup() breaks both of the following commands with a UnicodeEncodeError:

python setup.py --long-description | rst2html
python setup.py upload

If I use a raw UTF-8 string for the long_description field, then the following command breaks with a UnicodeDecodeError:

python setup.py register
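Both failures can be reproduced without distutils. This is a minimal sketch, assuming the commands end up converting the value with Python 2's default ASCII codec (the register.py line quoted in the comments further down points that way). The conversions are spelled out explicitly so the snippet behaves the same on any Python version, and try_conversion is a helper name made up for this example.

```python
# Python 2 would perform these conversions implicitly with the ASCII codec;
# here they are written out so the failure is visible on any version.
text = u"bl\xe4h"            # unicode text containing a-umlaut
raw = text.encode("utf-8")   # the same text as UTF-8 encoded bytes


def try_conversion(func):
    """Return the name of the Unicode exception raised by func(), or None."""
    try:
        func()
    except UnicodeError as exc:
        return type(exc).__name__
    return None


# unicode -> bytes through ASCII: what printing or uploading a unicode value needs
encode_error = try_conversion(lambda: text.encode("ascii"))
# bytes -> unicode through ASCII: what unicode(value) does to a UTF-8 str
decode_error = try_conversion(lambda: raw.decode("ascii"))

print(encode_error)  # UnicodeEncodeError
print(decode_error)  # UnicodeDecodeError
```

Whichever type is passed, one of the two directions has to go through ASCII, which is why neither a unicode object nor a UTF-8 str satisfies every command.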

I generally release software by running 'python setup.py sdist register upload', which means ugly hacks that look into sys.argv and pass the right object type are right out.

In the end I gave up and implemented a different ugly hack:

class UltraMagicString(object):
    # Catch-22:
    # - if I return Unicode, python setup.py --long-description as well
    #   as python setup.py upload fail with a UnicodeEncodeError
    # - if I return a UTF-8 string, python setup.py sdist register
    #   fails with a UnicodeDecodeError

    def __init__(self, value):
        self.value = value

    def __str__(self):
        return self.value

    def __unicode__(self):
        return self.value.decode('UTF-8')

    def __add__(self, other):
        return UltraMagicString(self.value + str(other))

    def split(self, *args, **kw):
        return self.value.split(*args, **kw)

...

setup(...
      long_description=UltraMagicString("..."),
      ...)

Isn't there a better way?

+3  A: 
#!/usr/bin/env python
# -*- coding: utf-8 -*-

from setuptools import setup
setup(name="fudz",
      description="fudzily",
      version="0.1",
      long_description=u"bläh bläh".encode("UTF-8"), # in real code this value is read from a text file
      py_modules=["fudz"],
      author="David Fraser",
      author_email="[email protected]",
      url="http://en.wikipedia.org/wiki/Fudz",
      )

I'm testing with the above code. --long-description itself gives no error, only rst2html does; upload seems to work OK (although I cancel before actually uploading), and register asks me for a username, which I don't have. But the traceback in your comment is helpful: it's the automatic conversion to unicode in the register command that causes the problem.

See the illusive setdefaultencoding for more information on this. Basically, you want Python's default encoding to be able to convert your encoded string back to unicode, but it's tricky to set up: site.py deletes sys.setdefaultencoding at startup, so you have to reload(sys) to get it back. In this case I think it's worth the effort:

import sys
reload(sys).setdefaultencoding("UTF-8")

Or, to be more correct, you can take the encoding from the locale. There's commented-out code in /usr/lib/python2.6/site.py that does this, but I'll leave that discussion for now.
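A minimal sketch of looking up the locale's encoding, for illustration. It uses locale.getpreferredencoding from the standard library, which is not necessarily what the disabled site.py code does, and the helpers locale_encoding and to_text are names invented here. It decodes values explicitly and leaves the interpreter's default encoding alone.

```python
import codecs
import locale


def locale_encoding(fallback="utf-8"):
    """Return the locale's preferred encoding, or `fallback` if it is unusable."""
    name = locale.getpreferredencoding(False) or fallback
    try:
        codecs.lookup(name)   # make sure Python knows this codec
    except LookupError:
        name = fallback
    return name


def to_text(value, encoding=None):
    """Decode byte strings explicitly; pass anything else through unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding or locale_encoding())
    return value


print(locale_encoding())
print(repr(to_text(b"bl\xc3\xa4h", "utf-8")))
```

Note that in a C/POSIX locale the reported encoding is typically ASCII, in which case non-ASCII bytes still fail to decode unless an encoding is passed explicitly.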

David Fraser
I'm not sure I can paste the full traceback into a comment here; the traceback ends in /usr/lib/python2.6/distutils/command/register.py line 264 (in post_to_server) where it tries to do this: value = unicode(value).encode("utf-8"). As you can see, I'm using Python 2.6; a later version of distutils would have to be really bleeding edge stuff.
Marius Gedminas
You will note that reproduction requires you to actually have at least one non-ASCII character in the field.
Marius Gedminas
I can reproduce the 'python setup.py register' error with all three versions of Python that I have here: 2.4, 2.5 and 2.6.
Marius Gedminas
Adjusted my answer - that should help now
David Fraser
Either stackoverflow doesn't send me notifications when people do that, or I missed one. Thank you for the suggestion, the setdefaultencoding hack might actually work, if I could overcome my very strong conviction that changing the default encoding is the most evil thing you could ever do in a Python program. ;-)
Marius Gedminas
+1  A: 

You need to change your unicode long description u"bläh bläh bläh" to a normal string "bläh bläh bläh" and add an encoding header as the second line of your file:

#!/usr/bin/env python
# encoding: utf-8
...
...

Obviously, you need to save the file with UTF-8 encoding, too.

wbg
"If I use a raw UTF-8 string for the long_description field, then the following command breaks with a UnicodeDecodeError: python setup.py register"
Marius Gedminas
_Not_ a raw string (r"bläh bläh"), just a perfectly normal string in the source. It worked for me just typing the code. Make sure you're saving the file with UTF-8 encoding. You said you were loading the real long_description from a text file. It's possible you're not correctly decoding the text when you read it in from the file. Make sure to decode the text with the correct encoding for the text file.
wbg
I have similar problems to Marius. I have umlauts in a CHANGES.txt that I use for my long description. codecs.open(..., encoding=...), all the right things. But in the end, "setup.py --long-description" does a "print" and "setup.py upload" does a "unicode()". And unicode() of a UTF-8-encoded string fails and print of a unicode string fails. RAARGH. Marius: your dirty hack works like a charm.
Reinout van Rees
I should not have used the word "raw" to refer to str objects with UTF-8 encoded data, sorry. I can guarantee that the file on disk is UTF-8.
Marius Gedminas
+4  A: 

It is apparently a distutils bug that has been fixed in python 2.6: http://mail.python.org/pipermail/distutils-sig/2009-September/013275.html

Tarek suggests patching post_to_server: the patch should pre-process all values in the "data" argument, turn them into unicode, and then call the original method. See http://mail.python.org/pipermail/distutils-sig/2009-September/013277.html
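One way such a patch could look, as a rough sketch and not the code from the linked thread. FakeRegister is a hypothetical stand-in so the example runs without distutils; in a real setup.py the wrapper would be applied to the register command class instead. The helper names to_unicode and decode_data_first are also invented here, and UTF-8 is assumed as the encoding of byte-string values.

```python
import functools


def to_unicode(value, encoding="utf-8"):
    """Decode byte strings to text; leave everything else untouched."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def decode_data_first(method, encoding="utf-8"):
    """Wrap a post_to_server-style method so `data` values arrive as text."""
    @functools.wraps(method)
    def wrapper(self, data, *args, **kwargs):
        decoded = dict((key, to_unicode(value, encoding))
                       for key, value in data.items())
        return method(self, decoded, *args, **kwargs)
    return wrapper


class FakeRegister(object):
    """Hypothetical stand-in for the distutils register command."""

    def post_to_server(self, data, auth=None):
        # Stands in for the conversion quoted from register.py; it only
        # succeeds when every value is already text.
        return dict((key, value.encode("utf-8"))
                    for key, value in data.items())


# Replace the method with a version that decodes the data first.
FakeRegister.post_to_server = decode_data_first(FakeRegister.post_to_server)

result = FakeRegister().post_to_server({"description": b"bl\xc3\xa4h"})
print(result)
```

Because only the values handed to post_to_server are converted, long_description can stay a UTF-8 str for the commands that print or upload it.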

Reinout van Rees