I am working on some code that has to manipulate unicode strings. I am trying to write doctests for it, but am having trouble. The following is a minimal example that illustrates the problem:

# -*- coding: utf-8 -*-
def mylen(word):
  """
  >>> mylen(u"áéíóú")
  5
  """
  return len(word)

print mylen(u"áéíóú")

First we run the code to see the expected output of print mylen(u"áéíóú").

$ python mylen.py
5

Next, we run doctest on it to see the problem.

$ python -m doctest mylen.py
5
**********************************************************************
File "mylen.py", line 4, in mylen.mylen
Failed example:
    mylen(u"áéíóú")
Expected:
    5
Got:
    10
**********************************************************************
1 items had failures:
   1 of   1 in mylen.mylen
***Test Failed*** 1 failures.

How then can I test that mylen(u"áéíóú") evaluates to 5?

+1  A: 

This appears to be a known and as yet unresolved issue in Python. See open issues here and here.
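As a side note on where the 10 comes from: it matches the number of UTF-8 bytes in the string, which is consistent with the example inside a plain (byte) docstring being read one byte per character. The following is only a sketch of that arithmetic, not of what doctest does internally; the variable names are made up for illustration.

```python
# -*- coding: utf-8 -*-
# Illustration only: variable names are hypothetical, not from the question.
word = u"áéíóú"

# Five code points, as the question expects.
print(len(word))

# Each accented vowel takes two bytes in UTF-8, so ten bytes in total.
utf8_bytes = word.encode("utf-8")
print(len(utf8_bytes))

# Reading those bytes one-per-character (as Latin-1 does) gives ten
# characters, the same number doctest reported.
misread = utf8_bytes.decode("latin-1")
print(len(misread))
```

This runs the same way on Python 2 and Python 3.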

Not surprisingly, it can be modified to work OK in Python 3 since all strings are Unicode there:

def mylen(word):
  """
  >>> mylen("áéíóú")
  5
  """
  return len(word)

print(mylen("áéíóú"))
Ned Deily
Fair enough, this is probably the better general solution. However, in my case I am still constrained to Python 2.x due to dependencies on matplotlib and numpy.
saffsd
+2  A: 

If you want unicode strings, you have to use unicode docstrings! Mind the u!

# -*- coding: utf-8 -*-
def mylen(word):
  u"""        <----- SEE 'u' HERE
  >>> mylen(u"áéíóú")
  5
  """
  return len(word)

print mylen(u"áéíóú")

This will work -- as long as the tests pass. For Python 2.x you need yet another hack to make verbose doctest mode work or get correct tracebacks when tests fail:

if __name__ == "__main__":
    import sys
    reload(sys)
    sys.setdefaultencoding("UTF-8")
    import doctest
    doctest.testmod()

NB! Only ever use setdefaultencoding for debug purposes. I'd accept it for doctest use, but not anywhere in your production code.

kaizer.se
Thanks! This approach won't work with any package that auto-discovers tests on Python 2.x though.
saffsd
+1  A: 

My solution was to escape the non-ASCII characters, as in u'\xe1\xe9\xed\xf3\xfa'. That is harder to read, but my tests had only a few non-ASCII characters, so in those cases I put a description to the side as a comment, like "# n with tilde".
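Applied to the example from the question, that might look like the sketch below. The raw docstring (the r prefix) is my own assumption, added so that the backslash escapes reach doctest unchanged; it is not something the approach strictly requires. The u"" literal needs Python 2 or Python 3.3+.

```python
def mylen(word):
    r"""
    >>> mylen(u"\xe1\xe9\xed\xf3\xfa")  # a, e, i, o, u with acute accents
    5
    """
    return len(word)

# Direct call with the same escaped literal, outside of doctest.
print(mylen(u"\xe1\xe9\xed\xf3\xfa"))
```

Because the source file now contains only ASCII, the coding declaration at the top is no longer needed for this test.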

Andrew Dalke
Thanks! Unfortunately this approach breaks 'make doctest' with Sphinx, which fails with: 'utf8' codec can't decode bytes in position ...: invalid data.
saffsd
Hmmm. Well, I'm using it for my own doctests. Sorry, but I don't know what's going on here.
Andrew Dalke