views:

64

answers:

3

How can I automate a test to enforce that a body of Python 2.x code contains no string instances (only unicode instances)?

Eg.

Can I do it from within the code?

Is there a static analysis tool that has this feature?

+1  A: 

It seems to me like you really need to parse the code with an honest to goodness python parser. Then you will need to dig through the AST your parser produces to see if it contains any string literals.

It looks like Python comes with a parser out of the box. From this documentation I got this code sample working:

import parser
from token import tok_name

def checkForNonUnicode(codeString):
    return checkForNonUnicodeHelper(parser.suite(codeString).tolist())

def checkForNonUnicodeHelper(lst):
    returnValue = True
    nodeType = lst[0]
    if nodeType in tok_name and tok_name[nodeType] == 'STRING':
        stringValue = lst[1]
        if stringValue[0] != "u": # Kind of hacky. Does this always work?
            print "%s is not unicode!" % stringValue
            returnValue = False

    else:
        for subNode in [lst[n] for n in range(1, len(lst))]:
            if isinstance(subNode, list):
                returnValue = returnValue and checkForNonUnicodeHelper(subNode)

    return returnValue

print checkForNonUnicode("""
def foo():
    a = 'This should blow up!'
""")
print checkForNonUnicode("""
def bar():
    b = u'although this is ok.'
""")

which prints out

'This should blow up!' is not unicode!
False
True

Now doc strings aren't unicode but should be allowed, so you might have to do something more complicated like from symbol import sym_name where you can look up which node types are for class and function definitions. Then the first sub-node that's simply a string, i.e. not part of an assignment or whatever, should be allowed to not be unicode.

Good question!

Edit

Just a follow up comment. Conveniently for your purposes, parser.suite does not actually evaluate your python code. This means that you can run this parser over your Python files without worrying about naming or import errors. For example, let's say you have myObscureUtilityFile.py that contains

from ..obscure.relative.path import whatever

You can

checkForNonUnicode(open('/whoah/softlink/myObscureUtilityFile.py').read())
Dave Aaron Smith
You don't need to parse the code. Just producing the lexemes should be enough; if any lexeme isn't Unicode your file has failed the test. If your file contains "outside references" (e.g., from_future), then you can't know without parsing *all* the files involved, but I suspect this isn't part of your problem definition.
Ira Baxter
A: 

You can't enforce that all strings are Unicode; even with from __future__ import unicode_literals in a module, byte strings can be written as b'...', as they can in Python 3.

There was an option that could be used to get the same effect as unicode_literals globally: the command-line option -U. However it was abandoned early in the 2.x series because it basically broke every script.

What is your purpose for this? It is not desirable to abolish byte strings. They are not “bad” and Unicode strings are not universally “better”; they are two separate animals and you will need both of them. Byte strings will certainly be needed to talk to binary files and network services.

If you want to be prepared to transition to Python 3, the best tack is to write b'...' for all the strings you really mean to be bytes, and u'...' for the strings that are inherently Unicode. The default string '...' format can be used for everything else, places where you don't care and/or whether Python 3 changes the default string type.

bobince
The purpose is that I'm writing a multilingual application which has to be in 2.5, and I keep forgetting to type the `u` on strings that happen not to need unicode but may do if they get edited. I understand that regular strings are fine in many cases, but in this case I need something to help me be consistent and clearly express my intention.
Ian Mackinnon