The following sample code:

import token, tokenize, StringIO

def generate_tokens(src):
    # Wrap the source string in a file-like object so tokenize can
    # consume it line by line via readline.
    rawstr = StringIO.StringIO(unicode(src))
    tokens = tokenize.generate_tokens(rawstr.readline)
    for i, item in enumerate(tokens):
        toktype, toktext, (srow, scol), (erow, ecol), line = item
        print i, token.tok_name[toktype], toktext

s = \
"""
 def test(x):
     \"\"\" test with an unterminated docstring
"""

generate_tokens(s)

causes the following error to be raised:

... (stripped a little)
File "/usr/lib/python2.6/tokenize.py", line 296, in generate_tokens
    raise TokenError, ("EOF in multi-line string", strstart)
tokenize.TokenError: ('EOF in multi-line string', (3, 5))

Some questions about this behaviour:

  1. Should I catch and 'selectively' ignore tokenize.TokenError here? Or should I stop trying to generate tokens from non-compliant/incomplete code? If so, how would I check for that?
  2. Can this error (or similar errors) be caused by anything other than an unterminated docstring?
A:

How you handle tokenize errors depends entirely on why you are tokenizing. Your code gives you all the valid tokens up until the beginning of the bad string literal. If that token stream is useful to you, then use it.

You have a few options for handling the error:

  1. You could ignore it and have an incomplete token stream.

  2. You could buffer all the tokens and only use the token stream if no error occurred (see the sketch after this list).

  3. You could process the tokens, but abort the higher-level processing if an error occurred.
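
For instance, here is a minimal sketch of option 2. The collect_tokens helper is a name of my own, not anything from the standard library; it buffers the stream and discards it entirely if tokenization fails:

import tokenize, StringIO

def collect_tokens(src):
    # Buffer every token; return None if the source does not tokenize cleanly.
    rawstr = StringIO.StringIO(unicode(src))
    buffered = []
    try:
        for tok in tokenize.generate_tokens(rawstr.readline):
            buffered.append(tok)
    except tokenize.TokenError:
        return None  # option 2: discard the partial stream
    return buffered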

As to whether that error can happen with anything other than an incomplete docstring, yes. Remember that docstrings are just string literals. Any unterminated multi-line string literal will give you the same error. Similar errors could happen for other lexical errors in the code.

For example, here are other values of s that produce errors (at least with Python 2.5):

s = ")"  # EOF in multi-line statement
s = "("  # EOF in multi-line statement
s = "]"  # EOF in multi-line statement
s = "["  # EOF in multi-line statement
s = "}"  # EOF in multi-line statement
s = "{"  # EOF in multi-line statement

Oddly, other nonsensical inputs produce ERRORTOKEN values instead:

s = "$"
s = "'"
Ned Batchelder
Thanks! This was the type of information I was looking for. I was hoping there was a way to intercept (and ignore) these tokenize errors so the tokenizer would not stop parsing the rest, letting me (in the end) exclude non-valid 'blocks' based on the indent/dedent tokens. But it's probable, and reasonable, that the generator is in too inconsistent/unpredictable a state to 'continue' tokenizing...
ChristopheD
Exactly. There's no reasonable next token after a bogus string, especially if it's read the entire rest of the file before determining there's an error.
Ned Batchelder