When attempting to tokenize a string in Python 3.0, why do I get a leading 'utf-8' token before the real tokens start?

According to the Python 3 docs, tokenize should now be used as follows:

g = tokenize(BytesIO(s.encode('utf-8')).readline)

However, when attempting this at the terminal, the following happens:

>>> from tokenize import tokenize
>>> from io import BytesIO
>>> g = tokenize(BytesIO('foo'.encode()).readline)
>>> next(g)
(57, 'utf-8', (0, 0), (0, 0), '')
>>> next(g)
(1, 'foo', (1, 0), (1, 3), 'foo')
>>> next(g)
(0, '', (2, 0), (2, 0), '')
>>> next(g)

What's with the utf-8 token that precedes the others? Is this supposed to happen? If so, should I just always skip the first token?

[edit]

I have found that token type 57 is tokenize.ENCODING, which can easily be filtered out of the token stream if need be.
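
For example, a minimal sketch of such a filter (the helper name is mine, not anything standard):

from io import BytesIO
from tokenize import tokenize, ENCODING

def tokens_without_encoding(source):
    # Tokenize a str and drop the leading ENCODING token.
    stream = tokenize(BytesIO(source.encode('utf-8')).readline)
    return (tok for tok in stream if tok[0] != ENCODING)

for tok in tokens_without_encoding('foo'):
    print(tok)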

+2  A: 

That's the coding cookie of the source. tokenize always emits an ENCODING token first so you know how the raw bytes were decoded. You can specify the encoding explicitly:

# -*- coding: utf-8 -*-
do_it()

Otherwise Python assumes the default encoding, utf-8 in Python 3.
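
As a quick check (not part of the original answer), a latin-1 cookie shows up in that first token:

from io import BytesIO
from tokenize import tokenize

source = b"# -*- coding: latin-1 -*-\ndo_it()\n"
first = next(tokenize(BytesIO(source).readline))
print(first[1])  # prints 'iso-8859-1' -- tokenize normalizes 'latin-1'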

Benjamin Peterson