Hello,

Please help

There are many tokens in the tokenize module, such as STRING, BACKQUOTE, AMPEREQUAL, etc.

>>> import cStringIO
>>> import tokenize
>>> source = "{'test':'123','hehe':['hooray',0x10]}"
>>> src = cStringIO.StringIO(source).readline
>>> src = tokenize.generate_tokens(src)
>>> src
<generator object at 0x00BFBEE0>
>>> src.next()
(51, '{', (1, 0), (1, 1), "{'test':'123','hehe':['hooray',0x10]}")
>>> token = src.next()
>>> token
(3, "'test'", (1, 1), (1, 7), "{'test':'123','hehe':['hooray',0x10]}")
>>> token[0]
3
>>> tokenize.STRING
3
>>> tokenize.AMPER
19
>>> tokenize.AMPEREQUAL
42
>>> tokenize.AT
50
>>> tokenize.BACKQUOTE
25

This is what I experimented with, but I was not able to find out what these values mean. Where can I learn about them? I need a solution soon.

+2  A: 

You will need to read Python's tokenizer source, tokenizer.c, to understand the details. Just search for the keyword you want to know about; it should not be hard.

kcwu
+1  A: 

Python's lexical analysis (including tokens) is documented at http://docs.python.org/reference/lexical_analysis.html. As http://docs.python.org/library/token.html#module-token says, "Refer to the file Grammar/Grammar in the Python distribution for the definitions of the names in the context of the language grammar."
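To put that documentation to use, the standard token module ships a tok_name dictionary that maps each numeric token id back to its readable name. A minimal sketch (the exact numeric values differ between Python versions, so always look them up rather than hard-coding them):

```python
import token

# tok_name maps each numeric token id to its symbolic name.
print(token.tok_name[token.STRING])  # prints 'STRING'
print(token.tok_name[token.AMPER])   # prints 'AMPER'

# Dump a few id -> name pairs to see how the table is laid out:
for tok_id in sorted(token.tok_name)[:5]:
    print(tok_id, token.tok_name[tok_id])
```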

Alex Martelli
A: 

The various AMPER, BACKQUOTE, etc. values correspond to the token number of the appropriate symbol for Python tokens/operators, i.e. AMPER = "&" (ampersand), AMPEREQUAL = "&=".

However, you don't actually have to care about these. They're used by the internal C tokeniser, but the Python wrapper simplifies the output, translating all operator symbols to the OP token. You can translate the numeric token ids (the first value in each token tuple) to their symbolic names using the token module's tok_name dictionary. For example:

>>> import tokenize, token
>>> s = "{'test':'123','hehe':['hooray',0x10]}"
>>> for t in tokenize.generate_tokens(iter([s]).next):
        print token.tok_name[t[0]],

OP STRING OP STRING OP STRING OP OP STRING OP NUMBER OP OP ENDMARKER

As a quick debug statement to describe the tokens a bit better, you could also use tokenize.printtoken. This is undocumented, and looks like it isn't present in Python 3, so don't rely on it for production code, but as a quick peek at what the tokens mean, you may find it useful:

>>> for t in tokenize.generate_tokens(iter([s]).next):
        tokenize.printtoken(*t)

1,0-1,1:        OP      '{'
1,1-1,7:        STRING  "'test'"
1,7-1,8:        OP      ':'
1,8-1,13:       STRING  "'123'"
1,13-1,14:      OP      ','
1,14-1,20:      STRING  "'hehe'"
1,20-1,21:      OP      ':'
1,21-1,22:      OP      '['
1,22-1,30:      STRING  "'hooray'"
1,30-1,31:      OP      ','
1,31-1,35:      NUMBER  '0x10'
1,35-1,36:      OP      ']'
1,36-1,37:      OP      '}'
2,0-2,0:        ENDMARKER       ''

The various values in the tuple you get back for each token are, in order:

  1. Token id (corresponds to the type, e.g. STRING, OP, NAME, etc.)
  2. The string - the actual token text for this token, e.g. "&" or "'a string'"
  3. The start (line, column) in your input
  4. The end (line, column) in your input
  5. The full text of the line the token is on.
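Putting those five fields together, the tuples can be unpacked directly in the loop. A sketch written for modern Python 3, where io.StringIO replaces cStringIO and the generator is consumed with iteration rather than the .next() method used above:

```python
import io
import token
import tokenize

source = "{'test':'123','hehe':['hooray',0x10]}"

# Each token is a 5-tuple: (id, text, start, end, source_line).
for tok_id, text, start, end, line in tokenize.generate_tokens(io.StringIO(source).readline):
    print(token.tok_name[tok_id], repr(text), start, end)
```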
Brian