I'm trying to write a C module to lexically analyse Python code. How can I do it?

+6  A: 

The complete, detailed specification for lexical analysis of Python code is the "Lexical analysis" chapter of the Python Language Reference.

As you can see, there are a lot of cases you need to cover. One thing that helps: you can always check whether your C-implemented lexical analyzer is correct for a given Python fragment, because it has to return exactly what the Python-implemented tokenize module in Python's standard library returns.
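A minimal sketch of that check, assuming Python 3: it dumps the token stream that tokenize produces for a fragment, so it can be compared with whatever your C lexer emits. The helper names `reference_tokens` and `first_mismatch` are made up for illustration, and how you get your C lexer's output into a list of `(name, string)` pairs is up to you.

```python
import io
import tokenize


def reference_tokens(source):
    """Return (token_name, token_string) pairs from the stdlib tokenize module."""
    readline = io.StringIO(source).readline
    return [
        (tokenize.tok_name[tok.type], tok.string)
        for tok in tokenize.generate_tokens(readline)
    ]


def first_mismatch(expected, actual):
    """Return the index of the first differing token, or None if both streams match.

    `actual` is assumed to be the (name, string) pairs reported by the C lexer.
    """
    for index, (exp, act) in enumerate(zip(expected, actual)):
        if exp != act:
            return index
    if len(expected) != len(actual):
        # One stream is a strict prefix of the other.
        return min(len(expected), len(actual))
    return None


if __name__ == "__main__":
    # Print the reference stream for a small fragment, one token per line.
    for name, text in reference_tokens("if x:\n    y = 1\n"):
        print(f"{name}\t{text!r}")
```

Running this over a directory of real Python files and diffing against your own output is probably the cheapest regression test you can get for the C version.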

As you can see in tokenize's sources, it's several hundred lines of Python, so you can easily extrapolate to needing thousands of lines of C -- definitely not a weekend project ;-)

Of course, as a starting point, you can fork Python's own Parser/tokenizer.c -- that's less than 2000 lines (amazingly short for what it does!), but that is in good part because it relies on quite a few other bits and pieces of Python's runtime (if your implementation needs to be stand-alone, you'll have to reproduce those).

If you're a very experienced programmer with a strong understanding of Python's codebase, and can sprint on this for all your waking hours, you might make it in a week or so. Under normal circumstances, I'd say that expecting a month of work would be a bit optimistic. What's your deadline?

Alex Martelli
I'd also ask why you want to do this in C rather than in Python.
Noufal Ibrahim