I'm looking to speed up my discovery process here quite a bit, as this is my first venture into the world of lexical analysis. Maybe this is even the wrong path. First, I'll describe my problem:

I've got very large properties files (on the order of 1,000 properties) which, when distilled, come down to about 15 important properties; the rest can be generated or rarely ever change.

So, for example:

general {
  name = myname
  ip = 127.0.0.1
}

component1 {
   key = value
   foo = bar
}

This is the type of format I want to create in order to tokenize something like:

property.${general.name}blah.home.directory = /blah
property.${general.name}.ip = ${general.ip}
property.${component1}.ip = ${general.ip}
property.${component1}.foo = ${component1.foo}

into

property.mynameblah.home.directory = /blah
property.myname.ip = 127.0.0.1
property.component1.ip = 127.0.0.1
property.component1.foo = bar

Lexical analysis and tokenization sound like my best route, but this is a very simple form of it. It's a simple grammar and a simple substitution, and I'd like to make sure that I'm not bringing a sledgehammer to knock in a nail.

I could create my own lexer and tokenizer, or ANTLR is a possibility, but I don't like reinventing the wheel and ANTLR sounds like overkill.

I'm not familiar with compiler techniques, so pointers in the right direction & code would be most appreciated.

Note: I can change the input format.

A: 

If you can change the format of the input files, then you could use a parser for an existing format, such as JSON.

However, from your problem statement it sounds like that isn't the case. So if you want to create a custom lexer and parser, use PLY (Python Lex/Yacc). It is easy to use and works the same as lex/yacc.

Here is a link to an example of a calculator built using PLY. Note that everything starting with t_ is a lexer rule - defining a valid token - and everything starting with p_ is a parser rule that defines a production of the grammar.
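
To give a feel for it, here is a minimal sketch of what just the lexer half might look like for the section format in your question (the token names and regexes are my own guesses, not taken from the calculator example); the p_ rules would then collect the key/value pairs into a dict per section:

import ply.lex as lex

# Token names PLY should know about (illustrative choices).
tokens = ('NAME', 'EQUALS', 'LBRACE', 'RBRACE')

t_EQUALS = r'='
t_LBRACE = r'\{'
t_RBRACE = r'\}'

def t_NAME(t):
    r'[A-Za-z0-9_./]+'
    # Covers bare words, dotted names, and values like 127.0.0.1 or /blah.
    return t

t_ignore = ' \t'             # skip spaces and tabs between tokens

def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)

def t_error(t):
    print("Illegal character %r" % t.value[0])
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input("general {\n  name = myname\n  ip = 127.0.0.1\n}\n")
for tok in lexer:
    print(tok)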

danben
+2  A: 

As simple as your format seems to be, I think a full-on parser/lexer would be way overkill. It seems like a combination of regexes and string manipulation would do the trick.

Another idea is to change the file to something like JSON or XML and use an existing package.
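
If you go the regex route, the whole substitution can be a single re.sub with a callback. A minimal sketch, assuming the sections have already been read into a nested dict (the props dict below is just illustrative data):

import re

# Illustrative data: the sections from the question, already parsed.
props = {
    'general': {'name': 'myname', 'ip': '127.0.0.1'},
    'component1': {'key': 'value', 'foo': 'bar'},
}

var_re = re.compile(r'\$\{([^}]+)\}')    # matches ${section.key} or ${section}

def expand(line):
    def lookup(match):
        parts = match.group(1).split('.')
        if len(parts) == 1:              # ${component1} -> just the section name
            return parts[0]
        section, key = parts
        return props[section][key]
    return var_re.sub(lookup, line)

print(expand('property.${general.name}.ip = ${general.ip}'))
# property.myname.ip = 127.0.0.1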

zdav
+1  A: 

A simple DFA works well for this. You only need a few states:

  1. Looking for ${
  2. Seen ${, looking for at least one valid character forming the name
  3. Seen at least one valid name character, looking for more name characters or }.

If the properties file is order-agnostic, you might want a two-pass processor to verify that each name resolves correctly.

Of course, you then need to write the substitution code, but once you have a list of all the names used, the simplest possible implementation is a find/replace on ${name} with its corresponding value.
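
A rough sketch of that scanner in Python (the state handling and the flat values dict keyed by the dotted names are my own illustration):

def find_names(text):
    """Yield every name that appears inside ${...}, using the three states above."""
    LOOKING, SAW_OPEN, IN_NAME = range(3)
    state, start, i = LOOKING, 0, 0
    while i < len(text):
        c = text[i]
        if state == LOOKING:
            if text.startswith('${', i):
                state, i = SAW_OPEN, i + 2
                continue
        elif state == SAW_OPEN:
            if c.isalnum() or c in '._':
                state, start = IN_NAME, i
            else:
                state = LOOKING          # "${" not followed by a name character
                continue                 # re-examine this character
        elif state == IN_NAME:
            if c == '}':
                yield text[start:i]
                state = LOOKING
            elif not (c.isalnum() or c in '._'):
                state = LOOKING          # malformed reference; keep scanning
                continue
        i += 1

def substitute(text, values):
    # Second pass: plain find/replace of each ${name} with its value;
    # a missing name raises KeyError, which doubles as the verification step.
    for name in set(find_names(text)):
        text = text.replace('${%s}' % name, values[name])
    return text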

Kaleb Pederson
A: 

The syntax you provide looks similar to the Mako templates engine. I think you could give it a try; it has a rather simple API.
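
For example (just a sketch; Mako evaluates ${general.name} as a Python expression, so the parsed sections would need to be exposed with attribute access, here via SimpleNamespace):

from types import SimpleNamespace
from mako.template import Template

# Illustrative stand-in for the parsed "general" section.
general = SimpleNamespace(name='myname', ip='127.0.0.1')

template = Template("property.${general.name}.ip = ${general.ip}")
print(template.render(general=general))
# property.myname.ip = 127.0.0.1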

Dmitry Kochkin
+2  A: 

There's an excellent article on Using Regular Expressions for Lexical Analysis at effbot.org.

Adapting the tokenizer to your problem:

import re

# One named group per token type; re.VERBOSE lets the pattern span several lines,
# and earlier alternatives are tried first when more than one could match.
token_pattern = r"""
(?P<identifier>[a-zA-Z_][a-zA-Z0-9_]*)
|(?P<integer>[0-9]+)
|(?P<dot>\.)
|(?P<open_variable>[$][{])
|(?P<open_curly>[{])
|(?P<close_curly>[}])
|(?P<newline>\n)
|(?P<whitespace>\s+)
|(?P<equals>[=])
|(?P<slash>[/])
"""

token_re = re.compile(token_pattern, re.VERBOSE)

class TokenizerException(Exception): pass

def tokenize(text):
    pos = 0
    while True:
        m = token_re.match(text, pos)    # try to match a token at the current position
        if not m: break
        pos = m.end()
        tokname = m.lastgroup            # name of the group that matched
        tokvalue = m.group(tokname)
        yield tokname, tokvalue
    if pos != len(text):                 # leftover text means an unrecognized character
        raise TokenizerException('tokenizer stopped at pos %r of %r' % (
            pos, len(text)))

To test it, we do:

stuff = r'property.${general.name}.ip = ${general.ip}'
stuff2 = r'''
general {
  name = myname
  ip = 127.0.0.1
}
'''

print(' stuff '.center(60, '='))
for tok in tokenize(stuff):
    print(tok)

print(' stuff2 '.center(60, '='))
for tok in tokenize(stuff2):
    print(tok)

which produces:

========================== stuff ===========================
('identifier', 'property')
('dot', '.')
('open_variable', '${')
('identifier', 'general')
('dot', '.')
('identifier', 'name')
('close_curly', '}')
('dot', '.')
('identifier', 'ip')
('whitespace', ' ')
('equals', '=')
('whitespace', ' ')
('open_variable', '${')
('identifier', 'general')
('dot', '.')
('identifier', 'ip')
('close_curly', '}')
========================== stuff2 ==========================
('newline', '\n')
('identifier', 'general')
('whitespace', ' ')
('open_curly', '{')
('newline', '\n')
('whitespace', '  ')
('identifier', 'name')
('whitespace', ' ')
('equals', '=')
('whitespace', ' ')
('identifier', 'myname')
('newline', '\n')
('whitespace', '  ')
('identifier', 'ip')
('whitespace', ' ')
('equals', '=')
('whitespace', ' ')
('integer', '127')
('dot', '.')
('integer', '0')
('dot', '.')
('integer', '0')
('dot', '.')
('integer', '1')
('newline', '\n')
('close_curly', '}')
('newline', '\n')
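
From there, a minimal sketch of folding the same token stream into a flat name-to-value table that a later substitution pass could use (the grouping rules assume the simple layout shown above):

def parse_sections(text):
    values = {}
    section, key, parts = None, None, []
    for tokname, tokvalue in tokenize(text):
        if tokname == 'whitespace':
            continue
        if tokname == 'open_curly':
            section = parts[-1]          # the identifier just before '{'
            parts = []
        elif tokname == 'close_curly':
            section, parts = None, []
        elif tokname == 'equals':
            key = parts[-1]              # the identifier just before '='
            parts = []
        elif tokname == 'newline':
            if section and key and parts:
                values['%s.%s' % (section, key)] = ''.join(parts)
            key, parts = None, []
        else:
            parts.append(tokvalue)       # identifiers, integers, dots, slashes
    return values

print(parse_sections(stuff2))
# {'general.name': 'myname', 'general.ip': '127.0.0.1'}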
Matt Anderson