ansaurus

Question

Answer 1

+2 A:

match() is anchored at the position where it starts. With a compiled pattern you can pass that position as an argument:

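pos = 0
end = len(text)
while pos < end:
    # regexp is a compiled pattern; match() is anchored at pos
    match = regexp.match(text, pos)
    if match is None:
        raise ValueError("no token matches at position %d" % pos)
    # do something with your match
    pos = match.end()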

Have a look at Pygments, which ships a large number of lexers for syntax highlighting with different implementations, most of them based on regular expressions.

Armin Ronacher 2008-09-25 15:38:49

How does this help?

Eli Bendersky 2008-09-25 15:52:28

How does what help? Anchoring? No need to slice the text.

Armin Ronacher 2008-09-25 19:36:05

I see. So I guess I'll be able to save the time slicing takes?

Eli Bendersky 2008-09-26 04:52:20

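Not only the time, but also the memory for the slice. What's also important is that if you pass a position instead of slicing, "^" and "$" will work as expected, because they still refer to the original text rather than to a slice of it.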

Armin Ronacher 2008-09-26 05:12:18
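
The difference between slicing and passing a position can be seen in a small experiment. This is only an illustrative sketch: the pattern names and the sample text are made up for the example and are not from the original thread.

```python
import re

text = "foo bar"
word_at_start = re.compile(r"^\w+")

# Slicing builds a new string, so "^" matches at the start of the slice.
sliced = word_at_start.match(text[4:])

# Passing pos keeps the original string, so "^" still means the real
# start of the text (index 0) and cannot match at index 4.
positioned = word_at_start.match(text, 4)

# Without "^", match(text, pos) is still anchored at pos by itself.
word = re.compile(r"\w+")
anchored = word.match(text, 4)
```

Here sliced and anchored both find "bar", while positioned is None. For a lexer loop this means the token patterns normally should not start with "^"; the anchoring comes from match() itself.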

Answer 2

+1 A:

This isn't exactly a direct answer to your question, but you might want to look at ANTLR. According to this document, the Python code generation target should be up to date.

As to your regexes, there are really two ways to speed things up if you're sticking with regexes. The first is to order your regexes by the probability of finding them in a typical text. You could figure this out by adding a simple profiler to the code that collects token counts for each token type, and then running the lexer on a body of work. The other solution is to bucket sort your regexes (since your key space, being a character, is relatively small) and then use an array or dictionary to select the regexes to try after performing a single discrimination on the first character.
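
Both ideas could be sketched roughly as follows. The token rules, the hand-written first-character sets, and the names RULES, BUCKETS and tokenize are assumptions made up for this illustration; a real lexer would need its own rules.

```python
import re
import string
from collections import Counter

# Hypothetical rules: (token name, characters a token may start with, regex).
RULES = [
    ("NUMBER", string.digits, re.compile(r"[0-9]+")),
    ("IDENT", string.ascii_letters + "_", re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
    ("OP", "+-*/=", re.compile(r"[+\-*/=]")),
    ("SPACE", string.whitespace, re.compile(r"\s+")),
]

# Bucket the rules by first character, so only plausible regexes are tried.
BUCKETS = {}
for name, first_chars, rx in RULES:
    for ch in first_chars:
        BUCKETS.setdefault(ch, []).append((name, rx))

def tokenize(text, counts=None):
    """Yield (name, lexeme) pairs; optionally count tokens per type."""
    pos = 0
    while pos < len(text):
        # A single dictionary lookup on the first character picks the bucket.
        for name, rx in BUCKETS.get(text[pos], ()):
            m = rx.match(text, pos)
            if m:
                break
        else:
            raise ValueError("no token matches at position %d" % pos)
        if counts is not None:
            counts[name] += 1
        yield name, m.group()
        pos = m.end()

counts = Counter()
tokens = list(tokenize("x = 42 + y1", counts))
```

After running this on a representative body of text, counts.most_common() gives the token types by frequency, which is the information needed to reorder the regexes.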

However, I think that if you're going to go this route, you should really try something like ANTLR which will be easier to maintain, faster, and less likely to have bugs.

Douglas Mayle 2008-09-25 15:40:47

Answer 3

+3 A:

You can merge all your regexes into one using the "|" operator and let the regex library do the work of discerning between tokens. Some care should be taken to ensure the preference of tokens (for example to avoid matching a keyword as an identifier).
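
The ordering issue can be illustrated with a small sketch. The keyword list and the group names KEYWORD and IDENT are hypothetical. Python tries alternatives from left to right and takes the first one that succeeds, so the keyword branch has to come first.

```python
import re

# Keyword branch first; \b stops "if" from matching inside "iffy".
keyword_first = re.compile(
    r"(?P<KEYWORD>\b(?:if|else|while)\b)|(?P<IDENT>[a-z_]\w*)"
)

# Same alternatives in the other order: the identifier branch always wins.
ident_first = re.compile(
    r"(?P<IDENT>[a-z_]\w*)|(?P<KEYWORD>\b(?:if|else|while)\b)"
)

kw = keyword_first.match("if")       # matched by the KEYWORD branch
ident = keyword_first.match("iffy")  # falls through to the IDENT branch
wrong = ident_first.match("if")      # keyword swallowed as an identifier
```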

Rafał Dowgird 2008-09-25 15:54:53

How do I make it return the right type for each one of the choices?

Eli Bendersky 2008-09-25 16:07:04

Use capturing groups. Enclosing part of a regex in parentheses makes it a capturing group that can be retrieved from the match object. For example, re.match("(a)|(b)", "b").groups() == (None, "b"). The first group didn't match; the second one matched "b".

Rafał Dowgird 2008-09-25 17:14:02

But I'll still have to linearly walk over the capture groups?

Eli Bendersky 2008-09-26 04:51:42

I think that using named capture groups together with the lastgroup attribute of the match object lets you avoid the walk. For example, re.match("(?P<ag>a)|(?P<bg>b)", "b").lastgroup == 'bg'.

Rafał Dowgird 2008-09-26 07:14:44
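
Putting the thread together, a lexer built on one combined regex and lastgroup might look like the sketch below. The token set and the names TOKEN_RE and tokenize are assumptions for illustration only.

```python
import re

# Hypothetical token set; alternatives are tried from left to right.
TOKEN_RE = re.compile(
    r"(?P<NUMBER>[0-9]+)"
    r"|(?P<IDENT>[A-Za-z_]\w*)"
    r"|(?P<OP>[+\-*/=])"
    r"|(?P<SPACE>\s+)"
)

def tokenize(text):
    """Yield (token type, lexeme) pairs for the whole text."""
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError("unexpected character %r at %d" % (text[pos], pos))
        # lastgroup names the alternative that matched, so there is
        # no need to walk over the groups.
        yield m.lastgroup, m.group()
        pos = m.end()

tokens = list(tokenize("x1 = 42"))
```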

Answer 4

+3 A:

It's possible that combining the token regexes will work, but you'd have to benchmark it. Something like:

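import re
x = re.compile('(?P<NUMBER>[0-9]+)|(?P<VAR>[a-z]+)')
m = x.match('9999')
if m:
    groups = m.groupdict()  # => {'NUMBER': '9999', 'VAR': None}
    # keep the one (name, text) pair whose group actually matched
    token = [item for item in groups.items() if item[1] is not None][0]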

The filter is where you'll have to do some benchmarking...

Update: I tested this, and it seems as though if you combine all the tokens as stated and write a function like:

def find_token(lst):
    # lst is a sequence of (name, text) pairs, e.g. groupdict().items()
    for tok in lst:
        if tok[1] is not None:
            return tok
    raise ValueError("no group matched")

You'll get roughly the same speed (maybe a teensy faster) for this. I believe the speedup must be in the number of calls to match, but the loop for token discrimination is still there, which of course kills it.
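
To repeat the measurement, something like the following timeit sketch could be used. It compares the loop over the groups with the lastgroup attribute mentioned in another answer; the function names and the sample input are made up, and the timings will depend on the machine and on the number of token types.

```python
import re
import timeit

x = re.compile('(?P<NUMBER>[0-9]+)|(?P<VAR>[a-z]+)')

def by_loop(text):
    """Find the matching group by walking over all named groups."""
    m = x.match(text)
    if m is None:
        raise ValueError("no token matches")
    for tok in m.groupdict().items():
        if tok[1] is not None:
            return tok
    raise ValueError("no group matched")

def by_lastgroup(text):
    """Let the regex engine report which named group matched."""
    m = x.match(text)
    if m is None:
        raise ValueError("no token matches")
    return m.lastgroup, m.group()

if __name__ == "__main__":
    for fn in (by_loop, by_lastgroup):
        seconds = timeit.timeit(lambda: fn("abc"), number=100000)
        print(fn.__name__, seconds)
```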

Andrew Gwozdziewycz 2008-09-25 19:24:13

Answer 5

A:

These are not so simple, but may be worth looking at...

The Python module pyparsing (pyparsing.wikispaces.com) lets you specify a grammar and then use it to parse text. Douglas, thanks for the post about ANTLR; I hadn't heard of it. There's also PLY, a Python 2 and Python 3 compatible implementation of lex/yacc.

I wrote an ad-hoc regex-based parser myself at first, but later realized that I might benefit from using a mature parsing tool and learning concepts such as context-free grammars.

The advantage of using a grammar for parsing is that you can easily modify the rules and formalize quite complex syntax for whatever you are parsing.

Evgeny 2009-06-05 01:04:16


Simple regex-based lexer in Python
