I am using Ply to interpret a FORTRAN format string. I am having trouble writing a regex to match the 'H' edit descriptor, which has the form

xHccccc ...

where x specifies the number of characters to read in after the 'H'

Ply matches each token with a single regular expression, but I am having trouble writing a regular expression that does the above. I am looking for something like

(\d+)[Hh].{\1}

where \1 is parsed as an integer and evaluated as part of the regex - however it isn't.

It seems that a number matched earlier cannot be used as a repetition count later in the same regex. Is this the case?
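A quick check with Python's re module (which Ply uses for its token patterns) suggests why the pattern above does not behave as hoped: `{\1}` is not a valid repetition count, so the braces appear to be taken as literal characters and `\1` as an ordinary backreference to the matched digits. A minimal sketch, with the test strings made up for illustration:

```python
import re

# The pattern from the question. Because {\1} is not a valid {n} quantifier,
# Python's re treats "{" and "}" as literal characters and \1 as a plain
# backreference to the text captured by (\d+).
pattern = re.compile(r"(\d+)[Hh].{\1}")

# A real H descriptor is not matched ...
hollerith_match = pattern.fullmatch("3Habc")

# ... but this odd string is: digits, H, one character, then a literal "{3}".
literal_match = pattern.fullmatch("3Hx{3}")

print(hollerith_match)            # None
print(literal_match is not None)  # True
```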

Does anyone have any other solutions that might use Ply?

+2  A: 

Regex can't do things like that. You can hack it though:

(1[Hh].|2[Hh]..|3[Hh]...|etc...)

Ugly!
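If you go this route, the alternation does not have to be typed by hand. A rough sketch that generates it, assuming an upper bound on the count (`MAX_LEN` is an illustrative name and value, not anything from the FORTRAN standard):

```python
import re

MAX_LEN = 99  # assumed upper bound on the character count; adjust as needed

# One alternative per length: "1[Hh].{1}", "2[Hh].{2}", "3[Hh].{3}", ...
alternatives = [r"%d[Hh].{%d}" % (n, n) for n in range(1, MAX_LEN + 1)]
hollerith_pattern = "|".join(alternatives)
hollerith_re = re.compile(hollerith_pattern)

# Anchored at the current position, the way a lexer would try it.
m = hollerith_re.match("5HHELLO, WORLD")
print(m.group())  # 5HHELLO
```

The resulting `hollerith_pattern` string could presumably be handed to a Ply token rule in place of a hand-written regex. Counts written with leading zeros or larger than `MAX_LEN` are not covered.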

Mark Byers
Crude and limited, but effective. Good idea for a one off.
dmckee
A: 

This is what comes of thinking that regexps can replace a lexer.

Short version: regular expressions can only deal with that small subset of all possible languages termed "regular" (big surprise, I know). But "regular" does not correspond to the human understanding of "simple", so even very simple languages can have non-regular expressions.

Writing a lexer for a simple language is not terribly hard.

The canonical Stack Overflow question for resources on the topic is Learning to write a compiler.


Ah. I seem to have misunderstood the question. Mea Culpa.

I'm not familiar with Ply, and it's been a while since I used flex, but I think you would match the leading digits and the 'H' with the regex, then consume the following characters in the associated code block and check there that the rules had been obeyed.
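The two-step idea can be sketched in plain Python, independent of any lexer library: a regex recognizes only the count and the 'H', and ordinary code then takes that many characters. The function name `read_hollerith` and the sample input are made up for illustration.

```python
import re

# Step 1: the regex only recognizes the header, i.e. the count and the 'H'.
HEADER = re.compile(r"(\d+)[Hh]")

def read_hollerith(text, pos=0):
    """Return (payload, next_pos) for an H descriptor at text[pos:], else None."""
    m = HEADER.match(text, pos)
    if m is None:
        return None
    # Step 2: plain code consumes exactly `count` characters after the 'H'.
    count = int(m.group(1))
    start = m.end()
    end = start + count
    if end > len(text):
        raise ValueError("H descriptor runs past the end of the input")
    return text[start:end], end

payload, next_pos = read_hollerith("22HThis is a test string.,I5")
print(payload)   # This is a test string.
print(next_pos)  # 25
```

In Ply the same logic would presumably sit inside a token function whose regex is `\d+[Hh]`, reading the payload from `t.lexer.lexdata` and advancing `t.lexer.lexpos` past it before returning the token; treat that as an untested assumption rather than a worked recipe.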

dmckee
Ply is a Python library that implements lex and yacc style rules within Python for creating a lexer/parser. I was under the impression that using lex/yacc would save me a lot of tedious coding when writing parsers.
Brendan
A: 

Pyparsing includes an adaptive expression that is very similar to this, called countedArray. countedArray(expr) parses a leading integer 'n' and then parses 'n' instances of expr, returning the whole array as a single list. The way this works is that countedArray parses a leading integer expression, followed by an uninitialized Forward expression. The leading integer expression has a parse action attached that assigns the following Forward to 'n'*expr. The pyparsing parser then continues on, and parses the following 'n' expr's. So it is sort of a self-modifying parser.

To parse your expression, this would look something like:

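from pyparsing import Word, Forward, nums, printables

integer = Word(nums).setParseAction(lambda t: int(t[0]))
following = Forward()
def set_length(t):
    # Bind the Forward to exactly t[0] characters; returning None keeps the tokens unchanged.
    following << Word(printables + " ", exact=t[0])
integer.addParseAction(set_length)
H_expr = integer + 'H' + following
print(H_expr.parseString("22HThis is a test string.This is not in the string"))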

Prints:

[22, 'H', 'This is a test string.']

If Ply has something similar, perhaps you could use this technique.

Paul McGuire
Thanks, that's useful to know. I had considered Pyparsing earlier but decided to go with the more UNIX'ey old school lex/yacc Ply way - and now the parser is all but written save this last detail!
Brendan