I have some source code in Fortran (the language is almost irrelevant) and I want to parse out the function names and their arguments.

For example, using

(\w+)\([^\(\)]+\)

with

a(b(1 + 2 * 2), c(3,4))

I get the following (as expected):

b, 1 + 2 * 2
c, 3,4

whereas I need

a, b(1 + 2 * 2), c(3,4)
b, 1 + 2 * 2
c, 3,4

Any suggestions?

Thanks for your time...

+2  A: 

I don't think this is a job for regular expressions; they can't really handle nested patterns.

This is because regexes are compiled into FSMs (finite state machines). An FSM cannot parse arbitrarily nested expressions, because it would need an unbounded number of states to keep track of the nesting depth. Also see this SO thread.
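A minimal sketch of the point above, not part of the original answer. It assumes a second capture group is added to the question's pattern so the arguments are returned too, and `max_nesting_depth` is a hypothetical helper name. The pattern only ever sees the innermost calls, while tracking nesting takes a counter that can grow without bound:

```python
import re

# The question's pattern, with a second group added (an assumption)
# so that the argument text is captured as well.
INNERMOST_CALL = re.compile(r"(\w+)\(([^()]+)\)")


def max_nesting_depth(text):
    """Return the deepest parenthesis nesting level found in text."""
    depth = 0
    deepest = 0
    for ch in text:
        if ch == "(":
            depth += 1
            deepest = max(deepest, depth)
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced ')'")
    if depth != 0:
        raise ValueError("unbalanced '('")
    return deepest


source = "a(b(1 + 2 * 2), c(3,4))"
innermost = INNERMOST_CALL.findall(source)  # only b(...) and c(...) match
depth = max_nesting_depth(source)
```

The counter is the piece a fixed set of states cannot provide: every extra level of nesting needs one more value of `depth`.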

David Zaslavsky
+2  A: 

This is a nonlinear grammar: you need to be able to recurse on a set of allowed rules. Look at pyparsing to do simple CFG (context-free grammar) parsing via readable specifications.

It's been a while since I've written out CFGs, and I'm probably rusty, so I'll refer you to the Python EBNF to get an idea of how you can construct one for a subset of a language syntax.

Edit: If the example will always be simple, you can code a small state machine class/function that iterates over the tokenized input string, as @Devin Jeanpierre suggests.

cdleary
+2  A: 

It can be done with regular expressions: use them to tokenize the string, and then work with the tokens. For example, see re.Scanner. Alternatively, just use pyparsing.
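A rough sketch of the tokenize-then-track idea, under some assumptions: `re.Scanner` is undocumented, so its behaviour here is as observed in CPython rather than guaranteed; the token names and the `extract_calls` helper are made up for illustration; and a call is only recognised when the name is immediately followed by `(`, with no whitespace in between.

```python
import re

# Each lexicon entry is (pattern, action); the action gets (scanner, text).
scanner = re.Scanner([
    (r"[A-Za-z_]\w*", lambda s, t: ("NAME", t)),
    (r"\(", lambda s, t: ("OPEN", t)),
    (r"\)", lambda s, t: ("CLOSE", t)),
    (r"[^()A-Za-z_]+", lambda s, t: ("TEXT", t)),  # numbers, operators, spaces
])


def extract_calls(source):
    """Return (name, argument_text) pairs, outermost call first."""
    tokens, remainder = scanner.scan(source)
    if remainder:
        raise ValueError(f"cannot tokenize: {remainder!r}")
    calls = []
    stack = []   # one (slot, name, args_start) entry per open parenthesis
    pos = 0      # offset of the current token within source
    prev = None  # previous (kind, text) token
    for kind, text in tokens:
        if kind == "OPEN":
            if prev is not None and prev[0] == "NAME":
                calls.append(None)  # reserve a slot to keep outer-first order
                stack.append((len(calls) - 1, prev[1], pos + 1))
            else:
                stack.append((None, None, pos + 1))  # plain grouping parens
        elif kind == "CLOSE":
            if not stack:
                raise ValueError("unbalanced ')'")
            slot, name, start = stack.pop()
            if slot is not None:
                calls[slot] = (name, source[start:pos])
        pos += len(text)
        prev = (kind, text)
    if stack:
        raise ValueError("unbalanced '('")
    return calls


result = extract_calls("a(b(1 + 2 * 2), c(3,4))")
```

The regexes only split the input into tokens; the stack is what keeps track of the nesting.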

Devin Jeanpierre
Right, you can do it by tokenizing and using your own state machine, but that's technically not just using regular expressions.
cdleary
Thing is that I didn't see "just regex" in the question, only "regex". The re module includes a Scanner, and has for ages-- not that it's documented (bleh).
Devin Jeanpierre
Oooh cool, undocumented features! +1 from me for telling me to look. :-)
cdleary
+1  A: 

You can't do this with regular expressions alone, because the problem is recursive. You should first match the outermost function and its arguments, print the name of the function, then do the same (match the function name, then its arguments) within each of its arguments. Regexes alone are not enough.
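One possible way to sketch that recursion, as an illustration rather than the answerer's own code. The helper name `find_calls` is hypothetical, and like the question's pattern it treats any run of `\w` characters directly before `(` as a function name:

```python
import re

NAME_BEFORE_PAREN = re.compile(r"(\w+)\(")


def find_calls(text):
    """Return (name, argument_text) pairs, outer calls before inner ones."""
    calls = []
    pos = 0
    while True:
        match = NAME_BEFORE_PAREN.search(text, pos)
        if match is None:
            break
        # Walk forward to the parenthesis that closes this call.
        depth = 1
        i = match.end()
        while i < len(text) and depth:
            if text[i] == "(":
                depth += 1
            elif text[i] == ")":
                depth -= 1
            i += 1
        if depth:
            raise ValueError("unbalanced parentheses")
        args = text[match.end():i - 1]
        calls.append((match.group(1), args))
        calls.extend(find_calls(args))  # same procedure on the arguments
        pos = i  # continue after this call to find its siblings
    return calls


for name, args in find_calls("a(b(1 + 2 * 2), c(3,4))"):
    print(f"{name}, {args}")
```

The regex finds where a call starts; the explicit depth count finds where it ends, which is the part the regex cannot do on its own.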

Andrea Ambu
+2  A: 

You can take a look at PLY (Python Lex-Yacc). It's (in my opinion) very simple to use and well documented, and it comes with a calculator example which could be a good starting point.

Paolo Tedesco