I am trying to find all the inputs/outputs of all MATLAB functions in our internal library. I am new (first time) to regex and have been trying to use the multiline mode in Python's re library.

The MATLAB function syntax looks like:

function output = func_name(input)

where the signature can span multiple lines.

I started with a pattern like:

re.compile(r"^.*function (.*)=(.*)\([.\n]*\)$", re.M)

but I keep getting an unsupported template operator error. Any pointer is appreciated!

EDIT:

Now I have:

pattern = re.compile(r"^\s*function (.*?)= [\w\n.]*?\(.*?\)", re.M|re.DOTALL)

which gives matches like:

        function [fcst, spread] = ...
                VolFcstMKT(R,...
                           mktVol,...
                           calibrate,...
                           spread_init,...
                           fcstdays,...
                           tsperyear)

        if(calibrate)
            if(nargin < 6)
                tsperyear = 252;
            end
            templen = length(R)

My question is why does it give the extra lines instead of stopping at the first )?

+4  A: 

The peculiar (internal) error you're getting is what you would see if you passed re.T instead of re.M as the second argument to re.compile (re.template -- a currently undocumented entry -- is the one intended to use it, and, in brief, template REs don't support repetition or backtracking). Can you print re.M to show its value in your code just before you call re.compile?

Once that's fixed, we can discuss the details of your desired RE (in brief: if the input part can include parentheses you're out of luck, otherwise re.DOTALL and some rewriting of your pattern should help) -- but fixing this weird internal error occurrence seems to take priority.

Edit: with this bug diagnosed (as per the comments below this Q), moving on to the OP's current question: the re.DOTALL|re.MULTILINE, plus the '$' at the end of the pattern, plus the everywhere-greedy matches (using .*, instead of .*? for non-greedy), all together ensure that if the regex matches it will match as broad a swathe as possible... that's exactly what this combo is asking for. Probably best to open another Q with a specific example: what's the input, what gets matched, what would you like the regex to match instead, etc.
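For the pattern in the question's edit specifically, the overshoot can be reproduced on a shortened copy of the sample. The sketch below suggests the likely cause: [\w\n.]*? cannot consume the spaces that indent VolFcstMKT, so under re.DOTALL the lazy (.*?) keeps growing across lines until it reaches a later "= " that is followed by a word and a parenthesis. The fixed pattern is only an illustrative assumption (it swaps in [\s.]*\w+), not something proposed in this thread, and it still stops at the first ), so nested parentheses remain unsupported.

```python
import re

# Shortened copy of the sample from the question, plus the start of the body.
source = """\
function [fcst, spread] = ...
        VolFcstMKT(R,...
                   mktVol,...
                   tsperyear)

if(calibrate)
    if(nargin < 6)
        tsperyear = 252;
    end
    templen = length(R)
"""

# Pattern from the question's edit: the class [\w\n.] has no space in it,
# so it cannot cross the indentation in front of VolFcstMKT.
lazy = re.compile(r"^\s*function (.*?)= [\w\n.]*?\(.*?\)", re.M | re.DOTALL)
overshoot = lazy.search(source).group(0)

# Hypothetical adjustment (an assumption, not from the thread): let the gap
# between '=' and the function name contain any whitespace and dots.
fixed = re.compile(r"^\s*function (.*?)=[\s.]*\w+\(.*?\)", re.M | re.DOTALL)
tight = fixed.search(source).group(0)

print(overshoot.splitlines()[-1].strip())  # last line swallowed by the lazy pattern
print(tight.splitlines()[-1].strip())      # last line matched by the adjusted pattern
```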

Alex Martelli
`print re.M`, `print re.MULTILINE`, and `print re.T` give me `8`, `True`, and `1`.
leon
@leon, that `True` is scary because its numerical value is in fact 1. Can you confirm that `int(re.MULTILINE)` is indeed 1 and that's what you're passing to `compile`? Please edit your Q so you can format things: as you can see by looking at yours, comments lack code formatting and so are often unreadable if there's any code in them!
Alex Martelli
I confirm `int(re.MULTILINE)` is 1. I tried to format the comment but did not know how to.
leon
Comments' formatting abilities are limited, that's why I say to edit your original question instead. So anyway, something has broken your `re.MULTILINE`, which should be 8, **not** 1 (nor `True` nor anything else). What does `python -c 'import re; print re.MULTILINE'` print? This tells us whether to look for this horrible breakage in `re.py` or other module imported at startup, or else, in your code and other code it imports. Anyway, this damage to the `re` module entirely explains your peculiar internal error.
Alex Martelli
That gives me 8
leon
@leon, so something is breaking it between the start of the program (when it's `8`) and the instant in which you try to use it (when it's `True`, which `==1`). So search for assignments to `re.MULTILINE` throughout your code and what you import, add `print re.MULTILINE` statements so you can narrow down the code region which has that horrible bug, use a debugger to do step by step (e.g. insert `import pdb; pdb.set_trace()` at the start of your code to make an interactive breakpoint), and so on, and so forth. I can't debug it FOR you, I've just shown you WHAT the bug is!-)
Alex Martelli
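One way to narrow down this kind of breakage is to drop a small check at a few points in the program and see where it first reports a problem. The sketch below is only an illustration: the helper name clobbered_flags and the stand-in "broken" module are made up for the example, and the expected values are the ones CPython uses for these three flags.

```python
import re
import types

# Usual integer values of a few common flags in CPython.
EXPECTED_FLAGS = {"IGNORECASE": 2, "MULTILINE": 8, "DOTALL": 16}

def clobbered_flags(module=re):
    """Return {name: current_value} for flags that no longer hold their usual value."""
    bad = {}
    for name, expected in EXPECTED_FLAGS.items():
        current = getattr(module, name)
        # A bool such as True is suspicious even though int(True) is a valid int.
        if isinstance(current, bool) or int(current) != expected:
            bad[name] = current
    return bad

# Stand-in object that mimics the breakage described in the comments
# (MULTILINE overwritten with True); the real re module is left untouched.
broken = types.SimpleNamespace(IGNORECASE=2, MULTILINE=True, DOTALL=16)

print(clobbered_flags())        # checks the real re module
print(clobbered_flags(broken))  # checks the simulated broken one
```

Calling clobbered_flags() right after the imports and again just before re.compile would show whether something in between reassigned a flag.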
A: 

How about plain Python string operations? Just an example:

with open("file") as f:
    for line in f:
        sline = line.strip()
        if not sline.startswith("function"):
            continue
        # Split once on "=": outputs on the left, the call on the right
        lhs, rhs = sline.split("=", 1)
        out = lhs.replace("function", "", 1).strip()
        # A bracketed output list becomes a list of names
        if out.startswith("[") and out.endswith("]"):
            out = [name.strip() for name in out[1:-1].split(",")]
        print(out)
        # Keep only the text between the parentheses
        m = rhs.find("(")
        if m != -1:
            rhs = [arg.strip() for arg in rhs[m + 1:rhs.rfind(")")].split(",")]
        print(rhs)

output example

$ cat file
function [mean,stdev] = stat(x)
n = length(x);
mean = sum(x)/n;
stdev = sqrt(sum((x-mean).^2/n));
function mean = avg(x,n)
mean = sum(x)/n;
$ python python.py
['mean', 'stdev']
['x']
mean
['x', 'n']

Of course, there are many other ways to declare a function in MATLAB, such as "function nothing" or "function a = b", so add those checks yourself.

ghostdog74
The difficult part of the problem is that the arguments are expected to span multiple lines. If the signature were on a single line, I could easily match everything around the equal sign.
leon
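One possible way to keep the string-based approach while coping with multi-line signatures is to merge continuation lines first, then parse each merged line as before. The helper below is a minimal sketch with a made-up name (join_continuations); it assumes the ... marker is the last thing on the line and does not try to handle text that follows it.

```python
def join_continuations(lines):
    """Merge each line ending in '...' with the line(s) that follow it."""
    joined = []
    pending = ""
    for line in lines:
        stripped = line.strip()
        if stripped.endswith("..."):
            # Drop the marker and keep accumulating.
            pending += stripped[:-3]
        else:
            joined.append(pending + stripped)
            pending = ""
    if pending:
        # File ended in the middle of a continuation; keep what we have.
        joined.append(pending)
    return joined

sample = [
    "function out1 = ...",
    "    func_name(in1,...",
    "              in2,...",
    "              in3)",
    "x = 1;",
]
for merged in join_continuations(sample):
    print(merged)
```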
+2  A: 

Here's a regular expression that should match any MATLAB function declaration at the start of an m-file:

^\s*function\s+((\[[\w\s,.]*\]|[\w]*)\s*=)?[\s.]*\w+(\([^)]*\))?

And here's a more detailed explanation of the components:

^\s*             # Match 0 or more whitespace characters
                 #    at the start
function         # Match the word function
\s+              # Match 1 or more whitespace characters
(                # Start grouping 1
 (               # Start grouping 2
  \[             # Match opening bracket
  [\w\s,.]*      # Match 0 or more letters, numbers,
                 #    whitespace, underscores, commas,
                 #    or periods...
  \]             # Match closing bracket
  |[\w]*         # ... or match 0 or more letters,
                 #    numbers, or underscores
 )               # End grouping 2
 \s*             # Match 0 or more whitespace characters
 =               # Match an equal sign
)?               # End grouping 1; Match it 0 or 1 times
[\s.]*           # Match 0 or more whitespace characters
                 #    or periods
\w+              # Match 1 or more letters, numbers, or
                 #    underscores
(                # Start grouping 3
 \(              # Match opening parenthesis
 [^)]*           # Match 0 or more characters that
                 #    aren't a closing parenthesis
 \)              # Match closing parenthesis
)?               # End grouping 3; Match it 0 or 1 times
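To get from a match to the actual input and output names, the pattern can be used from Python roughly as follows. This is a sketch under assumptions: the named groups (outputs, name, inputs) and the helpers split_names and parse_signatures are additions for illustration and are not part of the expression above; the character classes themselves are unchanged.

```python
import re

# Same structure as the pattern above, with named groups added (an assumption
# made for this example) so the individual parts can be read back.
SIGNATURE = re.compile(
    r"^\s*function\s+"
    r"(?:(?P<outputs>\[[\w\s,.]*\]|\w*)\s*=)?"
    r"[\s.]*(?P<name>\w+)"
    r"(?:\((?P<inputs>[^)]*)\))?",
    re.MULTILINE,
)

def split_names(raw):
    """Split a raw list on commas, dropping brackets, '...' markers and whitespace."""
    if not raw:
        return []
    cleaned = raw.replace("...", " ").strip("[] \t\r\n")
    return [part.strip() for part in cleaned.split(",") if part.strip()]

def parse_signatures(text):
    """Return (outputs, name, inputs) for every declaration found in text."""
    return [
        (split_names(m.group("outputs")), m.group("name"), split_names(m.group("inputs")))
        for m in SIGNATURE.finditer(text)
    ]

sample = """\
function [fcst, spread] = ...
        VolFcstMKT(R,...
                   mktVol,...
                   tsperyear)

if(calibrate)
    templen = length(R)
"""
print(parse_signatures(sample))
```

Because finditer is used with re.MULTILINE, a file holding several functions yields one tuple per declaration.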

Whether you use regular expressions or basic string operations, you should keep in mind the different forms that the function declaration can take in MATLAB. The general form is:

function [out1,out2,...] = func_name(in1,in2,...)

Specifically, you could see any of the following forms:

function func_name                 %# No inputs or outputs
function func_name(in1)            %# 1 input
function func_name(in1,in2)        %# 2 inputs
function out1 = func_name          %# 1 output
function [out1] = func_name        %# Also 1 output
function [out1,out2] = func_name   %# 2 outputs
...

You can also have line continuations (...) at many points, like after the equal sign or within the argument list:

function out1 = ...
    func_name(in1,...
              in2,...
              in3)

You may also want to take into account factors like variable input argument lists and ignored input arguments:

function func_name(varargin)       %# Any number of inputs possible
function func_name(in1,~,in3)      %# Second of three inputs is ignored
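A quick way to check a candidate pattern against the forms listed so far is to run each one through it and collect the ones that are not matched in full. The sketch below does this for the regular expression given earlier in this answer; the helper name unmatched and the FORMS list are just scaffolding for the example.

```python
import re

# The pattern from this answer, unchanged.
DECLARATION = re.compile(
    r"^\s*function\s+((\[[\w\s,.]*\]|[\w]*)\s*=)?[\s.]*\w+(\([^)]*\))?"
)

# Declaration forms taken from the lists above (trailing comments removed).
FORMS = [
    "function func_name",
    "function func_name(in1)",
    "function func_name(in1,in2)",
    "function out1 = func_name",
    "function [out1] = func_name",
    "function [out1,out2] = func_name",
    "function out1 = ...\n    func_name(in1,...\n              in2,...\n              in3)",
    "function func_name(varargin)",
    "function func_name(in1,~,in3)",
]

def unmatched(forms):
    """Return the forms that the pattern does not match from start to end."""
    return [form for form in forms if DECLARATION.fullmatch(form) is None]

print(unmatched(FORMS))
```

Any new form you come across in the library can be appended to FORMS to see whether the pattern needs extending.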

Of course, many m-files contain more than 1 function, so you will have to decide how to deal with subfunctions, nested functions, and potentially even anonymous functions (which have a different declaration syntax).

gnovice