I am using the following regex:

(public|private +)?function +([a-zA-Z_$][0-9a-zA-Z_$]*) *\\(([0-9a-zA-Z_$, ]*)\\) *{(.*)}

To match the following string:

public function messenger(text){
sendMsg(text);
}
private function sendMsg(text){
alert(text);
}

(There are no line breaks in the string; they are converted to spaces before the regex runs.)

I wanted it to capture both functions, but it is capturing: $1: "" $2: "messenger" $3: "text" $4: " sendMsg(text); } private function sendMsg(text){ alert(text); "

By the way, I am using JavaScript.

+1  A: 

Try changing

(.*)

to

(.*?)
Tim Green
This is not going to work if there is another `}` within the function body. E.g., something like `if(foo){ doSomething; }` in the function body will break your solution. You cannot use regex to parse a non-regular text.
macek
+3  A: 

By default, the * operator is greedy, consuming as many characters as possible. Try *?, the non-greedy equivalent.

/((?:(?:public|private)\s+)?)function\s+([a-zA-Z_$][\w$]*)\s*\(([\w$, ]*)\)\s*{(.*?)}/

\w matches a single word character; it is equivalent to [a-zA-Z0-9_] and can also be used inside character classes (as in [\w$] above). Note that this won't match functions with nested blocks in them, such as:

function foo() {
    for (p in this) {
      ...
    }
}

That ranges from tricky to impossible with regexps unless they support recursion (JavaScript's don't), which is why you need a proper parser.
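To get a feel for what a parser does that a single regexp can't, here is a minimal sketch that tracks brace depth explicitly. It is not part of the regex above: `findBlockEnd` is a hypothetical helper, and it assumes no braces appear inside string literals or comments, which a real parser would have to handle.

```javascript
// Hypothetical helper: given the index of an opening "{", return the index
// of the matching "}" by counting nesting depth, or -1 if it is unbalanced.
// Assumption: no braces occur inside strings or comments.
function findBlockEnd(src, openIndex) {
    var depth = 0;
    for (var i = openIndex; i < src.length; i++) {
        var ch = src.charAt(i);
        if (ch === "{") {
            depth++;
        } else if (ch === "}") {
            depth--;
            if (depth === 0) return i;
        }
    }
    return -1;
}

var src = "function foo() { for (p in this) { bar(p); } }";
var open = src.indexOf("{");
var close = findBlockEnd(src, open);

console.log(src.slice(open + 1, close)); // " for (p in this) { bar(p); } "
```

The counter is exactly the piece of state that a JavaScript regexp has no way to keep.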

outis
Thank you for avoiding tunnel vision, outis :)
macek
+1  A: 

Change this last part of your regex:

{(.*)}

To this:

{(.*?)}

This makes it "non-greedy", so that it doesn't capture to the last } in the input.
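A quick sketch of the difference, using a shortened version of your input and a simplified pattern (the variable names are just for illustration):

```javascript
// Input flattened to one line, as described in the question.
var src = "function a(x){ f(x); } function b(y){ g(y); }";

// Greedy: .* runs on to the LAST "}" in the input.
var greedy = /function +(\w+) *\((\w*)\) *\{(.*)\}/.exec(src);

// Lazy: .*? stops at the FIRST "}" after the opening brace.
var lazy = /function +(\w+) *\((\w*)\) *\{(.*?)\}/.exec(src);

console.log(greedy[3]); // " f(x); } function b(y){ g(y); "
console.log(lazy[3]);   // " f(x); "

// With the g flag, exec() can be called in a loop to visit every function.
var re = /function +(\w+) *\((\w*)\) *\{(.*?)\}/g;
var names = [], m;
while ((m = re.exec(src)) !== null) names.push(m[1]);

console.log(names); // ["a", "b"]
```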

Note that this will break if any of the function code ever includes a } character, but then you're dealing with nesting, which is never something that regular expressions do well.
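For example, with a simplified pattern and a made-up function body containing an `if` block, the non-greedy version cuts the body off at the inner brace:

```javascript
// The body contains a nested { ... } block.
var nested = "function f(a){ if(a){ g(a); } h(a); }";

var hit = /function +(\w+) *\((\w*)\) *\{(.*?)\}/.exec(nested);

// The capture ends at the first "}", so " h(a); " is lost.
console.log(hit[3]); // " if(a){ g(a); "
```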

Chad Birch
This is not going to work if there is another `}` within the function body. E.g., something like `if(foo){ doSomething; }` in the function body will break your solution. You cannot use regex to parse a non-regular text.
macek
@smotchkkiss, this is actually a kind of circular argument, because a regular language is by definition the one that regexps can parse. Basically, "you cannot parse something you cannot parse", don't you know.
stereofrog
+2  A: 

Because you accepted my (wrong) answer in the other thread, I feel kind of obliged to post a proper solution. This is not going to be quick and short, but hopefully it helps a bit.

The following is how I would write a regexp-based parser for a C-like language if I had to.

<script>
/* 
Let's start with this simple utility function. It's a
kind of stubborn version of String.replace() - it
checks the string over and over again, until nothing
more can be replaced
*/

function replaceAll(str, regexp, repl) {
    str = str.toString();
    while(str.match(regexp))
        str = str.replace(regexp, repl);
    return str;
}

/*
Next, we need a function that removes specific
constructs from the text and replaces them with
special "markers", which are "invisible" for further
processing. The matches are collected in a buffer so
that they can be restored later.
*/

function isolate(type, str, regexp, buf) {
    return replaceAll(str, regexp, function($0) {
        buf.push($0);
        return "<<" + type + (buf.length - 1) + ">>";
    });
} 

/*
The following restores "isolated" strings from the
buffer:
*/

function restore(str, buf) {
    return replaceAll(str, /<<[a-z]+(\d+)>>/g, function($0, $1) {
        // always pass the radix: older engines read "08" as octal
        return buf[parseInt($1, 10)];
    });
}

/*
Write down the grammar. JavaScript regexps are
notoriously hard to read (there is no "comment"
option like in Perl), so let's use a more
readable format with spacing and substitution
variables. Note that the "$string" and "$block" rules
actually match the markers produced by isolate().
*/

var grammar = {
    $nothing: "",
    $space:  "\\s",
    $access: "public $space+ | private $space+ | $nothing",
    $ident:  "[a-z_]\\w*",
    $args:   "[^()]*",
    $string: "<<string [0-9]+>>",
    $block:  "<<block [0-9]+>>",
    $fun:    "($access) function $space+ ($ident) $space* \\( ($args) \\) $space* ($block)"
};

/*
This compiles the grammar to pure regexps - one for
each grammar rule:
*/

function compile(grammar) {
    var re = {};
    for(var p in grammar)
        re[p] = new RegExp(
            replaceAll(grammar[p], /\$\w+/g, 
                    function($0) { return grammar[$0] }).
            replace(/\s+/g, ""), 
        "gi");
    return re;
}

/*
Let's put everything together
*/

function findFunctions(code, callback) {
    var buf = [];

    // isolate strings
    // (a backslash must only be consumed as part of an escape pair)
    code = isolate("string", code, /"(\\.|[^"\\])*"/g, buf);

    // isolate blocks in curly brackets {...}
    code = isolate("block",  code, /{[^{}]*}/g, buf);

    // compile our grammar
    var re = compile(grammar);

    // and perform an action for each function we can find
    code.replace(re.$fun, function() {
        var p = [];
        // arguments = [match, group1..groupN, offset, input];
        // restore only the captured groups
        for(var i = 1; i < arguments.length - 2; i++)
            p.push(restore(arguments[i], buf));
        return callback.apply(this, p);
    });
}
</script>
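To make the isolation step less abstract, here is a small standalone sketch of what happens to a one-line input. It repeats replaceAll(), isolate() and restore() so that it runs on its own (for instance in Node), and it uses string and block patterns like the ones in findFunctions(). The sample input is made up for illustration.

```javascript
// Copies of the helpers from the script above, so this snippet is standalone.
function replaceAll(str, regexp, repl) {
    str = str.toString();
    while (str.match(regexp))
        str = str.replace(regexp, repl);
    return str;
}

function isolate(type, str, regexp, buf) {
    return replaceAll(str, regexp, function($0) {
        buf.push($0);
        return "<<" + type + (buf.length - 1) + ">>";
    });
}

function restore(str, buf) {
    return replaceAll(str, /<<[a-z]+(\d+)>>/g, function($0, $1) {
        return buf[parseInt($1, 10)];
    });
}

var buf = [];
var original = 'function f(a) { if (a) { say("}"); } }';

// Step 1: hide string literals, so the "}" inside the string can't confuse us.
var code = isolate("string", original, /"(\\.|[^"\\])*"/g, buf);
console.log(code); // function f(a) { if (a) { say(<<string0>>); } }

// Step 2: hide blocks, innermost first, until no braces are left.
code = isolate("block", code, /\{[^{}]*\}/g, buf);
console.log(code); // function f(a) <<block2>>

// Step 3: restoring the markers gives back the original text.
console.log(restore(code, buf) === original); // true
```

After step 2 the function body is a single flat marker, which is why the `$fun` rule can match it without knowing anything about nesting.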

Now we're ready to test. Our parser must be able to deal with escaped strings and arbitrarily nested blocks.

<code>
public function blah(arg1, arg2) {
    if("some string" == "public function") {
        callAnother("{hello}")
        while(something) {
            alert("escaped \" string");
        }
    }
}

function yetAnother() { alert("blah") }
</code>

<script>
window.onload = function() {
    var code = document.getElementsByTagName("code")[0].innerHTML;
    findFunctions(code, function(access, name, args, body) {
        document.write(
            "<br>" + 
            "<br> access= " + access +
            "<br> name= "   + name +
            "<br> args= "   + args +
            "<br> body= "   + body
        )
    });
}
</script> 
stereofrog