Given a function, I'm trying to find out the names of the nested functions in it (only one level deep).

A simple regex against toString() worked until I started using functions with comments in them. It turns out that some browsers store parts of the raw source while others reconstruct the source from what's compiled, so the output of toString() may contain the original code comments in some browsers. As an aside, here are my findings:

Test subject

function/*post-keyword*/fn/*post-name*/()/*post-parens*/{
    /*inside*/
}

document.write(fn.toString());

Results

Browser      post-keyword  post-name  post-parens  inside
-----------  ------------  ---------  -----------  ------
Firefox      No            No         No           No
Safari       No            No         No           No
Chrome       No            No         Yes          Yes
IE           Yes           Yes        Yes          Yes
Opera        Yes           Yes        Yes          Yes
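
The table can be reproduced on any engine by probing for each marker directly. This is only a minimal sketch: `probeCommentRetention` is a name made up for illustration, and the values it reports depend entirely on the engine it runs in.

```javascript
// Same test subject as above.
function/*post-keyword*/fn/*post-name*/()/*post-parens*/{
    /*inside*/
}

// Hypothetical helper: reports which comment positions survive toString().
function probeCommentRetention(f) {
    var source = Function.prototype.toString.call(f);
    var markers = ['post-keyword', 'post-name', 'post-parens', 'inside'];
    var report = {};

    for (var i = 0; i < markers.length; ++i) {
        // true if the comment text is still present in the stringified function
        report[markers[i]] = source.indexOf('/*' + markers[i] + '*/') !== -1;
    }

    return report;
}

var report = probeCommentRetention(fn);
```

Each key of `report` corresponds to one column of the table, with `true` meaning "Yes".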

I'm looking for a cross-browser way of extracting the nested function names from a given function. The solution should be able to extract "fn1" and "fn2" out of the following function:

function someFn() {
    /**
     * Some comment
     */
     function fn1() {
         alert("/*This is not a comment, it's a string literal*/");
     }

     function // keyword
     fn2 // name
     (x, y) // arguments
     {
         /*
         body
         */
     }

     var f = function () { // anonymous, ignore
     };
}

The solution doesn't have to be pure regex.

Update: You can assume that we're always dealing with valid, properly nested code with all string literals, comments and blocks terminated properly. This is because I'm parsing a function that has already been compiled as a valid function.

Update2: If you're wondering about the motivation behind this: I'm working on a new JavaScript unit testing framework that's called jsUnity. There are several different formats in which you can write tests & test suites. One of them is a function:

function myTests() {
    function setUp() {
    }

    function tearDown() {
    }

    function testSomething() {
    }

    function testSomethingElse() {
    }
}

Since the functions are hidden inside a closure, there's no way for me to invoke them from outside the function. I therefore convert the outer function to a string, extract the function names, append a "now run the given inner function" statement at the bottom and recompile it as a function with new Function(). If the test functions have comments in them, it gets tricky to extract the function names and to avoid false positives. Hence I'm soliciting the help of the SO community...
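
To make the recompilation step concrete, here is a rough sketch of it. `makeRunner` is a hypothetical name (it is not jsUnity's actual code), and it assumes the inner function name is already known, which is exactly the part this question is about. It also assumes the first "{" and the last "}" in the source delimit the outer body.

```javascript
// Hypothetical helper: builds a function that runs one inner function of suiteFn.
// Assumption: the first "{" and last "}" in the source delimit the outer body,
// which does not hold if a comment before the body contains a brace.
function makeRunner(suiteFn, innerName) {
    var source = suiteFn.toString();
    var body = source.substring(source.indexOf('{') + 1, source.lastIndexOf('}'));

    // Append the "now run the given inner function" statement and recompile.
    return new Function(body + '\nreturn ' + innerName + '();');
}

function myTests() {
    function setUp() { return 'setUp ran'; }
    function testSomething() { return 'testSomething ran'; }
}

var runTestSomething = makeRunner(myTests, 'testSomething');
var runResult = runTestSomething();
```

Note that a function created with new Function() is compiled in the global scope, so the recompiled body can no longer see variables that the original outer function closed over.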

Update3: I've come up with a new solution that doesn't require a lot of semantic fiddling with code. I use the original source itself to probe for first-level functions.

A: 
<pre>
<script type="text/javascript">
function someFn() {
 /**
  * Some comment
  */
  function fn1() {
   alert("/*This is not a comment, it's a string literal*/");
  }

  function // keyword
  fn2 // name
  (x, y) // arguments
  {
   /*
   body
   */
  }

  function fn3() {
  alert("this is the word function in a string literal");
  }

  var f = function () { // anonymous, ignore
  };
}

var s = someFn.toString();
// remove inline comments
s = s.replace(/\/\/.*/g, "");
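// compact all whitespace (including single newlines) to a single space,
// so that the block-comment regex below can match across former line breaks
s = s.replace(/\s+/g, " ");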
// remove all block comments, including those in string literals
s = s.replace(/\/\*.*?\*\//g, "");
document.writeln(s);
// remove string literals to avoid false matches with the keyword 'function'
s = s.replace(/'.*?'/g, "");
s = s.replace(/".*?"/g, "");
document.writeln(s);
// find all the function definitions
var matches = s.match(/function(.*?)\(/g);
for (var ii = 1; ii < matches.length; ++ii) {
 // extract the function name
 var funcName = matches[ii].replace(/function(.+)\(/, "$1");
 // remove any remaining leading or trailing whitespace
 funcName = funcName.replace(/\s+$|^\s+/g, "");
 if (funcName === '') {
  // anonymous function, discard
  continue;
 }
 // output the results
 document.writeln('[' + funcName + ']');
}
</script>
</pre>

I'm sure I missed something, but from your requirements in the original question, I think I've met the goal, including getting rid of the possibility of finding the function keyword in string literals.

One last point, I don't see any problem with mangling the string literals in the function blocks. Your requirement was to find the function names, so I didn't bother trying to preserve the function content.

Grant Wagner
I think this will break if comments and strings don't nest 'properly' - imo there's no way around manually parsing the source code...
Christoph
You can assume the nesting is proper because I'm parsing an already compiled (valid) JavaScript function.
Ates Goral
@Ates: with 'nested improperly' I meant things like ` // " <NEWLINE> " `, ` /* " */ " `,...
Christoph
+3  A: 

Cosmetic changes and bugfix

The regular expression must read \bfunction\b to avoid false positives!

Functions defined in blocks (e.g. in the bodies of loops) will be ignored if nested does not evaluate to true.
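
A small illustration of the false positive that the word boundaries avoid (the sample string is made up for this example):

```javascript
// "functionCount" contains the substring "function" but is not the keyword.
var code = 'var functionCount = 1; function real() {}';

var looseHits = code.match(/function/g).length;       // counts the identifier too
var strictHits = code.match(/\bfunction\b/g).length;  // counts only the keyword
```

One caveat: `\b` does not treat `$` as a word character, so an identifier such as `$function` would still match.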

function tokenize(code) {
    var code = code.split(/\\./).join(''),
        regex = /\bfunction\b|\(|\)|\{|\}|\/\*|\*\/|\/\/|"|'|\n|\s+/mg,
        tokens = [],
        pos = 0;

    for(var matches; matches = regex.exec(code); pos = regex.lastIndex) {
        var match = matches[0],
            matchStart = regex.lastIndex - match.length;

        if(pos < matchStart)
            tokens.push(code.substring(pos, matchStart));

        tokens.push(match);
    }

    if(pos < code.length)
        tokens.push(code.substring(pos));

    return tokens;
}

var separators = {
    '/*' : '*/',
    '//' : '\n',
    '"' : '"',
    '\'' : '\''
};

function extractInnerFunctionNames(func, nested) {
    var names = [],
        tokens = tokenize(func.toString()),
        level = 0;

    for(var i = 0; i < tokens.length; ++i) {
        var token = tokens[i];

        switch(token) {
            case '{':
            ++level;
            break;

            case '}':
            --level;
            break;

            case '/*':
            case '//':
            case '"':
            case '\'':
            var sep = separators[token];
            while(++i < tokens.length && tokens[i] !== sep);
            break;

            case 'function':
            if(level === 1 || (nested && level)) {
                while(++i < tokens.length) {
                    token = tokens[i];

                    if(token === '(')
                        break;

                    if(/^\s+$/.test(token))
                        continue;

                    if(token === '/*' || token === '//') {
                        var sep = separators[token];
                        while(++i < tokens.length && tokens[i] !== sep);
                        continue;
                    }

                    names.push(token);
                    break;
                }
            }
            break;
        }
    }

    return names;
}
Christoph
@Peter: should work now
Christoph
Yep, that appears to work here now.
Peter Boughton
Thanks for this answer Christoph. I'll write some unit tests to see if it meets all scenarios. I'm also initiating a bounty to see if anyone can come up with a shorter solution.
Ates Goral
Just came up with this alternative: http://stackoverflow.com/questions/517411/extracting-nested-function-names-from-a-javascript-function/546984#546984
Ates Goral
Functions declared inside loops aren't really "nested", just like "var" declarations inside loops aren't really nested. The functions are visible outside the loop too.
Pointy
+3  A: 

The academically correct way to handle this would be creating a lexer and parser for a subset of Javascript (the function definition), generated by a formal grammar (see this link on the subject, for example).

Take a look at JS/CC, for a Javascript parser generator.

Other solutions are just regex hacks, that lead to unmaintainable/unreadable code and probably to hidden parsing errors in particular cases.
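
To give a rough idea of what the lexer approach looks like, here is a hand-written, single-pass scanner. It is not generated by JS/CC and it is not a full JavaScript lexer: it assumes valid input, knows nothing about regex literals or template literals, and will also report named function expressions at the first level. `listInnerFunctionNames` and the sample source are names made up for this sketch.

```javascript
// Hypothetical sketch: returns names of functions declared one level deep.
function listInnerFunctionNames(source) {
    var names = [], depth = 0, i = 0, n = source.length;
    var identStart = /[A-Za-z_$]/, identPart = /[A-Za-z0-9_$]/;

    // Returns the index of the first character at or after p that is
    // neither whitespace nor part of a comment.
    function skipTrivia(p) {
        for (;;) {
            var c = source.charAt(p), next = source.charAt(p + 1);
            if (c === '/' && next === '/') {
                var nl = source.indexOf('\n', p);
                p = nl === -1 ? n : nl + 1;
            } else if (c === '/' && next === '*') {
                var end = source.indexOf('*/', p + 2);
                p = end === -1 ? n : end + 2;
            } else if (p < n && /\s/.test(c)) {
                p++;
            } else {
                return p;
            }
        }
    }

    while (i < n) {
        var after = skipTrivia(i);
        if (after !== i) { i = after; continue; }

        var ch = source.charAt(i);
        if (ch === '"' || ch === "'") {
            // Skip a string literal, honoring backslash escapes.
            i++;
            while (i < n && source.charAt(i) !== ch) {
                if (source.charAt(i) === '\\') i++;
                i++;
            }
            i++;
        } else if (ch === '{') {
            depth++; i++;
        } else if (ch === '}') {
            depth--; i++;
        } else if (identStart.test(ch)) {
            // Read a whole word so that "functionCount" is not mistaken for the keyword.
            var start = i;
            while (i < n && identPart.test(source.charAt(i))) i++;
            if (source.substring(start, i) === 'function' && depth === 1) {
                var p = skipTrivia(i), nameStart = p;
                while (p < n && identPart.test(source.charAt(p))) p++;
                if (p > nameStart) names.push(source.substring(nameStart, p));
                i = p;
            }
        } else {
            i++;
        }
    }
    return names;
}

var sample = [
    'function someFn() {',
    '    /** Some comment */',
    '    function fn1() {',
    '        alert("/*This is not a comment, it\'s a string literal*/");',
    '    }',
    '    function // keyword',
    '    fn2 // name',
    '    (x, y) // arguments',
    '    { /* body */ }',
    '    var f = function () { // anonymous, ignore',
    '    };',
    '}'
].join('\n');

var found = listInnerFunctionNames(sample); // ["fn1", "fn2"]
```

The point of tracking state character by character is that comments, strings and brace depth are resolved in source order, so a quote inside a comment or a comment marker inside a string cannot confuse later steps.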

As a side note, I'm not sure I understand why you aren't specifying the list of unit test functions in your product in a different way (an array of functions?).

friol
jsUnity supports a variety of formats, including an array of functions. The thing I like about the closure syntax is its compactness and resemblance to jUnit tests.
Ates Goral
JS/CC looks very interesting and seems to be the right path in achieving what I want.
Ates Goral
A: 

Would it matter if you defined your tests like:

var tests = {
 test1: function (){
  console.log( "test 1 ran" );
 },

 test2: function (){
  console.log( "test 2 ran" );
 },

 test3: function (){
  console.log( "test 3 ran" );
 }
};

Then you could run them as easily as this:

for( var test in tests ){ 
 tests[test]();
}

Which looks much easier. You can even carry the tests around in JSON that way.
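
A slightly more defensive version of that loop might look like the sketch below. `runAll` is a hypothetical name; it skips inherited and non-function properties and records a result per test instead of stopping at the first exception.

```javascript
// Hypothetical runner for an object whose own properties are test functions.
function runAll(tests) {
    var results = {};

    for (var name in tests) {
        // Ignore properties inherited through the prototype chain.
        if (!Object.prototype.hasOwnProperty.call(tests, name)) continue;
        // Ignore anything that is not callable.
        if (typeof tests[name] !== 'function') continue;

        try {
            tests[name]();
            results[name] = 'passed';
        } catch (e) {
            results[name] = 'failed: ' + e;
        }
    }

    return results;
}

var summary = runAll({
    test1: function () {},
    test2: function () { throw new Error('boom'); }
});
```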

Mehmet Duran
@Mehmet: This is in fact a syntax already supported by jsUnity: http://code.google.com/p/jsunity/wiki/ObjectTestSuite
Ates Goral
+1  A: 

I like what you're doing with jsUnity. And when I see something I like (and have enough free time ;)), I try to reimplement it in a way which better suits my needs (also known as 'not-invented-here' syndrome).

The result of my efforts is described in this article, the code can be found here.

Feel free to rip-out any parts you like - you can assume the code to be in the public domain.

Christoph
This looks very interesting! I guess it's legal in JS to repeat the same label? I'll hopefully apply your answer to jsUnity some time soon. And thanks for the nod ;)
Ates Goral
@Ates: ECMA-262, 3rd edition, 12.12: labels are added to the label set of the statement they prefix (ie the strings in this case); it's only illegal to nest statements with the same label, eg `foo: while(true) { foo: "bar"; }`
Christoph
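
The label rule from the comment above can be checked without running anything, by only compiling the two snippets. `compiles` is a helper name made up for this sketch:

```javascript
// Hypothetical helper: true if the source compiles, false on a SyntaxError.
// new Function() only compiles the code here; the body is never executed.
function compiles(source) {
    try {
        new Function(source);
        return true;
    } catch (e) {
        if (e instanceof SyntaxError) return false;
        throw e;
    }
}

// The same label on consecutive statements is accepted...
var repeatedLabels = compiles('test: "first"; test: "second";');

// ...but nesting a statement inside another with the same label is rejected.
var nestedLabels = compiles('foo: while (true) { foo: "bar"; }');
```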
+1  A: 

The trick is to basically generate a probe function that will check if a given name is the name of a nested (first-level) function. The probe function uses the function body of the original function, prefixed with code to check the given name within the scope of the probe function. OK, this can be better explained with the actual code:

function splitFunction(fn) {
    // [^\)]* allows whitespace inside the parameter list, e.g. "(x, y)"
    var tokens =
        /^[\s\r\n]*function[\s\r\n]*([^\(\s\r\n]*?)[\s\r\n]*\([^\)]*\)[\s\r\n]*\{((?:[^}]*\}?)+)\}\s*$/
        .exec(fn);

    if (!tokens) {
        throw new Error("Invalid function.");
    }

    return {
        name: tokens[1],
        body: tokens[2]
    };
}

var probeOutside = function () {
    return eval(
        "typeof $fn$ === \"function\""
        .split("$fn$")
        .join(arguments[0]));
};

function extractFunctions(fn) {
    var fnParts = splitFunction(fn);

    var probeInside = new Function(
        splitFunction(probeOutside).body + fnParts.body);

    var tokens;
    var fns = [];
    var tokenRe = /(\w+)/g;

    while ((tokens = tokenRe.exec(fnParts.body))) {
        var token = tokens[1];

        try {
            if (probeInside(token) && !probeOutside(token)) {
                fns.push(token);
            }
        } catch (e) {
            // ignore token
        }
    }

    return fns;
}

Runs fine against the following on Firefox, IE, Safari, Opera and Chrome:

function testGlobalFn() {}

function testSuite() {
    function testA() {
        function testNested() {
        }
    }

    // function testComment() {}
    // function testGlobalFn() {}

    function // comments
    testB /* don't matter */
    () // neither does whitespace
    {
        var s = "function testString() {}";
    }
}

document.write(extractFunctions(testSuite));
// writes "testA,testB"


Edit by Christoph, with inline answers by Ates:

Some comments, questions and suggestions:

  1. Is there a reason for checking

    typeof $fn$ !== "undefined" && $fn$ instanceof Function
    

    instead of using

    typeof $fn$ === "function"
    

    instanceof is less safe than using typeof because it will fail when passing objects between frame boundaries. I know that IE returns wrong typeof information for some built-in functions, but afaik instanceof will fail in these cases as well, so why the more complicated but less safe test?


[AG] There was absolutely no legitimate reason for it. I've changed it to the simpler "typeof === function" as you suggested.


  2. How are you going to prevent the wrongful exclusion of functions for which a function with the same name exists in the outer scope, e.g.

    function foo() {}
    
    
    function TestSuite() {
        function foo() {}
    }
    


[AG] I have no idea. Can you think of anything? Which do you think is better: (a) wrongful exclusion of a function inside, or (b) wrongful inclusion of a function outside?

I started to think that the ideal solution will be a combination of your solution and this probing approach; figure out the real function names that are inside the closure and then use probing to collect references to the actual functions (so that they can be directly called from outside).


  3. It might be possible to modify your implementation so that the function's body only has to be eval()'ed once and not once per token, which is rather inefficient. I might try to see what I can come up with when I have some more free time today...


[AG] Note that the entire function body is not eval'd. It's only the bit that's inserted at the top of the body.

[CG] You're right - the function's body only gets parsed once during the creation of probeInside - you did some nice hacking, there ;). I have some free time today, so let's see what I can come up with...
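
On the first point above (typeof versus instanceof across frames), the effect can be reproduced outside a browser. The sketch below assumes Node.js and uses its `vm` module to stand in for a second frame; in a browser the foreign function would come from an iframe instead.

```javascript
// Assumption: running under Node.js, where a vm context plays the role of another frame.
var vm = require('vm');

// A function created in a separate context has its own Function constructor.
var foreignFn = vm.runInNewContext('(function () {})');

var byTypeof = typeof foreignFn === 'function';    // true
var byInstanceof = foreignFn instanceof Function;  // false: different Function.prototype
```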

A solution that uses your parsing method to extract the real function names could just use one eval to return an array of references to the actual functions:

return eval("[" + fnList + "]");
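
As a small sketch of why a single eval is enough: a direct eval call runs in the scope of the function that contains it, so it can see the inner functions by name. Here `fnList` is written out by hand; in practice it would be the comma-joined result of the name extraction, and the eval line would be appended to the suite's body.

```javascript
function suite() {
    function testA() { return 'A'; }
    function testB() { return 'B'; }

    // Assumption: this list would normally be produced by the name extraction.
    var fnList = 'testA,testB';

    // Direct eval resolves the names in suite()'s local scope.
    return eval('[' + fnList + ']');
}

var refs = suite();
var firstResult = refs[0]();
```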


[CG] Here is what I came up with. An added bonus is that the outer function stays intact and thus may still act as a closure around the inner functions. Just copy the code into a blank page and see if it works - no guarantees on bug-freelessness ;)

<pre><script>
var extractFunctions = (function() {
    var level, names;

    function tokenize(code) {
        var code = code.split(/\\./).join(''),
            regex = /\bfunction\b|\(|\)|\{|\}|\/\*|\*\/|\/\/|"|'|\n|\s+|\\/mg,
            tokens = [],
            pos = 0;

        for(var matches; matches = regex.exec(code); pos = regex.lastIndex) {
            var match = matches[0],
                matchStart = regex.lastIndex - match.length;

            if(pos < matchStart)
                tokens.push(code.substring(pos, matchStart));

            tokens.push(match);
        }

        if(pos < code.length)
            tokens.push(code.substring(pos));

        return tokens;
    }

    function parse(tokens, callback) {
        for(var i = 0; i < tokens.length; ++i) {
            var j = callback(tokens[i], tokens, i);
            if(j === false) break;
            else if(typeof j === 'number') i = j;
        }
    }

    function skip(tokens, idx, limiter, escapes) {
        while(++idx < tokens.length && tokens[idx] !== limiter)
            if(escapes && tokens[idx] === '\\') ++idx;

        return idx;
    }

    function removeDeclaration(token, tokens, idx) {
        switch(token) {
            case '/*':
            return skip(tokens, idx, '*/');

            case '//':
            return skip(tokens, idx, '\n');

            case ')':
            tokens.splice(0, idx + 1);
            return false;
        }
    }

    function extractTopLevelFunctionNames(token, tokens, idx) {
        switch(token) {
            case '{':
            ++level;
            return;

            case '}':
            --level;
            return;

            case '/*':
            return skip(tokens, idx, '*/');

            case '//':
            return skip(tokens, idx, '\n');

            case '"':
            case '\'':
            return skip(tokens, idx, token, true);

            case 'function':
            if(level === 1) {
                while(++idx < tokens.length) {
                    token = tokens[idx];

                    if(token === '(')
                        return idx;

                    if(/^\s+$/.test(token))
                        continue;

                    if(token === '/*') {
                        idx = skip(tokens, idx, '*/');
                        continue;
                    }

                    if(token === '//') {
                        idx = skip(tokens, idx, '\n');
                        continue;
                    }

                    names.push(token);
                    return idx;
                }
            }
            return;
        }
    }

    function getTopLevelFunctionRefs(func) {
        var tokens = tokenize(func.toString());
        parse(tokens, removeDeclaration);

        names = [], level = 0;
        parse(tokens, extractTopLevelFunctionNames);

        var code = tokens.join('') + '\nthis._refs = [' +
            names.join(',') + '];';

        return (new (new Function(code)))._refs;
    }

    return getTopLevelFunctionRefs;
})();

function testSuite() {
    function testA() {
        function testNested() {
        }
    }

    // function testComment() {}
    // function testGlobalFn() {}

    function // comments
    testB /* don't matter */
    () // neither does whitespace
    {
        var s = "function testString() {}";
    }
}

document.writeln(extractFunctions(testSuite).join('\n---\n'));
</script></pre>

Not as elegant as LISP macros, but it's still nice what JavaScript is capable of ;)

Ates Goral
1. why not `isFnTmp = "typeof $fn$ === \"function\"` - `instanceof` breaks across frame boundaries! - 2. how do you plan on handling `window.func = function func() {}`?
Christoph
3. I don't think this will perform well (didn't benchmark, shame on me :( ) - you'll have to `eval()` the whole function body for each token!
Christoph
@Christoph: Your concern #1 should be (at least to a practical extent) handled with the probeOutside addition.
Ates Goral
@Christoph: #3: You may have misread the code; only the typeof check is eval'd, once per token. Of course, depending on the # of tokens in a given code block, this may be cumbersome. However, performance is not a concern since this is only done at test suite compilation.
Ates Goral
@Ates: Do you have a problem with me editing your answer to add my questions there? The comments are a bit limiting...
Christoph
@Christoph: I've wiki-ized the answer to help with the collaborative effort :)
Ates Goral
@Ates: added my version
Christoph