Please consider the following code, with which I'm trying to parse only the first phpDoc-style comment in a file (not using any other libraries). The file contents are put in the $data variable for testing purposes:

$data = "
/**
 * @file    A lot of info about this file
 *          Could even continue on the next line
 * @author  [email protected]
 * @version 2010-05-01
 * @todo    do stuff...
 */

/**
 * Comment bij functie bar()
 * @param Array met dingen
 */
function bar($baz) {
  echo $baz;
}
";

$data =  trim(preg_replace('/\r?\n *\* */', ' ', $data));
preg_match_all('/@([a-z]+)\s+(.*?)\s*(?=$|@[a-z]+\s)/s', $data, $matches); 
$info = array_combine($matches[1], $matches[2]);
print_r($info);

This almost works, except that everything after @todo (including the bar() comment block and code) is considered the value of @todo:

Array (
    [file] => A lot of info about this file Could even continue on the next line
    [author] => [email protected]
    [version] => 2010-05-01
    [todo] => do stuff... /

    /** Comment bij functie bar()
    [param] => Array met dingen /
    function bar() {
      echo ;
    }
)

How does my code need to be altered so that only the first comment block is parsed (in other words, so that parsing stops after the first "*/" encountered)?

+3  A: 

Writing a parser using PCRE will get you into trouble. I would suggest relying on the tokenizer (http://www.php.net/tokenizer) or reflection (http://www.php.net/manual/en/reflectionclass.getdoccomment.php) to extract the comment first. After that, it is safer to implement an actual parser for the doc block, one that can handle all situations supported by the phpdoc format (which is what all the libraries ended up doing as well).
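As a minimal sketch of the tokenizer route, assuming PHP 7.1 or later: the helper below (firstDocComment() is a hypothetical name, not part of any library) returns the text of the first T_DOC_COMMENT token and stops looping as soon as it is found. Note that token_get_all() still tokenizes the whole source up front; only the iteration ends early. The source must also contain an opening <?php tag, otherwise everything is reported as inline HTML.

```php
<?php
// Hypothetical helper: returns the text of the first doc comment
// in $source, or null when the source contains none.
function firstDocComment(string $source): ?string
{
    foreach (token_get_all($source) as $token) {
        // Single-character tokens such as ';' are plain strings, not arrays.
        if (is_array($token) && $token[0] === T_DOC_COMMENT) {
            return $token[1]; // stop at the first doc comment
        }
    }
    return null;
}
```

The returned string is the raw comment including the /** and */ delimiters, so it can be handed to whatever doc block parser is used afterwards.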

Pierre
Thanks for the quick reply. In reality, I have to loop through many files, collecting the FIRST comment block of every file (only the one describing the file; I don't need the other comment blocks describing functions, methods, etc.). The downside of using the tokenizer is that I can't tell token_get_all() to stop looking for comment blocks after the first one has been found. This results in a **huge** array which takes about 20-30 seconds to compile, which is too long since I have to recompile on every page request (don't ask...).
Reveller
The advantage of regex is that one could instruct it to stop looking after the first comment block of a file has been found, resulting in better performance. Or is there a workaround (see my code below using the tokenizer)?

foreach ($files as $file) {
  // token_get_all() expects a string, so read the file as one string
  $data = file_get_contents("$file.inc.php");
  $tokens = token_get_all($data);
  foreach ($tokens as $token) {
    // single-character tokens are plain strings, not arrays
    if (!is_array($token)) { continue; }
    list($id, $text) = $token;
    switch ($id) {
      case T_DOC_COMMENT:
        $return[] = $token;
        break;
      default:
        break;
    }
  }
  print_r($return);
}
Reveller
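For the regex route discussed in these comments, one possible sketch (assuming PHP 7 or later) is to isolate the first /** ... */ block with a non-greedy match and only then apply the two patterns from the question to that block. The function name parseFirstDocBlock() is hypothetical, and this is not a full phpDoc parser: a repeated tag such as two @param lines would keep only the last value, because array_combine() overwrites duplicate keys.

```php
<?php
// Hypothetical helper: parses the @tags of the first /** ... */ block only.
function parseFirstDocBlock(string $data): array
{
    // Non-greedy match: ends at the first "*/" that follows the first "/**".
    if (!preg_match('~/\*\*(.*?)\*/~s', $data, $block)) {
        return [];
    }
    // Same clean-up as in the question: drop the leading " * " of each line.
    $body = trim(preg_replace('/\r?\n *\* */', ' ', $block[1]));
    // Same tag pattern as in the question, now applied to one block only.
    preg_match_all('/@([a-z]+)\s+(.*?)\s*(?=$|@[a-z]+\s)/s', $body, $m);
    return array_combine($m[1], $m[2]);
}
```

Because preg_match() stops at the first match, nothing after the first "*/" is examined. If the file-level block can be assumed to sit near the top of each file, reading only a prefix, for example file_get_contents($path, false, null, 0, 4096), would avoid loading whole files; the 4096-byte limit is an assumption, not something stated in the question.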