I am trying to match a series of text strings with PCRE in PHP, and am having trouble getting all the matches between the first and the last.

If anyone wonders why on Earth I would want to do this, it's because of Doc Comments. Oh, how I wish Zend would make native/plugin functions to read Doc Comments from a PHP file...

The following example (plain) text will be used for the problem. It will always be pure PHP code, with only one opening tag at the beginning of the file and no closing tag. You can assume that the syntax will always be correct.

<?php
  class someClass extends someExample
  {
    function doSomething($someArg = 'someValue')
    {
      // Nested code blocks...
      if($boolTest){}
    }
    private function killFurbies(){}
    protected function runSomething(){}
  }

  abstract
  class anotherClass
  {
    public function __construct(){}
    abstract function saveTheWhales();
  }

  function globalFunc(){}

Problem

I am trying to match all methods in a class, but my regex does not find the method killFurbies() at all. Making the middle part greedy means it only matches the last method in a class, and making it lazy means it only matches the first method.

$part = '.*';  // Greedy
$part = '.*?'; // Lazy

$regex = '%class(?:\\n|\\r|\\s)+([a-zA-Z_\\x7f-\\xff][a-zA-Z0-9_\\x7f-\\xff]*)'
       . '.*?\{' . $part .'(?:(public|protected|private)(?:\\n|\\r|\\s)+)?'
       . 'function(?:\\n|\\r|\\s)+([a-zA-Z_\\x7f-\\xff][a-zA-Z0-9_\\x7f-\\xff'
       . ']*)(?:\\n|\\r|\\s)*\\(%ms';

preg_match_all($regex, file_get_contents(__EXAMPLE__), $matches, PREG_SET_ORDER);
var_dump($matches);

Results in:

// Lazy:
array(2) {
  [0]=>
  array(4) {
    [0]=>
    // Omitted.
    [1]=>
    string(9) "someClass"
    [2]=>
    string(0) ""
    [3]=>
    string(11) "doSomething"
  }
  [1]=>
  array(4) {
    [0]=>
    // Omitted.
    [1]=>
    string(12) "anotherClass"
    [2]=>
    string(6) "public"
    [3]=>
    string(11) "__construct"
  }
}

// Greedy:
array(2) {
  [0]=>
  array(4) {
    [0]=>
    // Omitted.
    [1]=>
    string(9) "someClass"
    [2]=>
    string(0) ""
    [3]=>
    string(13) "saveTheWhales"
  }
  [1]=>
  array(4) {
    [0]=>
    // Omitted.
    [1]=>
    string(12) "anotherClass"
    [2]=>
    string(0) ""
    [3]=>
    string(13) "saveTheWhales"
  }
}

How do I match all? :S
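
The same behaviour can be reproduced on a much smaller input. The sketch below is only an illustration: the one-line `$text` and the simplified patterns are made up, not the regex from above. It shows that a pattern anchored on `class` produces one match per class, so each capture group holds a single method name, and the method in the middle is never reported.

```php
<?php
// Made-up, reduced input: one class with three methods.
$text = 'class A { function one(){} function two(){} function three(){} }';

// Lazy: the shortest stretch between the class name and a method name.
preg_match_all('%class\s+(\w+).*?function\s+(\w+)%s', $text, $lazy, PREG_SET_ORDER);

// Greedy: the longest stretch, so the last method wins.
preg_match_all('%class\s+(\w+).*function\s+(\w+)%s', $text, $greedy, PREG_SET_ORDER);

echo $lazy[0][2], "\n";   // one
echo $greedy[0][2], "\n"; // three

// Without the class prefix, every method becomes a match of its own.
preg_match_all('%function\s+(\w+)%', $text, $all);
print_r($all[1]);         // one, two, three
```

A capture group inside a single match only ever keeps one value, which is why "two" (like killFurbies() above) shows up in neither variant.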

Any help would be greatly appreciated, as I already feel this question is ridiculous as I'm typing it out. Anyone attempting to answer a question like this is braver than me!

Thanks, mniz.

A: 

It is better to use token_get_all to get the tokens of the PHP code and iterate over them. PHPDoc-style comment tokens can be identified by the token type T_DOC_COMMENT.
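
A minimal sketch of that approach, assuming the tokenizer extension is available. The helper name `collect_doc_comments` and the sample source are made up for illustration:

```php
<?php
// Hypothetical helper: returns every PHPDoc-style comment found in $source.
function collect_doc_comments($source)
{
    $docs = array();
    foreach (token_get_all($source) as $token) {
        // A token is either a single-character string or an array of
        // (token id, text, line number).
        if (is_array($token) && $token[0] === T_DOC_COMMENT) {
            $docs[] = $token[1];
        }
    }
    return $docs;
}

$source = "<?php\n/** Class doc */\nclass Demo\n{\n  /** Method doc */\n  function run(){}\n  // plain comment\n}\n";
$docs = collect_doc_comments($source);
print_r($docs);
```

Plain `//` and `/* */` comments come back as T_COMMENT, so they are skipped here.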

Gumbo
Never thought of using tokens. I was reading up on them a couple of days ago too! All I need to do now is find the T_DOC_COMMENTs that come directly before a method definition, and then find the class definition before it for the name of the class. Fun times :)
mynameiszanders
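
The plan in that comment could be sketched roughly as follows. This is one possible token walk, not the code the asker ended up with: the helper name `method_docs` and the sample source are hypothetical, and it deliberately ignores edge cases such as reference-returning functions, attributes, or keyword-named methods.

```php
<?php
// Hypothetical helper: maps class name => method name => doc comment,
// or false when no doc comment sits directly before the method.
function method_docs($source)
{
    $docs = array();
    $class = null;       // Class whose body is currently being walked.
    $classDepth = null;  // Brace depth of that class body.
    $depth = 0;
    $pendingDoc = false; // Doc comment seen since the last "real" token.
    $expect = null;      // 'class' or 'function' while waiting for a name.
    $keep = array(T_PUBLIC, T_PROTECTED, T_PRIVATE, T_STATIC, T_ABSTRACT, T_FINAL);

    foreach (token_get_all($source) as $token) {
        $id = is_array($token) ? $token[0] : $token;
        if ($id === T_WHITESPACE || in_array($id, $keep, true)) {
            continue; // Whitespace and modifiers may sit between doc and name.
        }
        if ($id === T_DOC_COMMENT) {
            $pendingDoc = $token[1];
            continue;
        }
        if ($id === T_CLASS || $id === T_FUNCTION) {
            $expect = ($id === T_CLASS) ? 'class' : 'function';
            continue;
        }
        if ($id === T_STRING && $expect === 'class') {
            $class = $token[1];
            $classDepth = null;
        } elseif ($id === T_STRING && $expect === 'function'
            && $class !== null && $depth === $classDepth) {
            // Only functions declared directly in the class body count.
            $docs[$class][$token[1]] = $pendingDoc;
        } elseif ($id === '{' || $id === T_CURLY_OPEN || $id === T_DOLLAR_OPEN_CURLY_BRACES) {
            $depth++;
            if ($id === '{' && $class !== null && $classDepth === null) {
                $classDepth = $depth;
            }
        } elseif ($id === '}') {
            if ($depth === $classDepth) {
                $class = null; // Leaving the class body.
                $classDepth = null;
            }
            $depth--;
        }
        // Any other token separates a doc comment from what follows.
        $pendingDoc = false;
        $expect = null;
    }
    return $docs;
}

$source = <<<'PHP'
<?php
/** Class doc */
class someClass extends someExample
{
  /** Does something */
  function doSomething($someArg = 'someValue')
  {
    if($boolTest){}
  }
  private function killFurbies(){}
}
abstract class anotherClass
{
  /** Saves them */
  abstract function saveTheWhales();
}
/** Not a method */
function globalFunc(){}
PHP;

$docs = method_docs($source);
print_r($docs);
```

Tracking the brace depth is what keeps globalFunc() and any closures inside method bodies out of the result.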
A: 

Err, can't you just parse the source using token_get_all and look for the tokens of type T_DOC_COMMENT (changed from T_COMMENT to T_DOC_COMMENT, see Gumbo's post)?

An example of how to use this token_get_all function can be found here.

Bart Kiers
A: 

Solution

I've come up with a class to extract Doc Comments for classes and methods in a file. Thanks to everyone who answered this question, and the other one on matching code blocks.

The average benchmark for the following example is between 0.00495 and 0.00505 seconds.

<?php

$file = 'path/to/libraries/tokenizer.php';
include $file;
$tokenizer = new Tokenizer;
// Start Benchmarking here.
$tokenizer->load($file);
// End Benchmarking here.
// The following will output 'bool(false)'.
var_dump($tokenizer->get_doc('Tokenizer', 'get_tokens'));
// The following will output 'string(18) "/** load method */"'.
var_dump($tokenizer->get_doc('Tokenizer', 'load'));
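
The post does not say how the benchmark was taken. One common way to get such an average is to time repeated runs with microtime(true). The sketch below assumes that approach; `average_time` is a made-up helper, and it times a bare token_get_all call on a made-up source string rather than the Tokenizer class itself.

```php
<?php
// Hypothetical helper: average wall-clock seconds of $callback over $runs calls.
function average_time($callback, $runs = 100)
{
    $total = 0.0;
    for ($i = 0; $i < $runs; $i++) {
        $start = microtime(true);
        call_user_func($callback);
        $total += microtime(true) - $start;
    }
    return $total / $runs;
}

// Made-up workload: tokenize a tiny source string.
$source = '<?php class Demo { /** doc */ function run(){} }';
$average = average_time(function () use ($source) {
    token_get_all($source);
}, 50);

printf("%.5f seconds\n", $average);
```

To time the class from this answer instead, the closure body would call $tokenizer->load($file).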

Tokenizer (yes, I still haven't thought of a better name for it...) Class:

<?php

class Tokenizer
{

  private $compiled = false, $path = false, $tokens = false, $classes = array();

  /** load method */
  public function load($path)
  {
    $path = realpath($path);
    if(!file_exists($path) || !function_exists('token_get_all'))
    {
      return false;
    }
    $this->compiled = false;
    $this->classes = array();
    $this->path = $path;
    $this->tokens = false;

    $this->get_tokens();
    $this->get_classes();
    $this->class_blocks();
    $this->class_functions();
    return true;
  }

  protected function get_tokens()
  {
    $tokens = token_get_all(file_get_contents($this->path));
    $compiled = '';
    foreach($tokens as $k => $t)
    {
      if(is_array($t) && $t[0] != T_WHITESPACE)
      {
        $compiled .= $k . ':' . $t[0] . ',';
      }
      else
      {
        if($t == '{' || $t == '}')
        {
          $compiled .= $t . ',';
        }
      }
    }
    $this->tokens = $tokens;
    $this->compiled = trim($compiled, ',');
  }

  protected function get_classes()
  {
    if(!$this->compiled)
    {
      return false;
    }
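    // The numbers below are token IDs as assigned by the PHP build this was written on
    // (366 appears to be T_DOC_COMMENT, 352 T_CLASS, 307 T_STRING). Token values can
    // differ between PHP versions, so check them with token_name() before reuse.
    $regex = '%(?:(\\d+)\\:366,)?(?:\\d+\\:(?:345|344|353),)?\\d+\\:352,(\\d+)\\:307,(?:\\d+\\:(?:354|355),\\d+\\:307,)*{%';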
    preg_match_all($regex, $this->compiled, $classes, PREG_SET_ORDER);
    if(is_array($classes))
    {
      foreach($classes as $class)
      {
        $this->classes[$this->tokens[$class[2]][1]] = array('token' => $class[2]);
        $this->classes[$this->tokens[$class[2]][1]]['doc'] = isset($this->tokens[$class[1]][1]) ? $this->tokens[$class[1]][1] : false;
      }
    }
  }

  private function class_blocks()
  {
    if(!$this->compiled)
    {
      return false;
    }
    foreach($this->classes as $class_name => $class)
    {
      $this->classes[$class_name]['block'] = $this->get_block($class['token']);
    }
  }

  protected function get_block($name_token)
  {
    if(!$this->compiled || ($pos = strpos($this->compiled, $name_token . ':')) === false)
    {
      return false;
    }
    $section = substr($this->compiled, $pos);
    $len = strlen($section);
    $block = '';
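    // Start at zero: starting at 1 means the counts never balance in well-formed code,
    // so the loop would run to the end of the file instead of stopping at the block's end.
    $opening = 0;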
    $closing = 0;
    for($i = 0; $i < $len; $i++)
    {
      if($section[$i] == '{')
      {
        $opening++;
      }
      elseif($section[$i] == '}')
      {
        $closing++;
        if($closing == $opening)
        {
          break;
        }
      }
      if($opening > 0)
      {
        $block .= $section[$i];
      }
    }
    return trim($block, ',');
  }

  protected function class_functions()
  {
    if(!$this->compiled)
    {
      return false;
    }
    foreach($this->classes as $class_name => $class)
    {
      $regex = '%(?:(\d+)\:366,)?(?:\d+\:(?:344|345),)?(?:\d+\:(?:341|342|343),)?\d+\:333,(\d+)\:307,\{%';
      preg_match_all($regex, $class['block'], $functions, PREG_SET_ORDER);
      foreach($functions as $function)
      {
        $function_name = $this->tokens[$function[2]][1];
        $this->classes[$class_name]['functions'][$function_name] = array('token' => $function[2]);
        $this->classes[$class_name]['functions'][$function_name]['doc'] = isset($this->tokens[$function[1]][1]) ? $this->tokens[$function[1]][1] : false;
        $this->classes[$class_name]['functions'][$function_name]['block'] = $this->get_block($function[2]);
      }
    }
  }

  public function get_doc($class, $function = false)
  {
    if(!is_string($class) || !isset($this->classes[$class]))
    {
      return false;
    }
    if(!is_string($function))
    {
      return $this->classes[$class]['doc'];
    }
    else
    {
      if(!isset($this->classes[$class]['functions'][$function]))
      {
        return false;
      }
      return $this->classes[$class]['functions'][$function]['doc'];
    }
  }

}
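
Because the class hard-codes numeric token IDs, it only works on PHP builds that use the same numbering. A hedged sketch of one way around that is to build the pattern from the named constants. It assumes the numbers in class_functions() stand for the doc comment, abstract/final, visibility, function and name tokens, which their positions in the regex suggest but the post does not confirm. `function_pattern` and `compile_tokens` are made-up names.

```php
<?php
// Hypothetical helper: the method-matching pattern, built from token
// constants so it follows whatever IDs the running PHP version uses.
function function_pattern()
{
    $modifiers  = implode('|', array(T_ABSTRACT, T_FINAL));
    $visibility = implode('|', array(T_PUBLIC, T_PROTECTED, T_PRIVATE));
    return '%(?:(\d+):' . T_DOC_COMMENT . ',)?'
         . '(?:\d+:(?:' . $modifiers . '),)?'
         . '(?:\d+:(?:' . $visibility . '),)?'
         . '\d+:' . T_FUNCTION . ',(\d+):' . T_STRING . ',\{%';
}

// Same compaction idea as get_tokens(): "index:id" for every non-whitespace
// array token, plus literal braces; other single characters are dropped.
function compile_tokens(array $tokens)
{
    $parts = array();
    foreach ($tokens as $index => $token) {
        if (is_array($token)) {
            if ($token[0] !== T_WHITESPACE) {
                $parts[] = $index . ':' . $token[0];
            }
        } elseif ($token === '{' || $token === '}') {
            $parts[] = $token;
        }
    }
    return implode(',', $parts);
}

$tokens = token_get_all('<?php class Demo { /** doc */ public function run(){} }');
preg_match_all(function_pattern(), compile_tokens($tokens), $found, PREG_SET_ORDER);

$name = $tokens[$found[0][2]][1]; // Text of the captured name token.
$doc  = $tokens[$found[0][1]][1]; // Text of the captured doc comment token.
echo $name, ' => ', $doc, "\n";

// token_name() turns an ID back into its constant name for debugging.
echo token_name(T_DOC_COMMENT), "\n";
```

Like the original pattern, this only matches methods without parameters, since parameter tokens would sit between the name and the opening brace in the compiled string.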

Any thoughts or comments on this? All criticism welcome!

Thanks, mniz.

mynameiszanders