ansaurus

Question

Regex to parse define() contents, possible?

Answer 1

A:

You might not need to go overboard with the regex complexity - something like this will probably suffice

 /DEFINE\('(.*?)',\s*'(.*)'\);/i

Here's a PHP sample showing how you might use it

$lines=file("myconstants.php");
foreach($lines as $line) {
    $matches=array();
    if (preg_match('/DEFINE\(\'(.*?)\',\s*\'(.*)\'\);/i', $line, $matches)) {
        $name=$matches[1];
        $value=$matches[2];

        echo "$name = $value\n";
    }

}

Paul Dixon 2009-03-14 12:55:38

Thanks, Paul. This only checks for the pattern define('text', 'value'), right? I mean, if I wanted to gather the text and the value next, how would I do so?

Ahmad Fouad 2009-03-14 13:04:39

I am impressed by your fast response. Many thanks

Ahmad Fouad 2009-03-14 13:21:23

Worked like a charm! ~ saved me a lot!

Ahmad Fouad 2009-03-14 13:42:03

Err... this fails on any trivial variation, e.g. define("blah", "foo"). It also fails on any spacing other than what you have, defines spanning multiple lines, heredocs and so forth.

cletus 2009-03-14 14:04:10

This even fails on define('const','foo');

cletus 2009-03-14 14:08:39

cletus, I like this because it's very short and suits my needs perfectly. Assume that the translations are always in single quotes and are always strings. Is there any way to make it not fail on define('constant','value'), i.e. when the space after the comma is missing?

Ahmad Fouad 2009-03-14 14:31:34

I've modified it. The tokeniser approach is a better all-round approach, but given the tightly-defined problem, this is a concise alternative.

Paul Dixon 2009-03-14 14:36:02

Paul, it works now even if I miss the space after the comma, thanks! I think this is short and solves my tightly-defined problem, as you said. The tokeniser method also works perfectly! I am now unsure which method to go for...

Ahmad Fouad 2009-03-14 14:38:55

Answer 2

+1 A:

This is possible, but I would rather use get_defined_constants(). But make sure all your translations have something in common (like all translations starting with T), so you can tell them apart from other constants.
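
As a rough sketch of that idea (not necessarily how the author would write it): get_defined_constants(true) groups constants by category, and the 'user' group holds those created with define(). The TR_ prefix, the constant names and the helper constants_with_prefix() are assumptions made up for this illustration. Note that this only sees constants that have actually been defined at runtime, so the translation file has to be included first.

```php
<?php
// Hypothetical constants; in practice these would come from the included file.
define('TR_HELLO', 'Hello');
define('TR_BYE', 'Goodbye');
define('DB_HOST', 'localhost');

// Hypothetical helper: return user-defined constants whose name starts with $prefix.
function constants_with_prefix(string $prefix): array
{
    $all  = get_defined_constants(true);   // constants grouped by category
    $user = $all['user'] ?? array();       // only those created via define()/const
    $result = array();
    foreach ($user as $name => $value) {
        if (strpos($name, $prefix) === 0) { // name begins with the prefix
            $result[$name] = $value;
        }
    }
    return $result;
}

print_r(constants_with_prefix('TR_'));
```

Restricting the lookup to the 'user' category keeps built-in constants out of the result even if they happen to share the prefix.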

soulmerge 2009-03-14 12:57:05

But doing this, wouldn't I have to edit too many lines? They do not have anything in common, which is why I thought I could use a regex.

Ahmad Fouad 2009-03-14 13:02:00

Answer 3

A:

Try this regular expression to find the define calls:

 /\bdefine\(\s*("(?:[^"\\]+|\\(?:\\\\)*.)*"|'(?:[^'\\]+|\\(?:\\\\)*.)*')\s*,\s*("(?:[^"\\]+|\\(?:\\\\)*.)*"|'(?:[^'\\]+|\\(?:\\\\)*.)*')\s*\);/is

So:

$pattern = '/\\bdefine\\(\\s*("(?:[^"\\\\]+|\\\\(?:\\\\\\\\)*.)*"|\'(?:[^\'\\\\]+|\\\\(?:\\\\\\\\)*.)*\')\\s*,\\s*("(?:[^"\\\\]+|\\\\(?:\\\\\\\\)*.)*"|\'(?:[^\'\\\\]+|\\\\(?:\\\\\\\\)*.)*\')\\s*\\);/is';
$str = '<?php define(\'foo\', \'bar\'); define("define(\\\'foo\\\', \\\'bar\\\')", "define(\'foo\', \'bar\')"); ?>';
preg_match_all($pattern, $str, $matches, PREG_SET_ORDER);
var_dump($matches);

I know that eval is evil. But that’s the best way to evaluate the string expressions:

$constants = array();
foreach ($matches as $match) {
    // $match[1] is the quoted name literal, $match[2] the quoted value literal
    eval('$constants['.$match[1].'] = '.$match[2].';');
}
var_dump($constants);
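
If the literals are known to be single-quoted, one possible way to avoid eval is to decode them by hand. The following is only a sketch under that assumption, and decode_single_quoted() is a hypothetical helper name, not something from the answer above. It relies on the fact that \\ and \' are the only escape sequences inside single-quoted PHP strings; double-quoted literals have many more and are not handled here.

```php
<?php
// Hypothetical helper: decode a single-quoted PHP string literal
// (surrounding quotes included) without calling eval().
function decode_single_quoted(string $literal): string
{
    // Drop the opening and closing quote characters.
    $body = substr($literal, 1, -1);
    // Replace \\ with \ and \' with '; any other backslash stays as it is.
    return preg_replace("/\\\\([\\\\'])/", '$1', $body);
}

// The literal 'it\'s' exactly as it would appear in source code.
$literal = "'it\\'s'";
echo decode_single_quoted($literal), "\n";
```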

Gumbo 2009-03-14 13:30:16

Answer 4

+7 A:

For any kind of grammar-based parsing, regular expressions are usually an awful solution. Even simple grammars (like arithmetic) have nesting, and it's on nesting (in particular) that regular expressions just fall over.

Fortunately PHP provides a far, far better solution for you by giving you access to the same lexical analyzer used by the PHP interpreter via the token_get_all() function. Give it a character stream of PHP code and it'll parse it into tokens ("lexemes"), which you can do a bit of simple parsing on with a pretty simple finite state machine.

Run this program (it's run as test.php so it tries it on itself). The file is deliberately formatted badly so you can see it handles that with ease.

<?php
    define('CONST1', 'value'   );
define   (CONST2, 'value2');
define(   'CONST3', time());
  define('define', 'define');
    define("test", VALUE4);
define('const5', //

'weird declaration'
)    ;
define('CONST7', 3.14);
define ( /* comment */ 'foo', 'bar');
$defn = 'blah';
define($defn, 'foo');
define( 'CONST4', define('CONST5', 6));

header('Content-Type: text/plain');

$defines = array();
$state = 0;
$key = '';
$value = '';

$file = file_get_contents('test.php');
$tokens = token_get_all($file);
$token = reset($tokens);
while ($token) {
//    dump($state, $token);
    if (is_array($token)) {
        if ($token[0] == T_WHITESPACE || $token[0] == T_COMMENT || $token[0] == T_DOC_COMMENT) {
            // do nothing
        } else if ($token[0] == T_STRING && strtolower($token[1]) == 'define') {
            $state = 1;
        } else if ($state == 2 && is_constant($token[0])) {
            $key = $token[1];
            $state = 3;
        } else if ($state == 4 && is_constant($token[0])) {
            $value = $token[1];
            $state = 5;
        }
    } else {
        $symbol = trim($token);
        if ($symbol == '(' && $state == 1) {
            $state = 2;
        } else if ($symbol == ',' && $state == 3) {
            $state = 4;
        } else if ($symbol == ')' && $state == 5) {
            $defines[strip($key)] = strip($value);
            $state = 0;
        }
    }
    $token = next($tokens);
}

foreach ($defines as $k => $v) {
    echo "'$k' => '$v'\n";
}

function is_constant($token) {
    return $token == T_CONSTANT_ENCAPSED_STRING || $token == T_STRING ||
        $token == T_LNUMBER || $token == T_DNUMBER;
}

function dump($state, $token) {
    if (is_array($token)) {
        echo "$state: " . token_name($token[0]) . " [$token[1]] on line $token[2]\n";
    } else {
        echo "$state: Symbol '$token'\n";
    }
}

function strip($value) {
    return preg_replace('!^([\'"])(.*)\1$!', '$2', $value);
}
?>

Output:

'CONST1' => 'value'
'CONST2' => 'value2'
'CONST3' => 'time'
'define' => 'define'
'test' => 'VALUE4'
'const5' => 'weird declaration'
'CONST7' => '3.14'
'foo' => 'bar'
'CONST5' => '6'

This is basically a finite state machine that looks for the pattern:

function name ('define')
open parenthesis
constant
comma
constant
close parenthesis

in the lexical stream of a PHP source file and treats the two constants as a (name,value) pair. In doing so it handles nested define() statements (as per the results) and ignores whitespace and comments as well as working across multiple lines.

Note: I've deliberately made it ignore the case when functions and variables are constant names or values, but you can extend it to that as you wish.

It's also worth pointing out that PHP is quite forgiving when it comes to strings. They can be declared with single quotes, double quotes or (in certain circumstances) with no quotes at all. An unquoted string can (as pointed out by Gumbo) be an ambiguous reference to a constant, and you have no guaranteed way of knowing which it is, giving you the choice of:

1. Ignoring that style of strings (T_STRING);
2. Seeing if a constant has already been declared with that name and substituting its value. There's no way you can know what other files have been called, though, nor can you process any defines that are conditionally created, so you can't say with any certainty whether anything is definitely a constant or what value it has; or
3. Living with the possibility that these might be constants (which is unlikely) and just treating them as strings.

Personally I would go for (1) then (3).
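
For completeness, the second option could be sketched roughly as below, using the built-in defined() and constant() functions. The helper name resolve_bare_word() and the constant GREETING are made up for this illustration, and the sketch assumes the constants of interest have already been defined in the running script, which (as noted above) cannot be guaranteed in general.

```php
<?php
// Hypothetical constant standing in for one declared earlier in the script.
define('GREETING', 'hello');

// Hypothetical helper: if the bare word names a known constant, return its
// value; otherwise fall back to treating the word itself as a string.
function resolve_bare_word(string $word)
{
    return defined($word) ? constant($word) : $word;
}

var_dump(resolve_bare_word('GREETING'));        // value of the constant
var_dump(resolve_bare_word('UNDECLARED_NAME')); // falls back to the word itself
```

In the state machine this would be applied only to T_STRING tokens, since quoted strings and numbers are never ambiguous.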

cletus 2009-03-14 13:34:31

What if CONST2 is already a constant? `define('foo', 'bar'); define(foo, 'baz');` => foo='bar', bar='baz'

Gumbo 2009-03-14 13:38:53

CONST2 is a T_STRING constant. With extra checking you could see whether you get a T_STRING constant and then use defined() on it, getting the value or, if it's not defined, treating it as a string (as PHP does).

cletus 2009-03-14 13:42:40

“CONST2 is a T_STRING constant.” – Oh, I forgot: it’s PHP. ;)

Gumbo 2009-03-14 13:44:18

I am so impressed; it returns the output exactly as described. Right now both solutions are good and working. Thanks to both of you.

Ahmad Fouad 2009-03-14 13:51:25

Answer 5

A:

Not every problem with text should be solved with a regexp, so I'd suggest you state what you want to achieve and not how.

So, instead of using PHP's parser, which is not really useful, or a completely undebuggable regexp, why not write a simple parser?

<?php

$str = "define('nam\\'e', 'va\\\\\\'lue');\ndefine('na\\\\me2', 'value\\'2');\nDEFINE('a', 'b');";

// Find every define(...) line and parse its two string arguments.
function getDefined($str) {
    $lines = array();
    preg_match_all('#^define[(][ ]*(.*?)[ ]*[)];$#mi', $str, $lines);

    $res = array();
    foreach ($lines[1] as $cnt) {
        $p = 0;
        $key = parseString($cnt, $p);
        // Skip comma
        $p++;
        // Skip spaces
        while ($p < strlen($cnt) && $cnt[$p] == " ") {
            $p++;
        }
        $value = parseString($cnt, $p);

        $res[$key] = $value;
    }

    return $res;
}

// Parse one quoted string starting at offset $p and advance $p past the
// closing quote. Square-bracket offsets are used because the curly-brace
// form ($s{$p}) was removed in PHP 8.
function parseString($s, &$p) {
    $quotechar = $s[$p];
    if (! in_array($quotechar, array("'", '"'))) {
        throw new Exception("Invalid quote character '" . $quotechar . "', input is " . var_export($s, true) . " @ " . $p);
    }

    $len = strlen($s);
    $quoted = false; // true when the previous character was a backslash
    $res = "";

    for ($p++; $p < $len; $p++) {
        if ($quoted) {
            $quoted = false;
            $res .= $s[$p];
        } else {
            if ($s[$p] == "\\") {
                $quoted = true;
                continue;
            }
            if ($s[$p] == $quotechar) {
                $p++;
                return $res;
            }
            $res .= $s[$p];
        }
    }

    throw new Exception("Premature end of line");
}

var_dump(getDefined($str));

Output:

array(3) {
  ["nam'e"]=>
  string(7) "va\'lue"
  ["na\me2"]=>
  string(7) "value'2"
  ["a"]=>
  string(1) "b"
}

phihag 2009-03-14 14:02:04
