When using preg_replace() in PHP with strings generated at runtime, one can protect special regex characters (such as '$' or '+') in the search string by using preg_quote(). But what's the correct way to handle this in the replacement string? Take this code for example:

<?php

$haystack = '...a bit of sample text...';
$replacement = '\\HELLO WORLD$1.+-';
$replacement_quoted = preg_quote($replacement);

var_dump('--replacement', $replacement, '--replacement_quoted',
    $replacement_quoted, '--haystack', $haystack);

$result1 = preg_replace("@(bit) (of) (sample)@is", "\${1}" . $replacement ."$3", $haystack);
$result2 = preg_replace("@(bit) (of) (sample)@is", "\${1}" . $replacement_quoted ."$3", $haystack);

$replacement_new1 = str_replace('$', '\$', $replacement);
$replacement_new2 = str_replace('\\', '\\\\', $replacement_new1);

$result3 = preg_replace("@(bit) (of) (sample)@is", "\${1}" . $replacement_new1 ."$3", $haystack);
$result4 = preg_replace("@(bit) (of) (sample)@is", "\${1}" . $replacement_new2 ."$3", $haystack);

var_dump('--result1 (not quoted)', $result1, '--result2 (quoted)', $result2,
    '--result3 ($ escaped)', $result3, '--result4 (\ and $ escaped)', $result3);

?>

Here's the output:

string(13) "--replacement"
string(17) "\HELLO WORLD$1.+-"
string(20) "--replacement_quoted"
string(22) "\\HELLO WORLD\$1\.\+\-"
string(10) "--haystack"
string(26) "...a bit of sample text..."
string(22) "--result1 (not quoted)"
string(40) "...a bit\HELLO WORLDbit.+-sample text..."
string(18) "--result2 (quoted)"
string(42) "...a bit\HELLO WORLD$1\.\+\-sample text..."
string(21) "--result3 ($ escaped)"
string(39) "...a bit\HELLO WORLD$1.+-sample text..."
string(27) "--result4 (\ and $ escaped)"
string(39) "...a bit\HELLO WORLD$1.+-sample text..."

As you can see, you can't win with preg_quote(). If you don't call it and just pass the string in unmodified (result1), anything that looks like a capture token ($1 above) gets replaced with whatever the corresponding capture group contained. If you do call it (result2), you have no problems with the capture groups, but any other special PCRE characters (such as . and + above) get escaped as well, and the escaping backslashes live on in the output. Also interesting to me is that both versions produce a single \ in the output.

Only by manually escaping characters, in particular the $, can you get this to work. This can be seen in result3 and result4. Continuing the oddness with the \, however, both result3, which escapes only $, and result4, which adds escaping for \, again produce a single \ in the output. Adding six \ characters at the beginning of the replacement string produces just two \ in the final output for result1, result3, and result4, and three of them for result2.

So, it would seem that most issues are taken care of by manually escaping the $ character. It seems like the \ character also needs to be escaped, but I need to think about that one some more to figure out exactly what's happening. In any case, this is all quite ugly - between the annoying \${1} syntax and having to manually escape certain characters, the code just smells really rotten and error-prone. Is there something I'm missing? Is there a clean way to do this?

A: 

<?php

$subject = array('1', 'a', '2', 'b', '3', 'A', 'B', '4');
$pattern = array('/\d/', '/[a-z]/', '/[1a]/');
$replace = array('A:$0', 'B:$0', 'C:$0');

echo "preg_filter returns\n";
print_r(preg_filter($pattern, $replace, $subject));

echo "preg_replace returns\n";
print_r(preg_replace($pattern, $replace, $subject));

?>

Output:

preg_filter returns
Array
(
    [0] => A:C:1
    [1] => B:C:a
    [2] => A:2
    [3] => B:b
    [4] => A:3
    [7] => A:4
)
preg_replace returns
Array
(
    [0] => A:C:1
    [1] => B:C:a
    [2] => A:2
    [3] => B:b
    [4] => A:3
    [5] => A
    [6] => B
    [7] => A:4
)

zod
Sorry, but I'm lost - what's the point of this? I see that it's taken directly from the manual page for `preg_filter()` at http://us3.php.net/preg_filter, but it doesn't really have anything to do with my question. And without any accompanying commentary....
mr. w
+1  A: 

Okay, well, I don't think there's any really satisfying way to handle this. There are two problem characters: \ and $. Other PCRE special characters do not appear to be special in the replacement.

In the case of \, things actually behave as one would expect: you need to escape it with \ both when defining it in a PHP string literal and again when passing it into preg_replace(). In my test code, I was simply confusing myself with the two layers of escaping. As for $, it should be left alone on the PHP side (in a single-quoted string) and escaped with \ going into preg_replace(). That's it.
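To make the two layers concrete, here is a minimal sketch that counts backslashes after each one. It reuses the "six \ characters" case from the question; the variable names and the /x/ pattern are only illustrative.

```php
<?php

// Layer 1: the PHP parser. Six backslashes in a single-quoted literal
// yield three actual backslash characters in the string.
$replacement = '\\\\\\HELLO';
$afterPhp = strlen($replacement) - strlen('HELLO');

// Layer 2: preg_replace(). In the replacement, a doubled backslash
// collapses to one, and the leftover backslash before "H" is kept
// as a literal character because "\H" is not a backreference.
$result = preg_replace('/x/', $replacement, 'x');
$afterPreg = substr_count($result, '\\');

var_dump($afterPhp, $afterPreg); // int(3), int(2)
```

This matches the observation in the question that six backslashes in the source end up as two in the output when the string is passed in unquoted.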

Here's some code to demonstrate all this:

<?php

ini_set('display_errors', 1);
ini_set('error_reporting', E_ALL | E_STRICT);

//real string: "test1 $1 test2 \\1 test3 \${1}"

//real string manually \-escaped once for representing as a PHP string
$test = 'test1 $1 test2 \\\\1 test3 \\${1}';
var_dump('--test (starting PHP string - should match real string)', $test);

$test = str_replace(array('\\', '$'), array('\\\\', '\\$'), $test);
var_dump('--test (PHP string $-escaped and \-escaped again for preg_replace)', $test);

$result = preg_replace("/bar/", $test, 'foo bar baz');

var_dump('--result - bar should be replaced with original real string', $result);

?>

Output:

string(55) "--test (starting PHP string - should match real string)"
string(30) "test1 $1 test2 \\1 test3 \${1}"
string(66) "--test (PHP string $-escaped and \-escaped again for preg_replace)"
string(35) "test1 \$1 test2 \\\\1 test3 \\\${1}"
string(59) "--result - bar should be replaced with original real string"
string(38) "foo test1 $1 test2 \\1 test3 \${1} baz"
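The escaping step above could be wrapped in a small helper. This is only a sketch: escape_preg_replacement() is a made-up name, not a PHP built-in, and the scalar type declarations assume PHP 7 or later.

```php
<?php

// Hypothetical helper (not part of PHP): escape a string so that
// preg_replace() inserts it literally as the replacement.
function escape_preg_replacement(string $literal): string
{
    // strtr() with an array never re-scans text it has already
    // replaced, so the backslash added in front of "$" is not
    // doubled again by the backslash rule.
    return strtr($literal, array('\\' => '\\\\', '$' => '\\$'));
}

$literal = 'test1 $1 test2 \\\\1 test3 \\${1}';
$result  = preg_replace('/bar/', escape_preg_replacement($literal), 'foo bar baz');

var_dump($result === 'foo ' . $literal . ' baz'); // bool(true)
```

Order matters if you use two separate str_replace() calls instead. The code in the question escapes $ first and \ second, so the backslash it just placed before $ gets doubled and $1 becomes a live backreference again. The question's output does not show this, because its final var_dump() prints $result3 twice.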

My feeling is that preg_quote() should be the solution here, and it would be if preg_replace() also stripped the backslash from escaped characters other than \ and $ (e.g., \+). However, it doesn't, forcing one to do the manual escaping. In fact, I would argue that this is a bug, and I will pursue filing it as such on php.net.
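One possible way to avoid the manual escaping altogether is preg_replace_callback(), whose callback return value is inserted as-is rather than parsed for backreferences. The sketch below reuses the pattern and strings from the question and assumes PHP 5.3 or later for the closure; it is an alternative to consider, not something tested in this thread.

```php
<?php

$haystack    = '...a bit of sample text...';
$replacement = '\\HELLO WORLD$1.+-'; // actual text: \HELLO WORLD$1.+-

// The string returned by the callback is used literally, so "$1" and
// "\" inside $replacement are not interpreted. Captured groups are
// read from the $m array instead of ${1} / $3 tokens.
$result = preg_replace_callback(
    '@(bit) (of) (sample)@is',
    function (array $m) use ($replacement) {
        return $m[1] . $replacement . $m[3];
    },
    $haystack
);

var_dump($result); // string(39) "...a bit\HELLO WORLD$1.+-sample text..."
```

The output is the same as result3 in the question, without any escaping of the replacement text.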

mr. w
I've filed a bug - [(#52962)](http://bugs.php.net/bug.php?id=52962).
mr. w