I need a REGEX that can find blocks of PHP code in a file. For example:

    <? print '<?xml version="1.0" encoding="UTF-8"?>';?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

    <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
    <head>
        <?php echo "stuff"; ?>
    </head>
    <html>

When parsed by the regex, this would return:

array(
    "<? print '<?xml version=\"1.0\" encoding=\"UTF-8\"?>';?>",
    "<?php echo \"stuff\"; ?>"
);

You can assume the PHP is valid.

A: 

Try the following regex with preg_match_all() (it returns every match, where preg_match() stops at the first):

/<\?(?:php)?\s+(.*?)\?>/

That's untested, but is a start. It assumes a closing PHP tag (arguably well-formed).
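
To see where a pattern like this starts to struggle, here is a minimal sketch that runs it over a trimmed copy of the question's sample. The `s` modifier and the variable names `$html` and `$matches` are my additions for illustration, not part of the pattern above.

```php
<?php
// Trimmed copy of the sample from the question (nowdoc, so nothing is interpolated).
$html = <<<'HTML'
<? print '<?xml version="1.0" encoding="UTF-8"?>';?>
<head>
    <?php echo "stuff"; ?>
</head>
HTML;

// The "s" modifier (an assumption added here) lets "." match newlines too.
preg_match_all('/<\?(?:php)?\s+(.*?)\?>/s', $html, $matches);

// $matches[0] holds the full matches, $matches[1] the captured bodies.
print_r($matches[0]);
```

The lazy `.*?` stops at the first `?>` it meets, so the first match ends inside the quoted XML declaration instead of at the real closing tag. The second block is matched in full.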

Jason McCreary
I need to deal with cases that have <?'s as strings (as in the example) inside other PHP tags.
Kendall Hopkins
PHP is not XML. It cannot be *well-formed* in that sense.
Gordon
@Gordon: "[a] well-formed formula [...] is a word [...] which is part of a formal language" http://en.wikipedia.org/wiki/Well-formed_formula
back2dos
@Kendall, yeah, that's a tough one. You may want to go with **Gumbo** above. On a side note, I don't encourage the use of PHP short tags.
Jason McCreary
@back2dos Words get their meaning from the context they are used in. *Well-formed* in the context of a *closing tag* refers to XML (or SGML), which PHP is not. In addition, outside a template context it is advisable to leave out the closing tag. This prevents whitespace after the closing tag from being output by the script and then interfering with sending headers.
Gordon
@Gordon: I understood it this way: PHP is a formal language and a closing tag is just a symbol. In this context, PHP source code may or may not be well-formed. I suppose it feels strange when a nesting construct is not closed by a terminating symbol, although apparently the language's grammar allows this.
back2dos
@Gordon: Regarding your other point, I agree half-heartedly, since pragmatically you're right. However, IMHO it's a flaw that the PHP parser also outputs whitespace-only content, specifically because PHP is intended for, and virtually always used to generate, whitespace-insensitive output. The most annoying thing is when it starts outputting BOMs. This problem should not have an impact on your coding style, but should be handled by your toolchain. If the compiler/interpreter fails, then I think writing a small tool that ensures this problem doesn't appear is better than adapting your coding style.
back2dos
@back2dos well, this would take too long to discuss and I was just nitpicking in the initial comment anyway, so let's leave it at that.
Gordon
@Jason McCreary The idea of this script is to translate templates done in PHP to Twig for security reasons. So I completely agree with you, I'm parsing other people's code, and was trying to give a good example.
Kendall Hopkins
@Kendall. I understand. **Gumbo** knocked this one out of the park. I'd go with that.
Jason McCreary
+2  A: 

This is the type of task that is much better suited for a custom parser. You could relatively easily construct one using a stack and I can guarantee you will be done much quicker and pull less hair out than you would trying to debug your regex.

Regular expressions are great tools when used appropriately but not all text parsing tasks are equal.

Miky Dinescu
A real parser isn't really necessary. A tokenizer will do the job. Luckily PHP has one built right in, as Gumbo pointed out. :)
back2dos
A: 

Try this regex (untested):

preg_match_all('@<\?.*?\?>@si',$html,$m);
print_r($m[0]);
turbod
A: 
<\?(?:php)?\s+.*?\?>$

with the following modifiers:

Dot match newlines

^$ match at line breaks
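
In PHP's preg functions those two options would be the `s` and `m` modifiers. A rough sketch, assuming `/` as the delimiter and "\n" line endings (the names `$html` and `$m` are only for illustration):

```php
<?php
// Trimmed copy of the sample from the question.
$html = <<<'HTML'
<? print '<?xml version="1.0" encoding="UTF-8"?>';?>
<head>
    <?php echo "stuff"; ?>
</head>
HTML;

// s: "." also matches newlines; m: "$" also matches just before a newline.
preg_match_all('/<\?(?:php)?\s+.*?\?>$/ms', $html, $m);

print_r($m[0]);
```

Because of the trailing `$`, a `?>` only counts as the end of a block when it sits at the end of a line. That gets past the `?>` inside the string in this sample, but it would miss a closing tag that is followed by more text on the same line.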

Liwen
+7  A: 

With token_get_all() you get the list of PHP language tokens for a given piece of PHP code. Then you just need to iterate over the list, look for the open tag tokens and collect everything up to the corresponding close tags.

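$blocks = array();   // finished PHP blocks, in source order
$buffer = '';        // text of the block currently being collected
$opened = false;     // true while inside a PHP block
foreach (token_get_all($code) as $token) {
    if (!$opened) {
        // Outside a block: ignore everything (inline HTML) until an open tag.
        if (is_array($token) && ($token[0] === T_OPEN_TAG || $token[0] === T_OPEN_TAG_WITH_ECHO)) {
            $opened = true;
            $buffer = $token[1];
        }
    } else {
        if (is_array($token)) {
            // Array tokens are array(id, text, line).
            $buffer .= $token[1];
            if ($token[0] === T_CLOSE_TAG) {
                $opened = false;
                $blocks[] = $buffer;
            }
        } else {
            // Single-character tokens such as ";" come as plain strings.
            $buffer .= $token;
        }
    }
}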
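
For anyone who wants to try it, here is a sketch of the same loop wrapped in a function and run on a trimmed sample. The function name `extract_php_blocks` is made up for illustration, and keeping a trailing block that has no closing tag is an extra assumption on my part. The sample uses `<?php` instead of `<?`, because as far as I know the tokenizer only treats `<?` as an open tag when short_open_tag is enabled.

```php
<?php
// Hypothetical helper: returns every PHP block found in $code.
function extract_php_blocks(string $code): array
{
    $blocks = [];
    $buffer = null; // null while outside a PHP block

    foreach (token_get_all($code) as $token) {
        // Array tokens are [id, text, line]; other tokens are plain strings.
        $id   = is_array($token) ? $token[0] : null;
        $text = is_array($token) ? $token[1] : $token;

        if ($buffer === null) {
            if ($id === T_OPEN_TAG || $id === T_OPEN_TAG_WITH_ECHO) {
                $buffer = $text; // a new block starts here
            }
            continue;
        }

        $buffer .= $text;
        if ($id === T_CLOSE_TAG) {
            $blocks[] = $buffer;
            $buffer = null;
        }
    }

    // Assumption: a file that ends without "?>" still counts as a block.
    if ($buffer !== null) {
        $blocks[] = $buffer;
    }
    return $blocks;
}

$code = <<<'HTML'
<?php print '<?xml version="1.0" encoding="UTF-8"?>';?>
<head>
    <?php echo "stuff"; ?>
</head>
HTML;

// T_CLOSE_TAG includes the newline right after "?>", hence the trim.
$blocks = array_map('trim', extract_php_blocks($code));
print_r($blocks);
```

The `?>` inside the quoted string is part of a string token, so it does not end the first block.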
Gumbo
I had no idea PHP supported this kind of self-parsing. Thank you so much.
Kendall Hopkins
That's really cool. Didn't know that either. I was also very surprised to discover this gem: http://www.php.net/manual/en/function.parsekit-compile-file.php
back2dos