Hi

I'm trying to match a certain set of tags in a template file. However, I want the tags to be able to nest inside each other.

My regex is the following (with the /s modifier):

<!-- START (.*?) -->(.*?)<!-- END \1 -->

Tag example:

<!-- START yList -->
  y:{yList:NUM} | 
  <!-- START xList -->
    x:{xList:NUM} 
  <!-- END xList -->
  <!-- CARET xList -->
  <br>
<!-- END yList -->
<!-- CARET yList -->

Right now the match result is:

match 0:

group(0) (Whole match)

<!-- START yList --> 
 y 
 <!-- START xList --> 
   x 
 <!-- END xList --> 
 <!-- CARET xList --> 
 <br> 
<!-- END yList -->

group(1)

yList

group(2)

y 
<!-- START xList --> 
  x 
<!-- END xList --> 
<!-- CARET xList --> 
<br>

Obviously I want 2 matches instead of 1, but the nested tag set isn't matched. Is this possible with regex, or should I just keep applying the regex to the group(2) results until I find no new matches?

+5  A: 

Regular expressions are not suited for parsing arbitrary-depth tree structures. It may be possible, depending on the regex flavor you are using, but it is not recommended: such expressions are difficult to read and difficult to debug.

I would suggest writing a simple parser instead. What you do is decompose your text into a set of possible tokens which can each be defined by simple regular expressions, e.g.:

START_TOKEN = "<!-- START [A-Za-z]+ -->"
END_TOKEN = ...
HTML_TEXT = ...

Iterate over your string, and as long as you match these tokens, pull them out of the string, and store them in a separate list. Be sure to save the text that was inside the token (if any) when you do this.
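A rough sketch of that tokenizing step in PHP. The names `TAG_PATTERN` and `tokenize` are made up for this example, and the tag-name rule (a letter followed by letters or digits) is an assumption about your templates, so adjust it as needed:

```php
<?php
// Hypothetical pattern: one regex covering all three tag kinds.
// The tag-name rule [A-Za-z][A-Za-z0-9]* is an assumption, not taken from the question.
const TAG_PATTERN = '/<!-- (START|END|CARET) ([A-Za-z][A-Za-z0-9]*) -->/';

/**
 * Split a template into a flat list of tokens, in document order.
 * Each token is ['type' => 'START'|'END'|'CARET'|'TEXT', 'value' => string].
 */
function tokenize(string $template): array
{
    $tokens = [];
    $offset = 0;
    $length = strlen($template);
    while ($offset < $length) {
        // Look for the next tag at or after the current offset.
        if (preg_match(TAG_PATTERN, $template, $m, PREG_OFFSET_CAPTURE, $offset)) {
            [$tagText, $tagPos] = $m[0];
            if ($tagPos > $offset) {
                // Anything between the previous tag and this one is plain text.
                $tokens[] = ['type' => 'TEXT', 'value' => substr($template, $offset, $tagPos - $offset)];
            }
            // Save the tag kind and the name that was inside the tag.
            $tokens[] = ['type' => $m[1][0], 'value' => $m[2][0]];
            $offset = $tagPos + strlen($tagText);
        } else {
            // No more tags: the rest of the string is text.
            $tokens[] = ['type' => 'TEXT', 'value' => substr($template, $offset)];
            break;
        }
    }
    return $tokens;
}

$tokens = tokenize('a<!-- START x -->b<!-- END x -->');
```

For that small input the list holds four tokens: text, a START tag, text, and an END tag. Nothing is nested yet at this stage; the list is deliberately flat.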

Then you can iterate over your list of tokens, and based on the token types you can create a nested tree structure of nodes, each containing 1) the text of the original token and 2) a list of child nodes.
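One way to sketch the tree-building step in PHP is a small recursive function. The token shape (`type` / `value` arrays), the node shape (`name` / `children`) and the function name `parseNodes` are all assumptions for illustration, not a fixed format:

```php
<?php
/**
 * Turn a flat token list into nested nodes.
 * Assumed token shape: ['type' => 'START'|'END'|'CARET'|'TEXT', 'value' => string].
 * $pos is passed by reference so nested calls continue where the caller left off.
 */
function parseNodes(array $tokens, int &$pos, ?string $openName = null): array
{
    $children = [];
    while ($pos < count($tokens)) {
        $token = $tokens[$pos++];
        switch ($token['type']) {
            case 'START':
                // Recurse: everything up to the matching END becomes this node's children.
                $children[] = [
                    'name'     => $token['value'],
                    'children' => parseNodes($tokens, $pos, $token['value']),
                ];
                break;
            case 'END':
                if ($token['value'] !== $openName) {
                    throw new RuntimeException("Unexpected END {$token['value']}");
                }
                return $children;
            default:
                // TEXT and CARET tokens are kept as leaf entries.
                $children[] = $token;
        }
    }
    if ($openName !== null) {
        throw new RuntimeException("Missing END $openName");
    }
    return $children;
}

// Hand-written tokens mirroring the yList/xList example from the question.
$tokens = [
    ['type' => 'START', 'value' => 'yList'],
    ['type' => 'TEXT',  'value' => 'y'],
    ['type' => 'START', 'value' => 'xList'],
    ['type' => 'TEXT',  'value' => 'x'],
    ['type' => 'END',   'value' => 'xList'],
    ['type' => 'CARET', 'value' => 'xList'],
    ['type' => 'END',   'value' => 'yList'],
];
$pos = 0;
$tree = parseNodes($tokens, $pos);
```

Here `$tree` has a single top-level node named yList, and the xList node sits among its children. Mismatched or missing END tags raise an exception instead of silently producing a wrong tree.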

You may want to look at some parser tutorials if this seems too complicated.

Fragsworth
Interesting. Can you recommend any parsing tutorials?
meder
A: 

You could do something like this:

// Split on the tags but keep them (PREG_SPLIT_DELIM_CAPTURE), so text and tags alternate.
$parts = preg_split('/(<!-- (?:START|END|CARET) [a-zA-Z][a-zA-Z0-9]* -->)/', $str, -1, PREG_SPLIT_DELIM_CAPTURE);
$tokens = array();
// The first part is always the (possibly empty) text before the first tag.
$isTag = false;
foreach ($parts as $part) {
    if ($isTag) {
        // Store each tag as array(type, name).
        preg_match('/^<!-- (START|END|CARET) ([a-zA-Z][a-zA-Z0-9]*) -->$/', $part, $match);
        $tokens[] = array($match[1], $match[2]);
    } else {
        // Plain text between tags; skip empty strings.
        if ($part !== '') $tokens[] = $part;
    }
    $isTag = !$isTag;
}
var_dump($tokens);

That will give you a flat list of tokens (tags and the text between them) that reflects the structure of your template.

Gumbo