Greetings All,

I need to optimize a RegEx I am using to parse template tags in my CMS. A tag can be either a single tag or a matching pair. An example of some tags:

{static:input:title type="input"}

{static:image:picture}<img src="{$img.src}" width="{$img.width}" height="{$img.height}" />{/static:image:picture}

Here is the RegEx I currently have. It selects what I need, but when I ran it through the RegexBuddy debugger it took tens of thousands of steps to make a single match on a large HTML page.

{static([\w:]*)?\s?(.*?)}(?!"|')(?:((?:(?!{static\1).)*?){/static\1})?

When this matches a tag, Group 1 is the tag name, which is all the colon-separated words. Group 2 is the parameters. Group 3 (if it's a tag pair) is the content between the opening and closing tags.
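To make the three groups concrete, here is a small sketch that runs the pattern above against the two example tags. It uses Python's `re` module rather than PHP's PCRE functions purely for illustration, the braces are escaped for readability, and the helper name `groups_for` is made up for this example. Note that group 1 (the colon-separated words) includes the leading colon.

```python
import re

# The pattern from the question, with the literal braces escaped.
TAG = re.compile(
    r'''\{static([\w:]*)?\s?(.*?)\}(?!"|')'''
    r'''(?:((?:(?!\{static\1).)*?)\{/static\1\})?'''
)

single = '{static:input:title type="input"}'
pair = '{static:image:picture}<img src="{$img.src}" />{/static:image:picture}'


def groups_for(sample):
    """Hypothetical helper: return (group 1, group 2, group 3) for the first match."""
    match = TAG.search(sample)
    return match.groups() if match else None


print(groups_for(single))  # (':input:title', 'type="input"', None)
print(groups_for(pair))    # (':image:picture', '', '<img src="{$img.src}" />')
```

Group 3 is `None` for the single tag because the optional closing part of the pattern does not participate in the match.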

I'm also having problems when I put these tags inside my conditional tags. Something like this doesn't match Group 2 properly (Group 2 should be blank in both of the matched tags below):

{if "{static:image:image1}"!=""}
    <a href="{static:image:image1}" rel="example_group" title="Image 1"></a></li>
{/if}

Another situation that needs to work is the same tag being used twice in a row, with the first instance used as a single tag and the second used as a tag pair. Something like this:

{static:image:picture}
{static:image:picture}<img src="{$img.src}" width="{$img.width}" height="{$img.height}" />{/static:image:picture}

There need to be two separate matches. The first match would have only Group 1. The second match would have Group 1 and Group 3.

If anyone needs more information, please don't hesitate to ask. The CMS is built in PHP using the CakePHP framework.

Big kudos to anyone who can help me out :D!

A: 

Your syntax is too complicated for regular expressions. You need a context-free grammar. (Read up on the Chomsky hierarchy to understand why.)

I second the recommendation to use an existing template language (such as Smarty) rather than inventing your own.

Zack
Greetings Zack, thanks for your reply. I'm sorry, but I do not have the mental capacity to follow anything written in the article you pointed to. As for my syntax, it seems pretty simple to me. I've got the RegEx working; it just needs some fine tuning. As for using another template language, I'm not going to do that. I haven't found one I like, nor one that will fit nicely into how my CMS works. The one I've written works but just needs some fine tuning.
Cody Lundquist
I'll try to summarize: you showed several examples that hinge on tags being nested inside other tags' parameters or content. It is *mathematically impossible* for regular expressions to process a syntax that involves nesting. You may be able to get it working on toy examples but you will not be able to make it handle the general case. A context-free parser is the correct tool for this job. And it will make your performance problem go away, too. http://stackoverflow.com/questions/133601/can-regular-expressions-be-used-to-match-nested-patterns discusses the problem you're facing with less math.
Zack
I see what you are getting at but no, my template engine does not need to be recursive like you think. These tags are not allowed to be inside of one another. The only nesting is my conditional parser and those are working just fine. So you can't have something like: {static:input:foo param="{static:input:bar}"}.
Cody Lundquist
That's basically what I meant by "you may be able to get it working on toy examples". Technically, context-free parsers are only *necessary* if you have *unlimited* nesting; up to any fixed level you *can* do it with regexes. However, you are making life harder for yourself by insisting on regexes. Your second and third problems (where things don't match what you want them to) are trivial to solve in a context-free grammar, but enormously difficult in REs, and like I said, your performance problem will also vanish if you switch.
Zack
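For readers wondering what a non-backtracking approach could look like, here is a rough sketch. It is not a full context-free parser, only a single linear scan that tokenizes opening and closing tags and pairs them up with a per-name stack. It is written in Python for brevity, and the `TOKEN` pattern, the `scan` function and the dict layout are all assumptions made for this example. It also assumes parameters never contain braces.

```python
import re

# One token = one opening or closing static tag.
# group 1: "/" for a closing tag, group 2: name, group 3: parameters.
TOKEN = re.compile(r'\{(/?)static((?::\w+)+)(?:\s+([^{}]*))?\}')


def scan(text):
    """Return one dict per opening tag, in document order.

    'content' stays None for single tags and holds the enclosed text for pairs.
    """
    tags = []
    waiting = {}  # name -> stack of indexes into tags that have no close yet
    for m in TOKEN.finditer(text):
        closing, name, params = m.group(1), m.group(2), m.group(3)
        if not closing:
            tags.append({'name': name, 'params': params or '',
                         'end': m.end(), 'content': None})
            waiting.setdefault(name, []).append(len(tags) - 1)
        elif waiting.get(name):
            # A close pairs with the most recent unclosed opening of that name.
            tag = tags[waiting[name].pop()]
            tag['content'] = text[tag['end']:m.start()]
        # A closing tag with no matching opening is ignored.
    return tags


sample = ('{static:image:picture}\n'
          '{static:image:picture}<img src="{$img.src}" />{/static:image:picture}')
for tag in scan(sample):
    print(tag['name'], repr(tag['content']))
```

Because each closing tag pairs with the nearest unclosed opening of the same name, the first `{static:image:picture}` in the sample is reported as a single tag and the second as a pair.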
A: 

I've come up with a solution that is working very nicely for now. I go through and grab all the paired tags first, and then grab the single tags after that. I then use PHP to handle the recursive aspect of tags being inside other tags' content.
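A minimal sketch of that two-pass idea, written in Python rather than PHP to keep it short. The patterns, the `parse` function and the dict layout are assumptions for illustration and not the code actually used in the CMS. It also assumes parameters never contain braces.

```python
import re

NAME = r'((?::\w+)+)'         # group 1: colon-separated name, e.g. ":image:picture"
PARAMS = r'(?:\s+([^{}]*))?'  # group 2: optional parameters

# Pass 1: paired tags. The content may not cross another opening of the same tag.
PAIR = re.compile(
    r'\{static' + NAME + PARAMS + r'\}'
    r'((?:(?!\{static\1[\s}]).)*?)'
    r'\{/static\1\}',
    re.DOTALL,
)
# Pass 2: single tags.
SINGLE = re.compile(r'\{static' + NAME + PARAMS + r'\}')


def parse(text):
    """Return tags in document order; pair content is parsed recursively."""
    tags = []
    covered = []  # (start, end) spans already claimed by paired tags
    for m in PAIR.finditer(text):
        covered.append(m.span())
        tags.append({'pos': m.start(), 'name': m.group(1),
                     'params': m.group(2) or '', 'content': m.group(3),
                     'children': parse(m.group(3))})
    for m in SINGLE.finditer(text):
        if any(start <= m.start() < end for start, end in covered):
            continue  # part of a pair: its opening tag or something inside it
        tags.append({'pos': m.start(), 'name': m.group(1),
                     'params': m.group(2) or '', 'content': None,
                     'children': []})
    return sorted(tags, key=lambda tag: tag['pos'])


sample = ('{static:image:picture}\n'
          '{static:image:picture}<img src="{$img.src}" />{/static:image:picture}')
for tag in parse(sample):
    print(tag['pos'], tag['name'], repr(tag['content']))
```

On the sample, the first tag comes back as a single tag (content `None`) and the second as a pair, which is the "two separate matches" behaviour asked for in the question.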

The suggestion that dogmatic69 came up with might be a more complete fix further down the track.

Thank you all for your suggestions and possible solutions.

Cody Lundquist