I'm having an issue with my regex.

I want to match <% some stuff %> and capture what's between the <% and the %>.

This regex works quite well for that.

$matches = preg_split("/<%[\s]*(.*?)[\s]*%>/i", $markup, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);

I also want to catch &lt;% some stuff %&gt;, so I need to match <% or &lt;% as the opener and %> or %&gt; as the closer.

If I add a second set of parentheses, preg_split behaves differently, because (as you can see from the PREG_SPLIT_DELIM_CAPTURE flag) I'm capturing what's inside the parentheses.

Preferably, it would only pair &lt; with &gt; and < with >, but that's not strictly necessary.

EDIT: The subject may contain multiple matches, and I need all of them.

+9  A: 

In your case, it's better to use preg_match with its third parameter and capturing parentheses:

preg_match("#((?:<|&lt;)%)([\s]*(?:[^ø]*)[\s]*?)(%(?:>|&gt;))#i",$markup, $out);
print_r($out);

Array
(
    [0] => <% your stuff %>
    [1] => <%
    [2] => your stuff
    [3] => %>
)

By the way, this online tool is very useful for debugging PHP regular expressions:

http://regex.larsolavtorvik.com/

EDIT : I hacked the regexp a bit so it's faster. Tested it, it works :-)

Now let's explain all that stuff :

  • preg_match stores everything it captures in the variable passed as the third parameter (here $out)
  • if preg_match matches something, the full match is stored in $out[0]
  • anything inside () but not (?:) in the pattern is stored in $out[1], $out[2], and so on, in order

The pattern in detail:

#((?:<|&lt;)%)([\s]*(?:[^ø]*)[\s]*?)(%(?:>|&gt;))#i can be viewed as ((?:<|&lt;)%) + ([\s]*(?:[^ø]*)[\s]*?) + (%(?:>|&gt;)).

((?:<|&lt;)%) captures < or &lt;, then %
(%(?:>|&gt;)) captures %, then > or &gt;
([\s]*(?:[^ø]*)[\s]*?) means 0 or more whitespace characters, then 0 or more characters that are not the ø symbol, then 0 or more whitespace characters.

Why do we use [^ø] instead of . ? A negated class such as [^ø] only has to check that the character is not ø, and unlike . it also matches newlines. Note that ø is an ordinary letter in Danish and Norwegian (the generic currency sign is ¤), so it can appear in real text. If that matters, replace it with chr(7), the BEL control character, which is very unlikely to be typed in a web page.

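EDIT2: I just read your edit about capturing all the matches. In that case, use preg_match_all the same way. Be aware that [^ø]* is greedy, so with several tags in the subject it runs from the first opener to the last closer; make it lazy ([^ø]*?) to get one match per tag.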
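To illustrate the difference between the greedy and lazy forms with preg_match_all, here is a minimal sketch. The sample $markup string and the lazy pattern are assumptions made for illustration, not part of the original answer.

```php
<?php
// Hypothetical sample input: one plain tag and one HTML-escaped tag.
$markup = 'a <% first %> b &lt;% second %&gt; c';

// Greedy pattern from the answer: [^ø]* runs on to the last closer.
$greedy = '#((?:<|&lt;)%)([\s]*(?:[^ø]*)[\s]*?)(%(?:>|&gt;))#i';

// Lazy variant (an assumption): .*? stops at the first closer, and the
// whitespace is matched outside the content group so it is not captured.
$lazy = '#((?:<|&lt;)%)\s*(.*?)\s*(%(?:>|&gt;))#is';

$greedyCount = preg_match_all($greedy, $markup, $greedyOut);
$lazyCount   = preg_match_all($lazy, $markup, $lazyOut);

// $greedyCount is 1 (a single match spanning both tags);
// $lazyCount is 2, and $lazyOut[2] holds "first" and "second".
print_r($lazyOut[2]);
```

With the default PREG_PATTERN_ORDER, $lazyOut[2] collects the second capture group of every match, which is the tag content here.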

e-satis
A: 

One possible solution is to use the extra parentheses anyway and discard the delimiter captures from the results, so you only use half of the returned elements.

This regex

$matches = preg_split("/(<|&lt;)%[\s]*(.*?)[\s]*%(>|&gt;)/i", $markup, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);

for input

Hi my name is <h1>Issac</h1><% some stuff %>here&lt;% more stuff %&gt;

output would be

Array(
 [0]=>Hi my name is <h1>Issac</h1>
 [1]=><
 [2]=>some stuff
 [3]=>>
 [4]=>here
 [5]=>&lt;
 [6]=>more stuff
 [7]=>&gt;
)

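This gives the desired results if I only use the even-numbered elements. That holds only as long as PREG_SPLIT_NO_EMPTY drops nothing: if the markup starts with a tag, or a tag is empty, the indexes shift.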
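An alternative that avoids filtering by index is to make the delimiter groups non-capturing, so that preg_split returns only the tag content as a delimiter capture. The following is a sketch under that assumption, reusing the input from above; the variable name $parts is illustrative.

```php
<?php
$markup = 'Hi my name is <h1>Issac</h1><% some stuff %>here&lt;% more stuff %&gt;';

// (?:...) groups the alternatives without capturing them, so only the
// content group (.*?) is returned as a delimiter capture.
$parts = preg_split(
    '/(?:<|&lt;)%\s*(.*?)\s*%(?:>|&gt;)/i',
    $markup,
    -1,
    PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE
);

print_r($parts);
```

For this input, $parts contains four elements: the leading text, "some stuff", "here", and "more stuff", which is the same shape the original single-delimiter regex produced.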

Issac Kelly
+1  A: 

Why are you using preg_split if what you really want is what matches inside the parentheses? Seems like it would be simpler to just use preg_match.

It's often an issue with regex that parens are used both for grouping your logic and for capturing patterns.

According to the PHP doc on regex syntax,

The fact that plain parentheses fulfil two functions is not always helpful. There are often times when a grouping subpattern is required without a capturing requirement. If an opening parenthesis is followed by "?:", the subpattern does not do any capturing, and is not counted when computing the number of any subsequent capturing subpatterns.
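As a small illustration of the quoted paragraph, the sketch below runs the same subject through a capturing and a non-capturing version of the pattern. The subject string and variable names are assumptions chosen for the example.

```php
<?php
$subject = '&lt;% some stuff %&gt;';

// Plain parentheses around the delimiters capture them too,
// so the tag content ends up in group 2.
preg_match('/(<|&lt;)%\s*(.*?)\s*%(>|&gt;)/', $subject, $capturing);

// With (?:...) the delimiters are only grouped, not captured or counted,
// so the tag content is group 1.
preg_match('/(?:<|&lt;)%\s*(.*?)\s*%(?:>|&gt;)/', $subject, $grouping);

print_r($capturing); // full match, "&lt;", "some stuff", "&gt;"
print_r($grouping);  // full match, "some stuff"
```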

Tegan Mulholland
+2  A: 
<?php
$code = 'Here is a <% test %> and &lt;% another test %&gt; for you';
preg_match_all('/(<|&lt;)%\s*(.*?)\s*%(>|&gt;)/', $code, $matches);
print_r($matches[2]);
?>

Result:

Array
(
    [0] => test
    [1] => another test
)
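
The question also asks, as a preference, that < only pairs with > and &lt; only with &gt;. One way to sketch that is to spell out each pair as its own alternative and use PCRE's branch-reset group (?|...) so the content is group 1 in both. This variant and its sample input are assumptions, not part of the answer above.

```php
<?php
// Hypothetical input: two well-paired tags and one mixed pair at the end.
$code = 'Here is a <% test %> and &lt;% another test %&gt; but not <% mixed %&gt;';

// (?|...) resets group numbering per alternative, so (.*?) is group 1
// whether the plain or the escaped delimiters matched.
$pattern = '/(?|<%\s*(.*?)\s*%>|&lt;%\s*(.*?)\s*%&gt;)/s';

preg_match_all($pattern, $code, $matches);

print_r($matches[1]); // "test" and "another test"; the mixed pair is skipped
```

One caveat: a mixed opener that appears before a well-formed tag can still pair with that later tag's closer, because .*? keeps scanning until it finds a matching closer.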
_Lasar
+1  A: 

If you want all the matches, give preg_match_all a shot with a regular expression like this:

preg_match_all('/((\<\%)(\s)(.*?)(\s)(\%>))/i', '<% wtf %> <% sadfdsafds %>', $result);

This captures just about everything under the sun. You can add or remove parentheses to capture more or less:

Array
(
    [0] => Array
        (
            [0] => <% wtf %>
            [1] => <% sadfdsafds %>
        )

    [1] => Array
        (
            [0] => <% wtf %>
            [1] => <% sadfdsafds %>
        )

    [2] => Array
        (
            [0] => <%
            [1] => <%
        )

    [3] => Array
        (
            [0] =>  
            [1] =>  
        )

    [4] => Array
        (
            [0] => wtf
            [1] => sadfdsafds
        )

    [5] => Array
        (
            [0] =>  
            [1] =>  
        )

    [6] => Array
        (
            [0] => %>
            [1] => %>
        )

)
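
Going the other way, here is a sketch with the parentheses pared down to a single group. Naming the group with (?<content>...) is an assumption added for readability; it makes the interesting part available under a string key as well as under index 1.

```php
<?php
// Only the tag content is captured; the name "content" is illustrative.
preg_match_all(
    '/<%\s*(?<content>.*?)\s*%>/',
    '<% wtf %> <% sadfdsafds %>',
    $result
);

print_r($result['content']); // "wtf" and "sadfdsafds"
```

Named groups are still numbered, so $result[1] holds the same values as $result['content'].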