views:

2305

answers:

3

I have a CMS that uses a syntax based on HTML comments to let the user insert flash video players, slideshows, and other 'hard' code that the user could not easily write.

The syntax for one FLV movies looks like this: <!--PLAYER=filename.flv-->

I use this code:

$find_players = preg_match("/<!--PLAYER\=(.*)-->/si", $html_content, $match);

This works great if there is only one player, $match[1] contains the filename (which is all I need)

My knowledge of regex is vanishing, so I'm not able to adjust this to grab more than one match.

If there are more on the page, it breaks totally, because it matches too greedily (from the first <!--PLAYER to the last -->

+1  A: 
$find_players = preg_match("/<!--PLAYER\=(.*?)-->/i", $html_content, $match);

(.*?)

should work just fine.

Stevens
The 'm' (multiline) flag is not needed here; it changes the meaning of the ^ and $ metacharacters, which aren't being used. It's the 's' flag that allows the dot to match line separators.
Alan Moore
s and m modifiers should not be needed here.
OIS
+2  A: 

You probably want the regex modifier U (PCRE_UNGREEDY, to match ungreedily). This will fetch the shortest possible match, meaning that you won't match from the beginning of the first <!--PLAYER= to the end of the last -->

An abbreviated example:

<?php
$text = "blah\n<!-x=abc->blah<!-x=def->blah\n\nblah<!-x=ghi->\nblahblah" ;
$reg  = "/<!-x=(.*)->/U" ;
preg_match_all( $reg, $text, $matches ) ;
print_r( $matches ) ;

Your code then becomes:

$find_players = preg_match_all("/<!--PLAYER=(.*)-->/Ui", $html_content, $matches);
// print $matches[1] ;

The 's' modifier (PCRE_DOTALL) you're using probably isn't helpful, either; you're unlikely to have a filename with a linebreak in it.

EDIT: @Stevens suggests this syntax, which I agree is slightly clearer - moving the U modifier to the capturing parentheses.

$find_players = preg_match_all("/<!--PLAYER=(?U)(.*)-->/i", $html_content, $matches);
Chris Burgess
+1, but why are you escaping the equals sign?
PEZ
doh! copied it from the question. fixed - thx.
Chris Burgess
I think its bad style to code the non-greedy with a modifier...
Stevens
good suggestion Stevens, have edited accordingly
Chris Burgess
wonderful, worked perfectly, thanks!
Buzz
or just /<!--PLAYER=(.*?)-->/ . The question mark after the asterisk reverses the greediness.
monzee
+1  A: 

When working with regular expressions, it's typically more performant to use a more specific expression rather than a "lazy dot", which generally causes excessive backtracking. You can use a negative lookahead to achieve the same results without overburdening the regex engine:

$find_players = preg_match("/<!--PLAYER=((?:[^-]+|-(?!->))*)-->/ig", $html_content, $match);

Mind you, it's unlikely that using the lazy dot will cause noticeable problems with a simple case like this, but it's a good habit to always tell the regex engine exactly what you mean. In this case, you want to collect as many characters as possible ("greedy") without passing a comment terminator. A terminator is a dash followed by another dash and a greater-than sign. So, we allow any number of any character except dash or dashes that don't start a comment terminator.

Ben Blank