tags:

views:

66

answers:

2

Here's a piece of code from the xss_clean method of the Input_Core class of the Kohana framework:

do
{
 // Remove really unwanted tags
 $old_data = $data;
 $data = preg_replace('#</*(?:applet|b(?:ase|gsound|link)|embed|frame(?:set)?|i(?:frame|layer)|l(?:ayer|ink)|meta|object|s(?:cript|tyle)|title|xml)[^>]*+>#i', '', $data);
}
while ($old_data !== $data);

Is the do ... while loop necessary? I would think that the preg_replace call would do all the work in just one iteration.

+3  A: 

Well, it's necessary if the replacement potentially creates new matches in the next iteration. It's not very wasteful because it's only and additional check at worst, though.

Going by the code it matches, it seems unlikely that it will create new matches by replacement, however: it's very strict about what it matches.

EDIT: To be more specific, it tries to match an opening angle bracket optionally followed by a slash followed by one of several keywords optionally followed by any number of symbols that are not a closing angle bracket and finally a closing angle bracket. If the input follows that syntax, it'll be swallowed whole. If it's malformed (e.g. multiple opening and closing angle brackets), it'll generate garbage until it can't find substrings matching the initial sequence anymore.

So, no. Unless you have code like <<iframe>iframe>, no repetition is necessary. But then you're dealing with a level of tag soup the regex isn't good enough for anyway (e.g. it will fail on < iframe> with the extra space).

EDIT2: It's also a bit odd that the pattern matches zero or more slashes at the beginning of the tag (it should be zero or one). And if my regex knowledge isn't too rusty, the final *+ doesn't make much sense either (the asterisk means zero or more, the plus means one or more, maybe it's a greedy syntax or something fancy like that?).

Alan
+2  A: 

On a completely unrelated subject, I would like to add a word on optimisation here.

preg_replace() can tell you whether a replacement has been made or not (see the 5th argument, which is passed by reference). It's far much efficient than comparing strings, especially if they are large.

Savageman