There is a mistake in this code, but I could not find it. Which character am I missing?
preg_replace(/<(?!\/?(?:'.implode('|',$white).'))[^\s>]+(?:\s(?:(["''])(?:\\\1|[^\1])*?\1|[^>])*)?>/','',$html);
It looks like among other things you're missing a single quote:
preg_replace('/<(?!\/?(?:' . implode('|',$white) . '))[...
             ^
             here!
Also, since the pattern itself contains single quotes, each of those has to be escaped with a preceding backslash.
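To make the quoting rules concrete, here is a minimal sketch. The whitelist, the sample input, and the simplified tag pattern (it ignores quoted attribute values) are assumptions for illustration, not the asker's real data.

```php
<?php
// Hypothetical whitelist and input, for illustration only.
$white = ['b', 'i'];
$html  = '<p>Hello <b>bold</b> <span>text</span></p>';

// The pattern is an ordinary PHP string, so it needs an opening quote
// before the "/" delimiter. That quote is what the original call lacks.
$tagPattern = '/<(?!\/?(?:' . implode('|', $white) . '))[^>]*>/';
$stripped   = preg_replace($tagPattern, '', $html);
echo $stripped, "\n";

// Inside a single-quoted PHP string, a literal single quote is written as \'.
$quotePattern = '/(["\'])[^"\']*\1/';
preg_match($quotePattern, "say 'hi' now", $m);
echo $m[0], "\n";
```

The second pattern only shows how the quote is escaped; it does not handle backslash-escaped quotes inside the quoted text.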
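Alternatively, you could use heredoc syntax; quotes in the pattern then need no escaping, and variables can be embedded for expansion. Function calls such as implode() are not expanded, so assign the result to a variable first, and note that backslash escapes are still processed as in a double-quoted string.
$pattern = <<<EOD
/pattern{$embeddedVariable}morePattern/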
EOD;
... preg_replace($pattern, ...)
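A short sketch of the heredoc approach, assuming a hypothetical whitelist and simplified patterns. Heredoc text is processed like a double-quoted string, so backslashes are doubled here; a bare \1 would be read by PHP as an octal escape before the regex engine ever sees it.

```php
<?php
$white = ['b', 'i'];                  // hypothetical whitelist
$alternation = implode('|', $white);  // a variable is interpolated; a function call is not

// "~" is used as the delimiter so "/" needs no escaping.
$tagPattern = <<<EOD
~<(?!/?(?:{$alternation})\\b)[^>]*>~
EOD;

// Quotes need no escaping in a heredoc, but the backreference is written \\1.
$quotePattern = <<<EOD
~(["'])[^"']*\\1~
EOD;

$stripped = preg_replace($tagPattern, '', '<p>Hello <b>bold</b> <body>x</body></p>');
echo $stripped, "\n";

preg_match($quotePattern, 'say "hi" now', $m);
echo $m[0], "\n";
```

If no interpolation is needed, nowdoc syntax (<<<'EOD') leaves both quotes and backslashes untouched.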
Do yourself a favor and use DOM and XPath instead of regex to parse HTML to avoid problems.
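For what it's worth, a DOM-based version might look roughly like the sketch below. The function name, the wrapper div, and the decision to unwrap disallowed elements (keeping their text) are all assumptions of mine; it also leaves attributes on whitelisted tags untouched, which a real sanitizer would need to deal with.

```php
<?php
// Hypothetical helper: keep whitelisted elements, unwrap everything else.
function stripTagsWithDom(string $html, array $white): string
{
    libxml_use_internal_errors(true);  // collect parser warnings instead of printing them
    $doc = new DOMDocument();
    // Wrap the fragment so there is a known root element to serialize from.
    $doc->loadHTML('<div id="root">' . $html . '</div>');
    $xpath = new DOMXPath($doc);
    $root  = $xpath->query('//div[@id="root"]')->item(0);

    // Copy the result into an array so the loop is not affected by edits to the tree.
    foreach (iterator_to_array($xpath->query('.//*', $root)) as $node) {
        if (in_array(strtolower($node->nodeName), $white, true)) {
            continue;
        }
        // Move the children up in front of the element, then drop the empty element.
        while ($node->firstChild) {
            $node->parentNode->insertBefore($node->firstChild, $node);
        }
        $node->parentNode->removeChild($node);
    }

    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$clean = stripTagsWithDom('<p>Hello <b>bold</b> <span>text</span></p>', ['b', 'i']);
echo $clean, "\n";
```

The whitelist is expected in lowercase here, since the parser reports HTML element names that way.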
Well, this part is wrong:
(["'])(?:\\\1|[^\1])*?\1
That's supposed to match a sequence enclosed in single or double quotes, possibly including backslash-escaped quotes. But it won't work because backreferences don't work in character classes. The \1 is treated as the number 1 in octal notation, so [^\1] matches any character except U+0001.
If it seems to work most of the time, it's because of the reluctant quantifier (*?). The first alternative in (?:\\\1|[^\1])*? correctly consumes an escaped quote, but otherwise it just matches any character, reluctantly, until it sees an unescaped quote. It works okay on well-formed text, but toss in an extra quote and it goes haywire.
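This is easy to check in isolation. The sketch below uses made-up test strings and anchors the fragment with ^ and $ (my addition) so the regex is forced to keep trying after the first closing quote, which is roughly what happens when the rest of a larger pattern fails.

```php
<?php
// Inside a character class, \1 is an octal escape for chr(1), not a backreference.
$acceptsQuote = preg_match('/^[^\1]$/', '"');     // 1: the quote itself is allowed
$acceptsCtrl  = preg_match('/^[^\1]$/', "\x01");  // 0: only U+0001 is excluded

// Backslashes are doubled for the PHP string; the regex engine sees \\\1.
$broken  = '/^(["\'])(?:\\\\\1|[^\1])*?\1$/';
// The class happily consumes the closing quote, so two separate quoted
// strings are accepted as if they were one.
$overrun = preg_match($broken, '"a" "b"');        // 1

var_dump($acceptsQuote, $acceptsCtrl, $overrun);
```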
The correct way to match "anything except what group #1 captured" is (?:(?!\1).)* which consumes one character at a time, but only after the lookahead confirms that it's not the first character of the captured text. But I think you'll be better off dealing with each kind of quote separately; this regex is complicated enough as it is.
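A quick way to see the lookahead form behave, again with ^ and $ added by me and with invented test strings. This sketch leaves out backslash-escape handling to keep the point visible.

```php
<?php
// Each character is consumed only if it is not the quote captured in group 1.
$fixed = '/^(["\'])(?:(?!\1).)*\1$/';

$twoStrings = preg_match($fixed, '"a" "b"');  // 0: cannot run past the closing quote
$otherQuote = preg_match($fixed, '"it\'s"');  // 1: the other kind of quote is ordinary text

var_dump($twoStrings, $otherQuote);
```

The rewritten pattern that handles each kind of quote separately follows.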
'~<(?!/?+(?:'.implode('|',$white).')\b)[^\s>]++(?:\s++'.
'(?:[^\'">]++|"(?:[^"\\\\]++|\\\\")*+"|\'(?:[^\'\\\\]++|\\\\\')*+\')*+)?+>~'
Notice the addition of the \b (word boundary) after the whitelist alternation. Without that, if you have (for example) <B> in your list, you'll unintentionally whitelist <BODY> and <BLOCKQUOTE> tags as well.
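Here is a small sketch showing the effect of the word boundary. The helper function, its $withBoundary switch, and the sample HTML are hypothetical; the pattern is the one above, with each regex-level backslash doubled because it sits in a single-quoted PHP string, and with lowercase names since no case-insensitive flag is set.

```php
<?php
// Hypothetical helper that builds the pattern with or without the \b.
function buildStripPattern(array $white, bool $withBoundary = true): string
{
    return '~<(?!/?+(?:' . implode('|', $white) . ')' . ($withBoundary ? '\b' : '') . ')[^\s>]++(?:\s++'
         . '(?:[^\'">]++|"(?:[^"\\\\]++|\\\\")*+"|\'(?:[^\'\\\\]++|\\\\\')*+\')*+)?+>~';
}

$html = '<body><b>bold</b><blockquote cite="a>b">quote</blockquote></body>';

// With \b, only <b> survives; the ">" inside the quoted attribute is handled too.
$strict = preg_replace(buildStripPattern(['b']), '', $html);
// Without \b, every tag whose name starts with "b" slips through.
$loose  = preg_replace(buildStripPattern(['b'], false), '', $html);

echo $strict, "\n", $loose, "\n";
```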
I also used possessive quantifiers (*+, ++, ?+) everywhere, because the way this regex is written, I know backtracking will never be useful. If it's going to fail, I want it to fail as quickly as possible.
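If possessive quantifiers are new to you, this toy comparison (my own example, not part of the tag regex) shows the difference: a possessive quantifier never gives back what it has matched.

```php
<?php
// Greedy: \d+ gives back one digit so the trailing \d can match.
$greedy     = preg_match('/^\d+\d$/', '12345');   // 1
// Possessive: \d++ keeps every digit, so the trailing \d has nothing left.
$possessive = preg_match('/^\d++\d$/', '12345');  // 0

// In the tag regex nothing is lost: what follows [^\s>]++ is whitespace or ">",
// neither of which the class could have consumed.
$tag = preg_match('/<[^\s>]++>/', '<b>');         // 1

var_dump($greedy, $possessive, $tag);
```

That is why it is safe here: no token after a possessive group could ever match something the group swallowed, so giving characters back would only waste time on a failing match.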
Now that I've told you how to get the regex to work, let me urge you not to use it. This job is too complex and too important to be done with such a poorly suited tool as regex. And if you really got that regex from a book on PHP security, I suggest you get your money back.