preg_match_all('/<([a-z]*)\b[^>]*>(.*?)<\/\1>/i', $html, $matches);
Breaking down the expression
The first /
is the delimiter
the <
is the start of the tag, the very first <
the ([a-z]*)
captures the tag name, so for instance the strong in <strong
the \b[^>]*
says the tag name ends here (\b is a word boundary), then keep going over everything that is not a >, such as attributes
the >
matches the first > after that, which closes the opening tag
the (.*?)
says keep on looking and COLLECT ( .. ) the string inside, but because we have a ? the match is lazy:
it stops at the first place where the rest of the pattern (the closing tag) can match.
the <\/\1>
matches the closing tag, but only if its name is the same as the very first capture; this is done by \1 (the / needs a backslash because / is also the delimiter).
The value of \1 is whatever was found by ([a-z]*).
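To see the lazy ? and the \1 backreference at work, here is a small sketch. The sample strings and the variable names ($lazy, $greedy, $mismatched) are made up for illustration; only the pattern comes from the answer above.

```php
<?php
// Hypothetical sample input, not from the original post.
$html = '<b>one</b> and <b>two</b>';

// Lazy (.*?): each <b>...</b> pair is matched on its own.
preg_match_all('/<([a-z]*)\b[^>]*>(.*?)<\/\1>/i', $html, $lazy);

// Greedy (.*): the capture runs on to the LAST </b> in the string.
preg_match_all('/<([a-z]*)\b[^>]*>(.*)<\/\1>/i', $html, $greedy);

// \1 at work: </i> does not close <b>, so nothing matches here.
$mismatched = preg_match('/<([a-z]*)\b[^>]*>(.*?)<\/\1>/i', '<b>text</i>');

print_r($lazy[2]);    // contents captured by the lazy version
print_r($greedy[2]);  // contents captured by the greedy version
var_dump($mismatched);
```

The lazy version should give you the two contents separately, while the greedy one swallows everything between the first <b> and the last </b> as a single capture.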
You can then use preg_match_all to find all of them along with their contents; the array output would be something like
array(
0 => array of THE WHOLE TAGS
1 => array of TAG NAMES
2 => array of TAG VALUES
)
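If you want to check the shape for yourself, something like the sketch below should do. The sample HTML and the names $byGroup and $byTag are assumptions for illustration. By default preg_match_all groups the results per capture group; passing PREG_SET_ORDER gives one entry per tag found instead, each with indexes 0, 1 and 2.

```php
<?php
// Hypothetical sample input for illustration.
$html = '<b>bold</b> <em class="x">soft</em>';
$pattern = '/<([a-z]*)\b[^>]*>(.*?)<\/\1>/i';

// Default order (PREG_PATTERN_ORDER): one list per capture group.
// $byGroup[0] = whole tags, $byGroup[1] = tag names, $byGroup[2] = contents
preg_match_all($pattern, $html, $byGroup);

// PREG_SET_ORDER: one entry per tag found,
// e.g. $byTag[0] = array(whole tag, tag name, contents)
preg_match_all($pattern, $html, $byTag, PREG_SET_ORDER);

print_r($byGroup);
print_r($byTag);
```

Which order is handier depends on what you do next: the default is convenient when you only care about the tag names, PREG_SET_ORDER when you need the name and the contents of each tag together.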
Hope it helps :)
Example
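$allowed = array('b','strong','i','pre','code'); // WHITELIST, never blacklist
// $matches[1] holds the tag names found by preg_match_all above
foreach($matches[1] as $tagName)
{
    // lowercase first, because the /i flag also matches <B> or <Strong>
    if(!in_array(strtolower($tagName),$allowed))
    {
        echo sprintf('The tag %s is disallowed!',$tagName);
    }
}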
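Putting the two pieces together, a self-contained version could look roughly like this. The function name findDisallowedTags and the sample input are hypothetical, not part of the original answer; treat it as a sketch rather than a finished sanitizer.

```php
<?php
// Hypothetical helper, not part of the original answer.
function findDisallowedTags(string $html, array $allowed): array
{
    preg_match_all('/<([a-z]*)\b[^>]*>(.*?)<\/\1>/i', $html, $matches);
    // Lowercase the names because the /i flag also matches <B> or <Strong>.
    $names = array_map('strtolower', $matches[1]);
    // Keep each disallowed name once, reindexed from 0.
    return array_values(array_unique(array_diff($names, $allowed)));
}

$allowed = array('b', 'strong', 'i', 'pre', 'code'); // whitelist
$bad = findDisallowedTags('<b>ok</b> <script>alert(1)</script> <U>no</U>', $allowed);
foreach ($bad as $tag) {
    printf("The tag %s is disallowed!\n", $tag);
}
```

Keep in mind that a tag nested inside another matched tag ends up in the captured contents and is not checked by this sketch, so you would have to run the check again on the contents to catch those.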