Hello

I'm new to pattern matching, having finally figured it out. I am stuck trying to find an approach to the following problem.

I need to return a match (with PHP's preg_match) if any of a number of HTML tags are present:

<p></p>
<br>
<h1></h1>
<h2></h2>

And return no match otherwise. So anything not in the above list fails, e.g.:

<script></script>
<table></table>

etc.

...And ideally I want to operate a white list of safe tags if possible.

Anyone know a pattern that I can use/adapt?

+5  A: 

Even though this is not the usual "I want to parse HTML with regular expressions" situation, I would recommend using a DOM parser nevertheless, walk through each element, and abort if it is not in the list of allowed elements.

See e.g. this question to get started.

It could become almost a one-liner using a DOM parser extension like phpQuery if it supports the :not selector and multiple tag names - I don't know, have never worked with it myself, but it will be easy to find out. Basic examples are here.
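A minimal sketch of that DOM walk, assuming PHP's bundled DOM extension (DOMDocument). The helper name containsOnlyAllowedTags and the sample whitelist are made up for illustration and do not come from the linked question. It also assumes the input is a fragment, so the html, head and body elements that loadHTML adds around it are skipped; libxml may add other elements of its own when repairing markup.

```php
<?php
// Hypothetical helper (not from the original answer): returns true only if
// every element found in $html is on the $allowed whitelist.
function containsOnlyAllowedTags(string $html, array $allowed): bool
{
    if (trim($html) === '') {
        return true; // nothing to check; loadHTML rejects empty input
    }

    $doc = new DOMDocument();
    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Wrapper elements added by loadHTML around a fragment (assumption:
    // the input is a fragment, not a full document).
    $wrappers = array('html', 'head', 'body');

    foreach ($doc->getElementsByTagName('*') as $element) {
        $name = strtolower($element->nodeName);
        if (in_array($name, $wrappers, true)) {
            continue;
        }
        if (!in_array($name, $allowed, true)) {
            return false; // abort on the first element not in the list
        }
    }
    return true;
}

$allowed = array('p', 'br', 'h1', 'h2');
var_dump(containsOnlyAllowedTags('<p>Hello</p><br><h2>Title</h2>', $allowed));
var_dump(containsOnlyAllowedTags('<p>Hello</p><table></table>', $allowed));
```

This only checks element names. Attributes are ignored here, so on its own it is not a safety filter.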

Pekka
A: 
preg_match_all('/<([a-z]*)\b[^>]*>(.*?)<\/\1>/i', $html, $matches);

Breaking down the expression:

The first / is the delimiter.

The < matches the start of the tag, the very first <.

The ([a-z]*) captures the tag name, for instance strong in <strong>. Note that [a-z] does not match digits, so names such as h1 need [a-z][a-z0-9]* instead.

The \b[^>]* asserts a word boundary at the end of the tag name, then consumes any attributes, i.e. everything up to (but not including) the next >.

The > matches the end of the opening tag, the very first > after the name.

The (.*?) captures the content inside the tag. Because of the ?, the match is lazy: it stops at the first point where the rest of the pattern, the closing tag, can match.

The </\1> matches the closing tag, but only if its name is the same as the first captured group; \1 is a backreference to whatever ([a-z]*) matched. The slash has to be escaped as \/ inside the pattern because / is the delimiter.

You can then use preg_match_all to find all such tags with their contents. The array output would be something like:

array(
    0 => array(THE WHOLE TAG, ...)
    1 => array(TAG NAME, ...)
    2 => array(TAG VALUE, ...)
)
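As a rough illustration, here is the call run against two invented sample strings, with the slash in the closing tag escaped because / is the delimiter. The second sample shows a limitation worth knowing: a tag nested inside another matched tag ends up in the contents and is not reported separately.

```php
<?php
// Both sample inputs are invented for illustration.
$pattern = '/<([a-z]*)\b[^>]*>(.*?)<\/\1>/i';

// Flat markup: each tag pair is found on its own.
$flat = '<b>bold</b> and <script type="text/javascript">alert(1);</script>';
preg_match_all($pattern, $flat, $matches);

// Default flag is PREG_PATTERN_ORDER:
// $matches[0] = whole tags, $matches[1] = tag names, $matches[2] = contents.
print_r($matches[1]);
print_r($matches[2]);

// Nested markup: the outer <p> swallows the <script> as its contents,
// so only the outer tag name is captured.
$nested = '<p><script>x</script></p>';
preg_match_all($pattern, $nested, $nestedMatches);
print_r($nestedMatches[1]);
```

To catch nested tags with this approach, the captured contents would have to be scanned again, which is one reason a DOM parser is the sturdier choice.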

Hope it helps :)

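Example

// Whitelist, never blacklist.
$allowed = array('b', 'strong', 'i', 'pre', 'code');
// $matches[1] holds the captured tag names (default PREG_PATTERN_ORDER).
foreach ($matches[1] as $tag)
{
    // Tag names are case-insensitive, so normalise before comparing.
    if (!in_array(strtolower($tag), $allowed, true))
    {
        echo sprintf('The tag %s is disallowed!', $tag);
    }
}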
RobertPitt
So that would return all the tags in $html, that I could then check for unwanted tags?
YsoL8
Yeah, I'll update with an example.
RobertPitt
Thanks!! Looks straightforward.
YsoL8
Also, do take note of @Pekka's answer, as using a DOM parser will give you more stability.
RobertPitt
-1 HTML is not regular... It may work, but it's not "right"...
ircmaxell
For example, what will happen if you pass `<script type="text/javascript"<b>>alert('foo');</script</b>>`?
ircmaxell
The OP originally asked specifically for preg_match. If he had been looking for alternative ways, I would have written a comment explaining the DOM, e.g. Simple DOM. I also told the OP to take note of Pekka's response, as the DOM would give him more stability. Please read thoroughly before downvoting.
RobertPitt
Why are people downvoting? It was accepted!
RobertPitt
+2  A: 

Regex is utterly unsuited to checking HTML for ‘safe’ tags. Not only that, but there are no safe tags in HTML. Any element can be given attributes that permit script injection (e.g. onclick, or style with IE's expression()). You must check every attribute as well as every element.

When your security is at stake, you absolutely need a real HTML parser for this (then you filter elements/attributes and serialise the results). There are so many ways to evade regex-based checks it's not even funny.

You can use DOMDocument::loadHTML followed by a DOM walk to do this, or you could use an existing library such as HTML Purifier.
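A hedged sketch of such a walk that checks attributes as well as elements, assuming PHP's bundled DOM extension. The helper name findViolations and the shape of the whitelist (tag name mapped to permitted attribute names) are assumptions made for this example. It only reports problems; it does not filter or serialise anything, and it does not inspect attribute values, so something like a javascript: URL in an allowed href would still get through.

```php
<?php
// Hypothetical helper: lists every element or attribute not on the whitelist.
// $allowed maps a tag name to the attribute names permitted on that tag.
function findViolations(string $html, array $allowed): array
{
    $violations = array();
    if (trim($html) === '') {
        return $violations; // loadHTML rejects empty input
    }

    $doc = new DOMDocument();
    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('*') as $element) {
        $tag = strtolower($element->nodeName);
        if (in_array($tag, array('html', 'head', 'body'), true)) {
            continue; // wrappers added by loadHTML (assumes fragment input)
        }
        if (!array_key_exists($tag, $allowed)) {
            $violations[] = "element <$tag>";
            continue;
        }
        // Every attribute must be explicitly permitted for this tag.
        foreach ($element->attributes as $attribute) {
            $name = strtolower($attribute->nodeName);
            if (!in_array($name, $allowed[$tag], true)) {
                $violations[] = "attribute $name on <$tag>";
            }
        }
    }
    return $violations;
}

$allowed = array('p' => array(), 'a' => array('href'), 'br' => array());
$input = '<p onclick="evil()">Hi <a href="/x">link</a></p><script>x</script>';
print_r(findViolations($input, $allowed));
```

For real filtering, a maintained library such as HTML Purifier covers far more cases than a sketch like this.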

bobince