ansaurus

Question

Answer 1

+5 A:

Even though this is not the usual "I want to parse HTML with regular expressions" situation, I would nevertheless recommend using a DOM parser: walk through each element, and abort if it is not in the list of allowed elements.

See e.g. this question to get started.

It could become almost a one-liner using a DOM parser extension like phpQuery if it supports the :not selector and multiple tag names - I don't know, have never worked with it myself, but it will be easy to find out. Basic examples are here.
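A minimal sketch of that DOM walk, assuming PHP's bundled DOM extension is available. The function name `findDisallowedTags` and the wrapper `<div>` are illustrative choices, not something taken from the question. The wrapper keeps the parser from inventing a `<p>` around leading text and from moving a leading `<script>` into `<head>`:

```php
<?php
// Hypothetical helper: returns the names of all elements in $html
// that are not on the $allowed whitelist.
function findDisallowedTags(string $html, array $allowed): array
{
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of printing them;
    // user-supplied markup is often malformed.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Our wrapper is the first <div> in document order.
    $wrapper = $doc->getElementsByTagName('div')->item(0);

    $disallowed = [];
    foreach ($doc->getElementsByTagName('*') as $element) {
        $name = strtolower($element->nodeName);

        // Skip the structure added by us and by the parser.
        if ($element->isSameNode($wrapper)
            || in_array($name, ['html', 'head', 'body'], true)) {
            continue;
        }
        if (!in_array($name, $allowed, true)) {
            $disallowed[$name] = true; // array key de-duplicates names
        }
    }
    return array_keys($disallowed);
}

$bad = findDisallowedTags('<b>bold</b> <script>alert(1)</script>', ['b', 'i']);
if ($bad !== []) {
    echo 'Disallowed tags: ' . implode(', ', $bad) . "\n";
}
```

Note that this sketch only looks at element names. A user-supplied `<html>`, `<head>` or `<body>` tag is indistinguishable from the implied ones and is skipped here.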

Pekka 2010-08-27 09:52:00

Answer 2

A:

preg_match_all('/<([a-z]*)\b[^>]*>(.*?)<\/\1>/i', $html, $matches, PREG_SET_ORDER);

Breaking down the expression

The first / is the delimiter (which is why the / inside the closing tag has to be escaped as \/).

the < is the start of the tag, the very first <

the ([a-z]*) matches and captures the tag name, so for instance the strong in <strong

the \b[^>]* says the tag name must end at a word boundary, then skip over anything that is not a >, i.e. the attributes

the > matches the first > found, which closes the opening tag

the (.*?) says keep on looking and COLLECT ( .. ) the string inside, but because we have a ? it is lazy: it stops at the first place where the rest of the pattern (the closing tag) can match.

the </\1> part says match a closing tag, but only if its name is the same as the very first capture; this is done by \1, whose value is whatever was found with ([a-z]*).

then you can use preg_match_all to find all of them with their contents; for each tag found, the output would be something like

array(
    0 => THE WHOLE TAG
    1 => TAG NAME
    2 => TAG VALUE
)
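To make that array shape concrete, here is a small runnable sketch. It assumes the slash in the closing tag is escaped and that the `PREG_SET_ORDER` flag is passed, so each entry of `$matches` describes one tag. The helper name `extractTags` is hypothetical.

```php
<?php
// Hypothetical helper wrapping the pattern discussed above.
function extractTags(string $html): array
{
    preg_match_all(
        '/<([a-z]*)\b[^>]*>(.*?)<\/\1>/i',
        $html,
        $matches,
        PREG_SET_ORDER // one sub-array per matched tag
    );
    return $matches;
}

foreach (extractTags('<b>bold</b> and <script>alert(1)</script>') as $match) {
    // $match[0] = whole tag, $match[1] = tag name, $match[2] = content
    printf("%s | %s | %s\n", $match[0], $match[1], $match[2]);
}
// prints:
// <b>bold</b> | b | bold
// <script>alert(1)</script> | script | alert(1)

// Caveat: matches never overlap, so a tag nested inside another match
// is not reported on its own.
$nested = extractTags('<b><i>x</i></b>');
// $nested holds a single entry for "b"; the inner "i" only shows up
// as part of $nested[0][2].
```

The last case is one reason the pattern alone cannot serve as a complete check: nested tags would have to be re-scanned inside each captured content string.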

Hope it helps :)

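Example

// WHITELIST, never blacklist
$allowed = array('b','strong','i','pre','code');
// $matches must come from preg_match_all(..., PREG_SET_ORDER),
// so that each $match describes one tag
foreach($matches as $match)
{
    // compare case-insensitively, since the pattern uses the /i flag
    if(!in_array(strtolower($match[1]),$allowed))
    {
        echo sprintf('The tag %s is disallowed!',$match[1]);
    }
}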

RobertPitt 2010-08-27 10:12:13

So that would return all the tags in $html, that I could then check for unwanted tags?

YsoL8 2010-08-27 10:24:57

Yeah, I'll update with an example.

RobertPitt 2010-08-27 10:40:07

Thanks!! Looks straightforward.

YsoL8 2010-08-27 10:45:35

Also do take note of @pekka's comment, as using a DOM parser will give you more stability.

RobertPitt 2010-08-27 10:49:13

-1 HTML is not regular... It may work, but it's not "right"...

ircmaxell 2010-08-27 10:58:39

For example, what will happen if you pass `<script type="text/javascript"<b>>alert('foo');</script</b>>`?

ircmaxell 2010-08-27 11:08:02

The OP originally asked specifically for preg_match. If he had been looking for alternative ways, I would have produced a comment explaining DOM, e.g. Simple DOM. I also told the OP to take note of Pekka's response, as the DOM would give him more stability. Please read thoroughly before downvoting...

RobertPitt 2010-08-27 11:17:37

Why are people downvoting? It was accepted!

RobertPitt 2010-08-27 18:48:46

Answer 3

+2 A:

Regex is utterly unsuited to checking HTML for ‘safe’ tags. Not only that, but there are no safe tags in HTML. Any element can be given attributes that permit script injection (eg. onclick, style-with-IE-expression()...). You must check every attribute as well as every element.

When your security is at stake, you absolutely need a real HTML parser for this (then you filter elements/attributes and serialise the results). There are so many ways to evade regex-based checks it's not even funny.

You can use DOMDocument::loadHTML followed by a DOM walk to do this, or you could use an existing library such as htmlpurifier.
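A rough sketch of the DOMDocument::loadHTML route, filtering both elements and attributes and then serialising the result. The function name `filterHtml` and the shape of the `$allowed` map (tag name to list of permitted attribute names) are assumptions made for illustration. Attribute values are not inspected here, so something like a `javascript:` URL in an allowed `href` would still pass. For real use, a maintained library such as htmlpurifier is the safer choice.

```php
<?php
// Hypothetical helper: keeps only whitelisted elements and attributes.
// $allowed maps a tag name to the attribute names permitted on it,
// e.g. ['b' => [], 'a' => ['href']].
function filterHtml(string $html, array $allowed): string
{
    $doc = new DOMDocument();

    $previous = libxml_use_internal_errors(true);
    // Wrap the fragment so all user content sits under one known node.
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $wrapper = $doc->getElementsByTagName('div')->item(0);

    // Snapshot the live node list before modifying the tree.
    $elements = iterator_to_array($wrapper->getElementsByTagName('*'), false);
    foreach ($elements as $element) {
        $name = strtolower($element->nodeName);

        if (!array_key_exists($name, $allowed)) {
            // Drop the element together with its content.
            if ($element->parentNode !== null) {
                $element->parentNode->removeChild($element);
            }
            continue;
        }
        foreach (iterator_to_array($element->attributes, false) as $attr) {
            if (!in_array(strtolower($attr->nodeName), $allowed[$name], true)) {
                $element->removeAttribute($attr->nodeName);
            }
        }
    }

    // Serialise only what is left inside the wrapper; anything the
    // parser placed outside it (e.g. after a stray </div>) is dropped.
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo filterHtml(
    '<b onclick="evil()">hi</b><script>alert(1)</script>',
    ['b' => [], 'a' => ['href']]
);
```

Removing a disallowed element with its content is a deliberately strict choice. loadHTML also assumes ISO-8859-1 unless told otherwise, so non-ASCII input would need an explicit encoding hint.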

bobince 2010-08-27 12:05:11


Pattern matching html tags
