ansaurus

Question

RegEx: HTML whitelist

Answer 1

A:

Assuming PCRE, use (?!elements) instead of (elements).

chaos 2009-02-05 04:51:14

Halfway there, but it still matches the closing tag. For example, <strong>test</strong> returns <strong>test

Zurahn 2009-02-05 06:34:10
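For readers following along, here is a minimal sketch of what this comment describes. The asker's original pattern is not shown on this page, so both patterns below are assumptions: a negative lookahead placed directly after the opening bracket, written for Python's re module rather than PCRE (the lookahead syntax is the same in both).

```python
import re

# Assumed pattern (the asker's real one is not shown): refuse to match
# when "strong" or "em" follows the "<" immediately.
NAIVE = re.compile(r"<(?!strong|em)[^>]*>")

# Variant that lets the lookahead skip an optional "/" first, so that
# closing tags of the allowed elements are left alone as well.
WITH_SLASH = re.compile(r"<(?!/?(?:strong|em))[^>]*>")

sample = "<strong>test</strong>"
naive_result = NAIVE.sub("", sample)       # closing tag is stripped
slash_result = WITH_SLASH.sub("", sample)  # both tags survive
```

With the first pattern, the lookahead in </strong> sees "/" rather than "strong", so the closing tag is matched and removed. That leaves <strong>test, which is the result reported above.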

Answer 2

+3 A:

Do NOT try parsing HTML with regular expressions.

Use a real parser instead.

grom 2009-02-05 05:49:48

This is not meant as an actual production implementation, but as a learning experiment.

Zurahn 2009-02-05 05:55:08

Fair enough, I just had to post this as a warning to others.

grom 2009-02-05 06:34:06
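As a rough illustration of the "real parser" route recommended in this answer, the sketch below uses Python's standard html.parser module. The names ALLOWED_TAGS, WhitelistFilter and whitelist_html are hypothetical, and this is a learning aid under those assumptions, not a vetted sanitizer: it keeps only the listed tags, drops every attribute, escapes text, and makes no attempt to balance unclosed tags.

```python
from html import escape
from html.parser import HTMLParser

ALLOWED_TAGS = {"strong", "em"}  # hypothetical whitelist for this sketch


class WhitelistFilter(HTMLParser):
    """Re-emit only whitelisted tags; all attributes are discarded."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED_TAGS:
            self.parts.append(f"<{tag}>")  # attrs intentionally ignored

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Text is re-escaped so stray "<" or "&" cannot form new markup.
        self.parts.append(escape(data))


def whitelist_html(html):
    parser = WhitelistFilter()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)
```

For instance, whitelist_html('<b>x</b><em>y</em>') yields 'x<em>y</em>'. Because the parser tokenizes quoted attribute values itself, a ">" inside an attribute does not end the tag early.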

Answer 3

+1 A:

/<(.|\n)*?>/g

matches all HTML tags, including any attributes inside the tags

To exclude the strong and em tags, add a negative lookahead:

(?!strong|em)

Combined, this matches all HTML tags except strong and em

<((?!strong|em).|\n)*?>

unigogo 2009-02-05 07:09:08

"<((?!strong|em).|\n)*?>" doesn't quite work: it won't match any tags that start with <strong or <em. For example: <stronger>test</stronger> and <embark>test</embark> aren't returned as matches.

Chris 2009-04-25 00:20:41
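The behaviour described in this comment can be reproduced in Python's re module. The lookahead in the answer's pattern is re-checked before every character, so a tag is skipped whenever "strong" or "em" appears anywhere inside it, not only as the tag name. The second pattern (SKETCH, a hypothetical name) is one possible adjustment for a learning experiment: it checks the lookahead once, right after "<", allows an optional "/", and requires a word boundary after the name.

```python
import re

# Pattern from the answer above.
ORIGINAL = re.compile(r"<((?!strong|em).|\n)*?>")

# Assumed adjustment: a single lookahead anchored at the tag name.
SKETCH = re.compile(r"<(?!/?(?:strong|em)\b)[^>]*>")


def strip_tags(pattern, html):
    """Remove every substring that the pattern matches."""
    return pattern.sub("", html)


mixed = "<stronger>a</stronger> <strong>b</strong> <em>c</em> <b>d</b>"
original_result = strip_tags(ORIGINAL, mixed)
sketch_result = strip_tags(SKETCH, mixed)
```

Here original_result still contains <stronger>, while sketch_result is "a <strong>b</strong> <em>c</em> d". Note that the adjusted pattern leaves a tag such as <strong onclick="..."> untouched, attributes included, so it is not suitable as a security filter.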

Answer 4

+2 A:

Don't use regex for parsing [X]HTML.

Doubly especially definitely NEVER use regex for parsing [X]HTML as a security measure.

An HTML parser (or tidier followed by an XML parser) is the only workable approach for whitelisting.

"/<(.|\n)*?>/g matches all HTML tags pairs including attributes in the tags"

No.

<a href=">" onmouseover="attackCode()">

and a thousand other possibilities, both valid and malformed-but-the-browser-will-still-understand-it.

bobince 2009-02-05 12:39:20
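To see why the example tag in this answer defeats the pattern, here is a short check using Python's re module (the variable names are illustrative only). The lazy match stops at the first ">", which sits inside the quoted href value.

```python
import re

TAG = re.compile(r"<(.|\n)*?>")

payload = '<a href=">" onmouseover="attackCode()">'

# The match ends at the ">" inside the attribute value.
matched_text = TAG.match(payload).group(0)

# Stripping "tags" therefore leaves the rest of the real tag behind.
leftover = TAG.sub("", payload)
```

matched_text is '<a href=">' and leftover is '" onmouseover="attackCode()">', so the pattern never sees the tag as a whole.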
