I'm weak on regular expressions, so I've been practicing with them to improve. One thing I've been trying to do is remove all HTML elements except for a list of allowed ones.

I've managed to do the reverse -- remove a specified list of elements:

<\/?(strong|em|a)[^>]*>

However, I want the opposite: to remove every element except the ones listed.

A: 

Assuming PCRE, use (?!elements) instead of (elements).

chaos
Halfway there. It still matches the closing tag, e.g. <strong>test</strong> returns <strong>test
Zurahn
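One way to address the closing-tag problem from the comment above is to let the lookahead also skip an optional slash before the tag name. The following is only a sketch for the learning exercise, written in Python (whose re module supports the same negative lookahead as PCRE); the names ALLOWED, DISALLOWED_TAG and strip_disallowed are made up for illustration.

```python
import re

# Hypothetical allow-list; the tag names are the ones from the question.
ALLOWED = ("strong", "em", "a")

# "<", then a negative lookahead that rejects an optional "/" followed by
# an allowed name ending at a word boundary, then the rest of the tag.
DISALLOWED_TAG = re.compile(r"<(?!/?(?:%s)\b)[^>]*>" % "|".join(ALLOWED))

def strip_disallowed(text):
    """Remove every tag whose name is not in ALLOWED (opening or closing)."""
    return DISALLOWED_TAG.sub("", text)

print(strip_disallowed("<p><strong>test</strong></p>"))
```

The \b after the name keeps <stronger> or <article> from being mistaken for <strong> or <a>. The pattern still assumes that no attribute value contains a ">" character, so it is not suitable as a security filter.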
+3  A: 

Do NOT try parsing with regular expressions

Instead use a real parser

grom
This is not meant as an actual production implementation, but as a learning experiment.
Zurahn
Fair enough, I just had to post this as a warning to others.
grom
+1  A: 
/<(.|\n)*?>/g

matches all HTML tags, including any attributes inside them.

To exclude the strong and em tags, add a negative lookahead:

(?!strong|em)

The following matches all HTML tags except strong and em:

<((?!strong|em).|\n)*?>
unigogo
"<((?!strong|em).|\n)*?>" doesn't quite work: it won't match any tags that start with <strong or <em. For example: <stronger>test</stronger> and <embark>test</embark> aren't returned as matches.
Chris
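The behaviour described in the comment above can be reproduced with a short experiment. The sketch below uses Python's re module; the sample string and the ANCHORED pattern are assumptions added for illustration and are not part of the original answer. Because the lookahead in the original pattern is applied before every character inside the tag, it rejects any tag that contains "strong" or "em" anywhere, including inside an attribute value such as class="item".

```python
import re

# Pattern from the answer: the lookahead is re-checked at every character.
TEMPERED = re.compile(r"<((?!strong|em).|\n)*?>")

# Hypothetical alternative: check the tag name once, right after "<",
# allow an optional "/" and require a word boundary after the name.
ANCHORED = re.compile(r"<(?!/?(?:strong|em)\b)[^>]*>")

sample = '<stronger>x</stronger> <div class="item">y</div>'

tempered_result = TEMPERED.sub("", sample)
anchored_result = ANCHORED.sub("", sample)

print(tempered_result)  # most tags survive, only </div> is removed
print(anchored_result)  # every tag other than strong/em is removed
```

With the anchored variant, only tags whose name is exactly strong or em are left alone. It is still a regex-level approximation and not a parser.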
+2  A: 

Don't use regex for parsing [X]HTML.

Doubly especially definitely NEVER use regex for parsing [X]HTML as a security measure.

An HTML parser (or tidier followed by an XML parser) is the only workable approach for whitelisting.

"/<(.|\n)*?>/g matches all HTML tags pairs including attributes in the tags"

No.

<a href=">" onmouseover="attackCode()">

and a thousand other possibilities, both valid and malformed-but-the-browser-will-still-understand-it.

bobince
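To make the point above concrete, here is a rough sketch comparing the regex with a parser-based filter on the same input. It uses Python's standard html.parser module; the class AllowlistFilter, the function keep_allowed and the choice to drop all attributes are assumptions made for this illustration. It is not a vetted sanitizer, and a maintained sanitizing library would be the safer choice for real input.

```python
import re
from html import escape
from html.parser import HTMLParser

ALLOWED = {"strong", "em", "a"}  # hypothetical allow-list

class AllowlistFilter(HTMLParser):
    """Re-emit only allowed tags (without attributes) and escaped text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED:
            self.out.append(f"<{tag}>")  # attributes are dropped on purpose

    def handle_endtag(self, tag):
        if tag in ALLOWED:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text arrives unescaped, so escape it again before output.
        self.out.append(escape(data, quote=False))

def keep_allowed(html_text):
    parser = AllowlistFilter()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)

attack = '<a href=">" onmouseover="attackCode()">link</a>'

# The regex stops at the first ">", which sits inside the href value.
regex_result = re.sub(r"<(.|\n)*?>", "", attack)
parser_result = keep_allowed(attack)

print(regex_result)
print(parser_result)
```

The regex leaves the tail of the tag, including the onmouseover attribute text, in the output, while the parser treats the quoted ">" as part of the attribute value and sees a single a element.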