I know there have been years of discussion about the best ways to filter data in PHP, but I would like to take the whitelist approach in my current project.

I only want users to be able to use the following HTML:

<b>bold</b>
<i>italics</i>
<u>underline</u>
<s>strikethrough</s>
<big>Big size</big>
<small>Small size</small>

Hyperlink <a href="http://www.site.com">website</a>

A Bulleted List:
<ul>
<li>One Item</li>
<li>Another Item</li>
</ul>

An Ordered List:
<ol>
<li> First Item</li>
<li> Second Item</li>
</ol>

<blockquote>Because it is indented</blockquote>

<h1>Heading 1</h1>
<h2>Heading 2</h2>
<h3>Heading 3</h3>

Can anyone show me the best-performing way to do this in PHP? In the past I have only allowed all HTML minus certain tags.

+6  A: 

I believe the HTML Purifier Library will work nicely:

http://htmlpurifier.org/

HTML Purifier is a standards-compliant HTML filter library written in PHP. HTML Purifier will not only remove all malicious code (better known as XSS) with a thoroughly audited, secure yet permissive whitelist, it will also make sure your documents are standards compliant, something only achievable with a comprehensive knowledge of W3C's specifications. Tired of using BBCode due to the current landscape of deficient or insecure HTML filters? Have a WYSIWYG editor but never been able to use it? Looking for high-quality, standards-compliant, open-source components for that application you're building? HTML Purifier is for you!

gahooa
this is nice but very very bulky, in fact it's HUGE
jasondavis
Well yes, filtering HTML is actually a very hard job.
bobince
+1  A: 

The simplest solution would be strip_tags(), which accepts a second argument containing allowable tags:

strip_tags($string, "<b><i><u><a><s><big><small><ul><li><ol><blockquote><h1><h2><h3>");
Mark
that looks nice if it works well
jasondavis
It's no good. strip_tags is a simplistic approach to a difficult problem, which has always had many workarounds to get bad content in. Even if it were bug-free, the lack of attribute filtering leaves you no way to disallow harmful constructs like `<a onmouseover="do_script_injection">`.
bobince
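
The attribute problem raised in the comment above can be seen in a few lines. This is only a small demonstration, not a recommended filter: the input string is made up for illustration, and the allowed-tag list is the one from the question.

```php
<?php
// Allowed tags from the question, in the format strip_tags() expects.
$allowed = '<b><i><u><s><big><small><a><ul><ol><li><blockquote><h1><h2><h3>';

// Made-up input: a disallowed <script> plus an allowed <a> carrying an event handler.
$input = '<b>ok</b><script>alert(1)</script>'
       . '<a href="http://www.site.com" onmouseover="do_script_injection">website</a>';

$output = strip_tags($input, $allowed);

// The <script> tags are removed (their text content is kept), but the
// onmouseover attribute on the allowed <a> tag passes through untouched.
echo $output, "\n";
```

So even with a tag whitelist, strip_tags() alone does not give you an attribute whitelist.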
+1  A: 

Another route is using strip_tags with the second argument.

http://php.net/manual/en/function.strip-tags.php

Galen
+1  A: 

I would run the submitted code through Tidy to normalize it first, and then use XPath or apply XSLT to select only the allowed elements. This way, nothing can leak. Do bear in mind, too, that in any given website situation you would probably have thousands if not hundreds of thousands of read requests for every write request [that uses Tidy and XPath/XSLT], so on average the performance impact is negligible. If you are doing batch processing, on the other hand, the cost may matter.

Edit: oh, and DON'T do this with regular expressions. HTML is not a regular language, so it is mathematically impossible to do it correctly that way.

mst
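
A rough sketch of the "normalize, then keep only allowed elements" idea from the answer above. It swaps in PHP's DOM extension for Tidy: DOMDocument::loadHTML() does the normalizing, and DOMXPath selects the elements to check. The function name whitelist_html, the whitelist array, and the decision to allow only http/https links are assumptions made for illustration, and this has not been audited the way HTML Purifier has.

```php
<?php
// Hypothetical helper: keeps whitelisted tags/attributes, unwraps everything else.
function whitelist_html(string $html): string
{
    // Tag name => allowed attribute names (tag list taken from the question).
    $allowed = [
        'b' => [], 'i' => [], 'u' => [], 's' => [], 'big' => [], 'small' => [],
        'ul' => [], 'ol' => [], 'li' => [], 'blockquote' => [],
        'h1' => [], 'h2' => [], 'h3' => [], 'a' => ['href'],
    ];

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);   // silence warnings on tag soup
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // Snapshot the element list first, because the loop modifies the tree.
    $xpath = new DOMXPath($doc);
    foreach (iterator_to_array($xpath->query('.//*', $body), false) as $node) {
        if ($node->parentNode === null) {
            continue; // already detached
        }
        $tag = strtolower($node->nodeName);

        if ($tag === 'script' || $tag === 'style') {
            $node->parentNode->removeChild($node);   // drop element and content
            continue;
        }
        if (!isset($allowed[$tag])) {
            // Unwrap: keep the children, discard the element itself.
            while ($node->firstChild !== null) {
                $node->parentNode->insertBefore($node->firstChild, $node);
            }
            $node->parentNode->removeChild($node);
            continue;
        }
        foreach (iterator_to_array($node->attributes, false) as $attr) {
            $name = strtolower($attr->nodeName);
            $keep = in_array($name, $allowed[$tag], true);
            if ($keep && $name === 'href') {
                // Assumption: only absolute http(s) links are acceptable.
                $scheme = strtolower((string) parse_url($attr->nodeValue, PHP_URL_SCHEME));
                $keep = in_array($scheme, ['http', 'https'], true);
            }
            if (!$keep) {
                $node->removeAttribute($attr->nodeName);
            }
        }
    }

    // Serialize only the children of <body>, i.e. the cleaned fragment.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$clean = whitelist_html(
    '<b>bold</b><script>alert(1)</script>'
    . '<a href="javascript:alert(1)" onmouseover="x()">link</a><span>text</span>'
);
echo $clean, "\n";
```

Because the whitelist is applied to the parsed tree rather than to the raw string, malformed markup is resolved by the parser before any decision is made about what to keep.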