How do I limit the types of HTML that a user can input into a textbox? I'm running a small forum using some custom software that I'm beta testing, but I need to know how to limit the HTML input. Any suggestions?

A: 

Once the text is submitted, you could strip any/all tags that don't match your predefined set using a regex in PHP.

It would look something like the following:

find the start of a tag (<)
if the tag name is not on the allowed list, remove the whole tag (from < to >)
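
A minimal sketch of those two steps in PHP. The function name filter_tags and the sample whitelist are made up for illustration. The regex is deliberately simple: it leaves attributes on allowed tags untouched and ignores HTML comments, so treat it as a starting point rather than a complete filter.

```php
<?php
// Hypothetical helper: keep only whitelisted tags, drop every other tag.
// The text between removed tags is left in place.
function filter_tags(string $html, array $allowed): string
{
    // Match anything that looks like an opening or closing tag
    // and capture the tag name.
    return preg_replace_callback(
        '/<\s*\/?\s*([a-zA-Z][a-zA-Z0-9]*)[^>]*>/',
        function (array $m) use ($allowed): string {
            // Keep the tag only if its name is on the whitelist.
            return in_array(strtolower($m[1]), $allowed, true) ? $m[0] : '';
        },
        $html
    );
}

echo filter_tags('<b>bold</b> <script>alert(1)</script>', ['b', 'i', 'p']);
// Prints: <b>bold</b> alert(1)
```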
warren
The important thing is to use a white list - look for the tags you will permit, rather than search for tags you don't want.
Ken Ray
A: 
  1. Parse the input provided and strip out all HTML tags that don't exactly match the list you are allowing. This can either be a complex regex, or you can do a stateful iteration through the char[] of the input string, building the allowed output string and stripping unwanted attributes on tags like img.

  2. Use a different code system (BBCode, Markdown)

  3. Find some code online that already does this, to use as a basis for your implementation. For example, Slashcode must perform this, so look for its implementation in the Perl source and use the regexes (that I assume are there).
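
As a rough illustration of option 2, here is a sketch that supports a tiny BBCode subset. The function name bbcode_to_html and the two supported markers are assumptions for the example, not part of any existing library. The idea is to escape all HTML first and then translate only the markers you choose to support.

```php
<?php
// Hypothetical helper: render a tiny subset of BBCode as HTML.
function bbcode_to_html(string $text): string
{
    // Escape everything first so raw HTML typed by the user is shown as text.
    $html = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

    // Then translate only the markers we choose to support.
    $rules = [
        '/\[b\](.*?)\[\/b\]/s' => '<strong>$1</strong>',
        '/\[i\](.*?)\[\/i\]/s' => '<em>$1</em>',
    ];
    return preg_replace(array_keys($rules), array_values($rules), $html);
}

echo bbcode_to_html('[b]Hello[/b] <script>alert(1)</script>');
// Prints: <strong>Hello</strong> &lt;script&gt;alert(1)&lt;/script&gt;
```

Because the user never writes HTML directly, the whitelist is simply the set of markers in $rules.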

JeeBee
+2  A: 

You didn't state what the forum was built with, but if it's PHP, check out:

http://htmlpurifier.org/

Library Features: Whitelist, Removal, Well-formed, Nesting, Attributes, XSS safe, Standards safe

micahwittman
+2  A: 

i'd suggest a slightly alternative approach:

  • don't filter incoming user data (beyond prevention of sql injection). user data should be kept as pure as possible.
  • filter all outgoing data from the database. this is where things like tag stripping, etc. should happen.

keeping user data unmodified allows you more flexibility in how it's displayed. filtering all outgoing data is a good habit to get into (along the lines of the "never trust data" meme).
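
a minimal sketch of filtering on output, assuming the post body was stored exactly as the user typed it. the helper name e and the sample string are made up for the example; htmlspecialchars() is the standard PHP function doing the work.

```php
<?php
// Hypothetical helper: escape a stored value for an HTML context at display time.
function e(string $value): string
{
    return htmlspecialchars($value, ENT_QUOTES, 'UTF-8');
}

// $stored stands in for a post body read unmodified from the database.
$stored = '<b>Hi</b> & "welcome"';

echo '<div class="post">' . e($stored) . '</div>';
// Prints: <div class="post">&lt;b&gt;Hi&lt;/b&gt; &amp; &quot;welcome&quot;</div>
```

since the stored copy is untouched, the same post can later be rendered with a different filter (a tag whitelist, plain text for email, and so on) without losing anything.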

Owen
A: 

Regardless of what you use, make sure you know what kind of HTML content can be dangerous.

e.g. a <script> tag is pretty obvious, but a <style> tag is just as bad in IE, because it can invoke JScript commands.

In fact, any style="..." attribute can invoke script in IE.

<object> would be one more tag to be wary of.
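
Attributes matter because PHP's strip_tags() leaves the attributes of allowed tags untouched. The sketch below shows that, then drops every attribute from the tags that remain. The function name strip_attributes is hypothetical, and the regex assumes attribute values contain no ">" character, so it is an illustration rather than a hardened filter.

```php
<?php
// Hypothetical helper: keep whitelisted tags but discard all their attributes.
function strip_attributes(string $html, string $allowedTags): string
{
    // strip_tags() removes tags that are not listed,
    // but leaves attributes on the tags it keeps.
    $kept = strip_tags($html, $allowedTags);

    // Rewrite each remaining tag as a bare <name> or </name>.
    return preg_replace('/<(\/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>/', '<$1$2>', $kept);
}

$input = '<p style="color:red" onclick="alert(1)">Hello</p><object></object>';

echo strip_tags($input, '<p>'), "\n";
// Prints: <p style="color:red" onclick="alert(1)">Hello</p>

echo strip_attributes($input, '<p>'), "\n";
// Prints: <p>Hello</p>
```

Dropping every attribute also removes harmless ones such as href on links, so a real filter would need a per-tag attribute whitelist instead.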

scunliffe
A: 

PHP comes with a simple function, strip_tags, to strip HTML tags. It takes an optional list of tags that should not be stripped.

Example #1 strip_tags() example

<?php
$text = '<p>Test paragraph.</p><!-- Comment --> <a href="#fragment">Other text</a>';
echo strip_tags($text);
echo "\n";

// Allow <p> and <a>
echo strip_tags($text, '<p><a>');
?>

The above example will output:

Test paragraph. Other text
<p>Test paragraph.</p> <a href="#fragment">Other text</a>

Personally, for a forum, I would use BBCode or Markdown because of the amount of support available and the features they provide, such as live preview.

Seamus