I am looking into using HTML Purifier to ensure that a user-supplied string (representing a person's name) is sanitized.

I do not want to allow any HTML tags, scripts, or other markup; I just want alphanumeric characters and normal punctuation.

The sheer number of options available for HTML Purifier is daunting and, as far as I can see, the docs do not seem to have a beginning, middle, or end.

see: http://htmlpurifier.org/docs

Is there a simple hello-world tutorial online for HTML Purifier that shows how to sanitize a string by removing all the bad stuff from it?

I am also considering just using strip_tags() or PHP's built-in data sanitizing.

A: 

The easiest way to remove all non-alphanumeric characters from a string is, I think, to use Regex.Replace() as follows:

stringToCleanUp = Regex.Replace(stringToCleanUp, @"\W", "");

While \w (lowercase) matches any ‘word’ character, equivalent to [a-zA-Z0-9_], \W (uppercase) matches any ‘non-word’ character, i.e. anything not matched by \w. The code above finds every \W match and replaces it with nothing.

As an alternative, if you don’t want to allow the underscore, you can use [^a-zA-Z0-9], like this:

stringToCleanUp = Regex.Replace(stringToCleanUp, "[^a-zA-Z0-9]", "");
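
Regex.Replace() is .NET syntax; since the question is about PHP, the same idea can be sketched with preg_replace(). The helper name keep_alphanumeric() is made up for illustration. Note that this approach removes the punctuation of a tag but leaves its letters behind, so it may not be what you want for names.

```php
<?php
// Hypothetical helper: keeps only ASCII letters and digits.
function keep_alphanumeric(string $input): string
{
    // Every character outside a-z, A-Z and 0-9 is replaced with nothing.
    return preg_replace('/[^a-zA-Z0-9]/', '', $input);
}

// Spaces, quotes and angle brackets go, but the "b" of <b> stays.
echo keep_alphanumeric("John <b>O'Neil</b> 3rd!"), "\n"; // prints: JohnbONeilb3rd
```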

omadmedia
Thanks for these, Mikulas Dite and omadmedia. I will probably add some regexes into the mix. However, I would still like to know if there is a hello-world tutorial for HTML Purifier. I guess someone would have pointed to one by now if there were.
JW
A: 

You should validate input based on its content. For a name, for example, use a regexp such as:

'/^([A-Z][a-z]+[ ]?)+$/' // anchored so the whole string must match; ASCII only, but easy to extend

This validation should do the job well. Then escape the output when printing it on the page, preferably with htmlspecialchars().
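
A minimal sketch of that validate-then-escape flow, assuming a slightly stricter variant of the pattern (single spaces between words, \A and \z so a trailing newline is rejected). The function names is_valid_name() and escape_html() are hypothetical.

```php
<?php
// Hypothetical helper: one or more capitalised ASCII words separated by single spaces.
function is_valid_name(string $name): bool
{
    // \A and \z anchor to the very start and end of the string.
    return preg_match('/\A[A-Z][a-z]+(?: [A-Z][a-z]+)*\z/', $name) === 1;
}

// Hypothetical helper: escape at output time, whatever validation decided.
function escape_html(string $value): string
{
    return htmlspecialchars($value, ENT_QUOTES, 'UTF-8');
}

var_dump(is_valid_name('John Smith'));        // bool(true)
var_dump(is_valid_name('<script>x</script>')); // bool(false)
echo escape_html('<b>Tom</b>'), "\n";          // prints: &lt;b&gt;Tom&lt;/b&gt;
```

A pattern this strict also rejects real names such as O'Neil, so it would need extending before use on real data.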

Mikulas Dite
A: 

If you are trying to prevent code injection attacks, just escape the data, then store and print it exactly as the user entered it.

For example: to avoid SQL injection problems in MySQL, use the mysql_real_escape_string() function or similar to escape values placed in the SQL statement.

Another example: when writing data to an HTML document, pass the data through htmlentities(), so the data will appear as the user entered it.
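
The mysql_* functions were removed in PHP 7, so the "or similar" today would usually be a prepared statement. Here is a rough sketch of the store-as-entered, escape-on-output idea using PDO. As an assumption to keep it self-contained, an in-memory SQLite database stands in for MySQL; the table and column names are invented.

```php
<?php
// Assumption: in-memory SQLite stands in for MySQL so the sketch runs without a server.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE people (name TEXT)');

$name = "Robert'); DROP TABLE people;--";

// The value is bound as a parameter, never spliced into the SQL text.
$insert = $pdo->prepare('INSERT INTO people (name) VALUES (:name)');
$insert->execute([':name' => $name]);

// The row comes back exactly as the user typed it.
$stored = $pdo->query('SELECT name FROM people')->fetchColumn();

// Escape only at the point where the value is written into HTML.
echo htmlentities($stored, ENT_QUOTES, 'UTF-8'), "\n";
```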

fjfnaranjo
Thanks, but no, it is not quite what I wanted. The main thing I am looking for is to strip all markup and scripts from user input, leaving alphabetic, numeric, and grammatical characters: allow '<', disallow '<a></a>'; allow '>', disallow '<?php ... ?>', etc.
JW
A: 

You can use something like htmlspecialchars() to preserve the characters the user typed in without the browser interpreting them.

NeuroScr
A: 

I usually clean all user input before sending it to my database with the following:

mysql_real_escape_string( htmlentities( strip_tags($str) ));
David Morrow
A: 

For simplicity, you can either use strip_tags(), or replace occurrences of <, >, and & with &lt;, &gt;, and &amp;, respectively. This definitely isn't the best solution, but it is the quickest.
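
Both options might look roughly like this. The function name escape_angle_brackets() is made up for illustration; the one detail worth noting is that & has to be replaced first.

```php
<?php
// Hypothetical helper: manual replacement of the three characters.
function escape_angle_brackets(string $text): string
{
    // '&' must come first, otherwise the '&' in '&lt;' would be escaped again.
    return str_replace(['&', '<', '>'], ['&amp;', '&lt;', '&gt;'], $text);
}

echo escape_angle_brackets('<a> & b'), "\n";   // prints: &lt;a&gt; &amp; b
echo strip_tags('<b>Tom</b> & Jerry'), "\n";   // prints: Tom & Jerry
```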

Propeng
A: 

Found this a week ago... LOVE it.

"A simple PHP HTML DOM parser written in PHP5+, supports invalid HTML, and provides a very easy way to handle HTML elements." http://simplehtmldom.sourceforge.net/

// Example (assumes the library's simple_html_dom.php has been included)
$html = str_get_html("<div>foo <b>bar</b></div>");
$e = $html->find("div", 0);

echo $e->tag; // prints: div
echo $e->outertext; // prints: <div>foo <b>bar</b></div>
echo $e->innertext; // prints: foo <b>bar</b>
echo $e->plaintext; // prints: foo bar

You can also loop through and remove individual tags, etc. The docs and examples are pretty good... I found it easy to use in quite a few places. :-)

This seems like it might fit the bill. I hadn't considered using a DOM parser for this - but it makes sense. Thanks for the tip.
JW
A: 

Hi, I recommend this: http://nadeausoftware.com/articles/2007/09/php_tip_how_strip_html_tags_web_page

It will remove all HTML tags. I also recommend reading the whole linked article, which explains how to combine this with other functions to end up with clean UTF-8 text.

/**
* Remove HTML tags, including invisible text such as style and
* script code, and embedded objects.  Adds line breaks around
* block-level tags to prevent word joining after tag removal.
* 
* PHP's strip_tags( ) function will remove the tags, but it forgets to remove
* styles, scripts, and other unwanted text between the tags. When it removes 
* the tags it also joins together the words before and after the tags. 
* For block-level tags, like <p>, this is the wrong thing to do.
* 
* From:
* http://nadeausoftware.com/articles/2007/09/php_tip_how_strip_html_tags_web_page
*
* @param string $html
* @return string Clean of all kind of tags
*/
function strip_html_tags( $html )
{
 $text = preg_replace(
  array(
   // Remove invisible content
   '@<head[^>]*?>.*?</head>@siu',
   '@<style[^>]*?>.*?</style>@siu',
   '@<script[^>]*?.*?</script>@siu',
   '@<object[^>]*?.*?</object>@siu',
   '@<embed[^>]*?.*?</embed>@siu',
   '@<applet[^>]*?.*?</applet>@siu',
   '@<noframes[^>]*?.*?</noframes>@siu',
   '@<noscript[^>]*?.*?</noscript>@siu',
   '@<noembed[^>]*?.*?</noembed>@siu',
   // Add line breaks before and after blocks
   '@</?((address)|(blockquote)|(center)|(del))@iu',
   '@</?((div)|(h[1-9])|(ins)|(isindex)|(p)|(pre))@iu',
   '@</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))@iu',
   '@</?((table)|(th)|(td)|(caption))@iu',
   '@</?((form)|(button)|(fieldset)|(legend)|(input))@iu',
   '@</?((label)|(select)|(optgroup)|(option)|(textarea))@iu',
   '@</?((frameset)|(frame)|(iframe))@iu',
   ),
  array(
   ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
   "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0",
   "\n\$0", "\n\$0",
   ),
   $html );
 $text = strip_tags( $text );
 $text = trim( $text );
 return $text;
}

This will convert something like:

<p><b>Welcome</b> to my <a href="example.com">homepage</a></p>

into:

Welcome to my homepage

Rafa
A: 

I've always thought CodeIgniter's XSS cleaning class was quite good, but more recently I've turned to Kohana.

Have a look at their xss_clean method:

http://github.com/kohana/core/blob/c443c44922ef13421f4a3af5b414e19091bbdce9/classes/kohana/security.php

Andrei Serdeliuc