What is the best way to "sanitize" content? An example:

Example - Before sanitize:

Morbi mollis ante vitae massa suscipit a tempus est pellentesque. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Nulla mattis iaculis consectetur.
Morbi mollis ante vitae est pellentesque. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Nulla mattis iaculis consectetur.

Example - After sanitize:

<p>Morbi mollis ante vitae massa suscipit a tempus est pellentesque. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Nulla mattis iaculis consectetur.</p>

<p>Morbi mollis ante vitae est pellentesque. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Nulla mattis iaculis consectetur.</p>

What it should do

  • It should add p-tags in place of line breaks.
  • It should remove extra spaces, such as triple spaces.
  • It should remove double line breaks.
  • It should remove tabs.
  • It should remove line breaks and spaces before the content if any.
  • It should remove line breaks and spaces after the content if any.

Right now I use the str_replace function. Is there a better solution for this?

I want the function to look like this:

function sanitize($content)
{
    // Do the magic!
    return $content;
}
+3  A: 

Take a look at the Sanitize class of CakePHP.

Enrico Carlesso
What a useless class.
Col. Shrapnel
+4  A: 
function sanitize($content) {
  // leading white space
  $content = preg_replace('!^\s+!m', '', $content);

  // trailing white space
  $content = preg_replace('![ \t]+$!m', '', $content);

  // tabs and multiple white space
  $content = preg_replace('![ \t]+!', ' ', $content);  

  // multiple newlines
  $content = preg_replace('![\r\n]+!', "\n", $content);

  // paragraphs
  $content = preg_replace('!(.+)!m', '<p>$1</p>', $content);

  // done
  return $content;
}

Example:

$s = <<<END
Morbi mollis ante vitae massa suscipit a tempus est pellentesque. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Nulla mattis iaculis consectetur.
Morbi mollis ante vitae est pellentesque. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Nulla mattis iaculis consectetur.
END;

$out = sanitize($s);

Output:

<p>Morbi mollis ante vitae massa suscipit a tempus est pellentesque. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Nulla mattis iaculis consectetur.</p> 
<p>Morbi mollis ante vitae est pellentesque. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Nulla mattis iaculis consectetur.</p>
cletus
Won't most of these need `s` modifiers indicating they should match against more than one line?
LeguRi
@Richard the `s` (`DOTALL`) modifier only affects what `.` matches (whether or not it matches newlines). The only expression that uses `.` is the last one and I'm taking advantage of it not matching newlines so no, the `s` modifier isn't required anywhere.
cletus
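To make the modifier point concrete, here is a small sketch (the sample string and variable names are mine, not from the answer) showing that `s` changes what `.` matches, while `m` only changes where `^` and `$` can match:

```php
<?php
// Two lines separated by a single newline.
$text = "first line\nsecond line";

// Without `s`, `.` stops at the newline, so `.+` matches one line at a time.
preg_match_all('!.+!', $text, $perLine);

// With `s`, `.` also matches the newline, so `.+` swallows the whole string.
preg_match_all('!.+!s', $text, $wholeText);

// `m` lets `^` match at the start of every line, not just the start of the string.
preg_match_all('!^\w+!m', $text, $lineStarts);

print_r($perLine[0]);    // two matches, one per line
print_r($wholeText[0]);  // one match containing both lines
print_r($lineStarts[0]); // "first" and "second"
```

This is why the paragraph step in the answer can rely on `(.+)` with only the `m` modifier: each match is exactly one line.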
How would this treat the following? $s = "<script>alert('owned');</script>"; echo sanitize($s);
thomasrutter
Not that cletus' answer is wrong (it's probably the most literal answer to the original question), just that the original question asker may not have considered everything he may need to sanitize. :)
thomasrutter
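For the `<script>` case raised above, the posted function would simply wrap the tag in `<p>` and pass it through. A minimal sketch, assuming the input is meant to be plain text with no markup allowed, is to escape it before building paragraphs. The name `sanitize_escaped` is hypothetical and this is only one possible variant, not the answer's own code:

```php
<?php
// Hypothetical variant of sanitize(): escape first, then clean up whitespace.
function sanitize_escaped($content) {
  // Neutralize HTML special characters so user-supplied tags render as text.
  $content = htmlspecialchars($content, ENT_QUOTES, 'UTF-8');

  // Drop line breaks and spaces before and after the whole text.
  $content = trim($content);

  // Split on runs of line breaks, clean each line, and skip empty ones.
  $paragraphs = array();
  foreach (preg_split('![\r\n]+!', $content) as $line) {
    // Collapse tabs and repeated spaces into a single space.
    $line = trim(preg_replace('![ \t]+!', ' ', $line));
    if ($line !== '') {
      $paragraphs[] = '<p>' . $line . '</p>';
    }
  }

  return implode("\n", $paragraphs);
}

echo sanitize_escaped("<script>alert('owned');</script>");
```

If some HTML should be allowed through, escaping everything is too blunt and a whitelist-based filter would be needed instead.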
+5  A: 
  • It should add p-tags in place of line breaks.

Run it through something like the Textile interpreter, or Markdown, or any other humane markup language which suits your needs.

  • It should remove extra spaces, such as triple spaces.
  • It should remove double line breaks.
  • It should remove tabs.
  • It should remove line breaks and spaces before the content if any.
  • It should remove line breaks and spaces after the content if any.

Why bother? When HTML is rendered as a document, multiple white space characters are reduced to a single space, no? Most of your problems solve themselves.

LeguRi
+1  A: 

Tidy!!

There is a pretty outdated article on Zend, but check out the example they give:

http://devzone.zend.com/article/761

WishCow
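As a rough sketch of what using Tidy from PHP can look like, assuming the tidy extension is installed (the sample input and variable names are mine, not taken from the linked article). Note that Tidy repairs existing markup; as far as I know it will not turn plain line breaks into paragraphs on its own:

```php
<?php
// A fragment with unclosed tags.
$dirty = "<p>Morbi mollis <b>ante vitae\n\n<p>Nulla mattis";

if (function_exists('tidy_repair_string')) {
  // 'show-body-only' keeps Tidy from wrapping the result in <html>/<body>.
  $clean = tidy_repair_string($dirty, array('show-body-only' => true), 'utf8');
} else {
  // The tidy extension is not loaded; leave the input untouched.
  $clean = $dirty;
}

echo $clean;
```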