Sometimes when a user is copying and pasting data into an input form we get characters like the following:

didn’t,“ for beginning quotes and †for end quote, etc ...

I use this routine to sanitize most input on web forms (I wrote it a while ago but am also looking for improvements):

function fnSanitizePost($data) // escapes, strips and trims all members of the post array
{
    if (is_array($data))
    {
        $areturn = array();
        foreach ($data as $skey => $svalue)
        {
            $areturn[$skey] = fnSanitizePost($svalue);
        }
        return $areturn;
    }
    else
    {
        if (!is_numeric($data))
        {
            // With magic quotes on, the input gets escaped twice, so we have to strip those slashes. Leaving data in your database with slashes in it is a bad idea.
            if (get_magic_quotes_gpc()) // gets current configuration setting of magic quotes
            {
                $data = stripslashes($data);
            }
            $data = pg_escape_string($data); // escapes a string for insertion into the database
            $data = strip_tags($data); // strips HTML and PHP tags from a string
        }
        $data = trim($data); // trims whitespace from beginning and end of a string
        return $data;
    }
}

I really want to prevent characters like the ones above from ever being stored in the database. Do I need to add some regex replacements to my sanitizing routine?

Thanks,

- Nicholas

+3  A: 

You want to check out PHP's utf8_decode function: "Converts a string with ISO-8859-1 characters encoded with UTF-8 to single-byte ISO-8859-1." It seems you're getting UTF-8 characters and the database is not able to handle them.

Another solution is to change the encoding of the database, if possible.

Nerdling
Just to ensure I understand, would changing the encoding of the database automatically cause these characters to be converted or just not allowed? It is obviously a result of copy/pasting from various sources but definitely needs to be "fixed."
Nicholas Kreidberg
Any values already in the database will not be seen as their UTF characters. You'll want to repopulate the database after the change or have a script that goes through and updates them.
Nerdling
I had a similar experience with a legacy system; you will probably save a lot of time by updating your db now. I agree with bobince: update everything to UTF-8.
GloryFish
+6  A: 

didn’t,“ for beginning quotes and †for end quote

That's not junk, those are legitimate “smart quote” characters that have been passed to you encoded as UTF-8, but read, incorrectly, as ISO-8859-1.
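To see where the odd characters come from, it can help to look at the raw bytes. This is only a minimal sketch, assuming PHP 7 or later (for the \u{...} escape); byteValues() is a hypothetical helper name, not something from the question.

```php
<?php
// U+2019 RIGHT SINGLE QUOTATION MARK, written with the PHP 7+ Unicode escape.
$quote = "\u{2019}";

// In UTF-8 this single character occupies three bytes.
echo strlen($quote), "\n";   // 3
echo bin2hex($quote), "\n";  // e28099

// Hypothetical helper: the decimal value of every byte in a string.
function byteValues(string $s): array
{
    // unpack() returns a 1-based array; array_values() reindexes it from 0.
    return array_values(unpack('C*', $s));
}

echo implode(' ', byteValues($quote)), "\n";  // 226 128 153
```

Read one byte at a time as Windows-1252 (which is how browsers typically treat a page labelled ISO-8859-1), bytes 226, 128 and 153 display as â, € and ™, which is exactly the ’ in the question.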

You can try to get rid of them, or try to convert them to plain old Latin-1 using utf8_decode, but if you do you'll have an application that won't let you type anything outside ASCII (or, at best, Latin-1), which in this day and age is a pretty poor show.

Better if you can manage it is to have all your pages served as UTF-8, all your form submissions coming in as UTF-8, and all your database contents stored as UTF-8. Ideally, your application would work internally with all Unicode characters, but unfortunately PHP as a language doesn't have native Unicode strings, so it's usually a case of holding all your strings also as UTF-8, and taking the risk of occasionally truncating a UTF-8 sequence and getting a �, unless you want to grapple with mbstring.
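As a rough illustration of the truncation risk, here is a sketch that uses only core string functions and PCRE, so it does not need mbstring. Both isValidUtf8() and truncateUtf8() are hypothetical helper names, and the code assumes the byte limit is never negative.

```php
<?php
// Hypothetical helper: true if $s is well-formed UTF-8.
// The /u modifier makes PCRE validate the subject string before matching.
function isValidUtf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

// Hypothetical helper: cut $s to at most $maxBytes bytes (assumed >= 0)
// without splitting a multi-byte UTF-8 sequence.
function truncateUtf8(string $s, int $maxBytes): string
{
    if (strlen($s) <= $maxBytes) {
        return $s;
    }
    $cut = $maxBytes;
    // Continuation bytes have the bit pattern 10xxxxxx. Step back until the
    // byte at the cut position is the start of a character.
    while ($cut > 0 && (ord($s[$cut]) & 0xC0) === 0x80) {
        $cut--;
    }
    return substr($s, 0, $cut);
}

$text = "a\u{2019}b";                       // 5 bytes: 61 e2 80 99 62
var_dump(isValidUtf8(substr($text, 0, 2))); // bool(false): the quote was split
var_dump(truncateUtf8($text, 2) === 'a');   // bool(true): cut moved back to a boundary
```

A check like isValidUtf8() could also be run on incoming form fields to reject input that did not actually arrive as UTF-8.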

$data = pg_escape_string($data); //escapes a string for insertion into the database

$data = strip_tags($data); //strips HTML and PHP tags from a string

You don't want to do that as a sanitisation measure coming into your application. Keep all your strings in plain text form for handling them, then pg_escape_string() only on the way out to a Postgres query, and htmlspecialchars() only on the way out to an HTML page.

Otherwise you'll get weird things like SQL escapes appearing on variables that have passed straight through the script to the output page, and no-one will be able to use a plain less-than character.
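A small sketch of the "escape on the way out" idea for the HTML side. The helper name h() and the sample string are made up for illustration; the same principle applies to calling pg_escape_string() only where the query is built.

```php
<?php
// Hypothetical helper: escape plain text at the moment it is written into HTML.
function h(string $text): string
{
    return htmlspecialchars($text, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

// The value is kept as plain text everywhere inside the script...
$comment = 'if a < b then say "hi"';

// ...and only escaped when it is placed into the page.
echo '<p>' . h($comment) . "</p>\n";
// prints <p>if a &lt; b then say &quot;hi&quot;</p>
```

Because the stored value is never altered, the less-than sign survives and no SQL escapes leak into the page.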

One thing you can usefully do as a sanitisation measure is to remove any control codes in strings (other than newlines, \n, which you might conceivably want).

$data = preg_replace('/[\x00-\x09\x0B-\x1F\x7F]/', '', $data);
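Put together, an input-cleaning routine along these lines might look roughly like the sketch below. cleanInput() is a hypothetical name, not a replacement endorsed by the question's author. The character class here runs up to \x1F so that every C0 control code except newline is covered, and tabs and carriage returns are removed as well.

```php
<?php
// Hypothetical helper: clean raw form input without escaping it for any
// particular output format (SQL and HTML escaping happen later, on output).
function cleanInput($data)
{
    if (is_array($data)) {
        // With a single array argument, array_map() preserves the keys.
        return array_map('cleanInput', $data);
    }
    // Drop control codes except newline (\x0A), then trim surrounding whitespace.
    $data = preg_replace('/[\x00-\x09\x0B-\x1F\x7F]/', '', (string) $data);
    return trim($data);
}

$post = ['name' => " Nick\x00 ", 'notes' => ["line1\nline2\x07"]];
var_dump(cleanInput($post) === ['name' => 'Nick', 'notes' => ["line1\nline2"]]); // bool(true)
```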
bobince
This approach, in addition to some str_replace calls (for special cases), has worked pretty well. It's not flawless but definitely better. Thank you.
Nicholas Kreidberg
A: 

Zend Framework's Zend_Filter and Zend_Filter_Input have very good tools for this.

raspi
A: 

I finally came up with a routine for replacing these characters. It took parsing some of the problematic strings one byte at a time and printing the decimal value that ord() returns for each byte. In doing so I learned that each smart quote character comes back as a set of 3 byte values (its UTF-8 encoding). Here is the routine I used to parse the string:

$str = "string_with_smart_quote_chars";

$ilen = strlen($str);
$sords = ''; // collects the decimal value of each byte

echo "$str\n\n";

for($i=0; $i<$ilen; $i++)
{
    $sords .= ord(substr($str, $i, 1))."  ";
}

echo "$sords\n\n";

Here are the str_replace() calls to "fix" the string:

$str = str_replace(chr(226).chr(128).chr(156), '"', $str); // start quote
$str = str_replace(chr(226).chr(128).chr(157), '"', $str); // end quote
$str = str_replace(chr(226).chr(128).chr(153), "'", $str); // for single quote

I am going to continue building up an array of these search/replacements which I am sure will continue to grow with the increasing use of these types of characters.
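One way such an array could be organised is sketched below, using strtr() with a lookup table. The names $smartCharMap and replaceSmartChars() are hypothetical, and the left single quote entry is an addition that was not in the calls above.

```php
<?php
// Hypothetical map of UTF-8 byte sequences to plain ASCII replacements.
// The decimal byte values match the chr() calls used above.
$smartCharMap = [
    "\xE2\x80\x9C" => '"',  // U+201C left double quote  (226 128 156)
    "\xE2\x80\x9D" => '"',  // U+201D right double quote (226 128 157)
    "\xE2\x80\x98" => "'",  // U+2018 left single quote  (226 128 152)
    "\xE2\x80\x99" => "'",  // U+2019 right single quote (226 128 153)
];

// Hypothetical helper: apply every replacement in a single pass.
function replaceSmartChars(string $str, array $map): string
{
    return strtr($str, $map);
}

echo replaceSmartChars("\u{201C}didn\u{2019}t\u{201D}", $smartCharMap), "\n";
// prints "didn't"
```

New characters then only need a new line in the map rather than another str_replace() call.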

I know that there are some canned routines for replacing these but I had no luck with any of them on the Solaris 10 platform that my scripts are running on.

-- Nicholas

Nicholas Kreidberg