ansaurus

Question

Routine for removing ALL junk from incoming strings?

Answer 1

+3 A:

You want to check out PHP's utf8_decode function, which converts a string with ISO-8859-1 characters encoded with UTF-8 to single-byte ISO-8859-1. It seems you're getting UTF-8 characters and the database is not able to handle those.

Another solution is to change the encoding of the database, if possible.

Nerdling 2009-02-27 19:28:18

Just to ensure I understand, would changing the encoding of the database automatically cause these characters to be converted or just not allowed? It is obviously a result of copy/pasting from various sources but definitely needs to be "fixed."

Nicholas Kreidberg 2009-02-27 19:46:03

Any values already in the database will not be seen as their UTF-8 characters. You'll want to repopulate the database after the change or have a script that goes through and updates them.

Nerdling 2009-02-27 20:30:31

Having had a similar experience with a legacy system, I think you will probably save a lot of time by updating your db now. I agree with bobince: update everything to UTF-8.

GloryFish 2009-02-27 21:22:42

Answer 2

+6 A:

didnâ€™t,â€œ for beginning quotes and â€ for end quote

That's not junk, those are legitimate “smart quote” characters that have been passed to you encoded as UTF-8, but read, incorrectly, as ISO-8859-1.
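To make the mechanism concrete, here is a minimal sketch that reproduces the garbling. It assumes the mbstring extension is available and that the misreading charset is Windows-1252, which is what browsers usually apply when a page claims ISO-8859-1; the helper name misread_as_cp1252 is made up for this illustration.

```php
<?php
// Hypothetical helper: simulate UTF-8 bytes being misread as Windows-1252.
function misread_as_cp1252(string $utf8): string
{
    // Treat every input byte as one Windows-1252 character, re-encode as UTF-8.
    return mb_convert_encoding($utf8, 'UTF-8', 'Windows-1252');
}

$original = "didn\u{2019}t";              // U+2019 RIGHT SINGLE QUOTATION MARK

echo bin2hex("\u{2019}"), "\n";            // e28099: one character, three bytes
echo misread_as_cp1252($original), "\n";   // didnâ€™t
```

The three bytes E2 80 99 are each shown as a separate character, which is why one smart quote turns into three odd-looking ones.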

You can try to get rid of them or try to parse them into plain old Latin-1 using utf8_decode, but smart quotes have no Latin-1 equivalent, and if you go that way you'll have an application that won't let you type anything outside Latin-1, which in this day and age is a pretty poor show.

Better, if you can manage it, is to have all your pages served as UTF-8, all your form submissions coming in as UTF-8, and all your database contents stored as UTF-8. Ideally, your application would work internally with Unicode characters, but PHP as a language doesn't have native Unicode strings. So it's usually a case of holding all your strings as UTF-8 bytes too, and taking the risk of occasionally truncating a UTF-8 sequence and getting a �, unless you want to grapple with mbstring.
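The truncation risk can be seen in a small sketch, assuming the mbstring extension is loaded. The sample word is only an illustration: substr() counts bytes, so it can cut a multi-byte character in half, while mb_substr() counts characters.

```php
<?php
$word = "caf\u{00E9}";   // "café": 4 characters, but 5 bytes in UTF-8

$byteCut = substr($word, 0, 4);              // stops after the 4th byte, splitting the "é"
$charCut = mb_substr($word, 0, 4, 'UTF-8');  // stops after the 4th character

var_dump(mb_check_encoding($byteCut, 'UTF-8'));  // bool(false): broken sequence
var_dump(mb_check_encoding($charCut, 'UTF-8'));  // bool(true)
```

A browser would render the dangling byte left in $byteCut as the replacement character.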

$data = pg_escape_string($data); //escapes a string for insertion into the database

$data = strip_tags($data); //strips HTML and PHP tags from a string

You don't want to do that as a sanitisation measure coming into your application. Keep all your strings in plain text form for handling them, then pg_escape_string() only on the way out to a Postgres query, and htmlspecialchars() only on the way out to an HTML page.

Otherwise you'll get weird things like SQL escapes appearing on variables that have passed straight through the script to the output page, and no-one will be able to use a plain less-than character.
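As a rough sketch of "escape only on the way out": the string stays as plain text inside the script and is escaped at the moment it is written into HTML. The helper name html_out and the sample comment are assumptions made for this example.

```php
<?php
// Hypothetical helper: escape plain text only at the point it enters HTML.
function html_out(string $text): string
{
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

$comment = 'if a < b & b > c';   // stored and handled as plain text

echo '<p>', html_out($comment), "</p>\n";

// For the database, the same raw $comment would go through pg_escape_string()
// (or a parameterised query such as pg_query_params()) right at the query,
// never earlier.
```

Because $comment itself is never modified, the less-than sign survives any round trip through the application.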

One thing you can usefully do as a sanitisation measure is to remove any control codes in strings (other than newlines, \n, which you might conceivably want).

$data = preg_replace('/[\x00-\x09\x0B-\x1F\x7F]/', '', $data); // control codes are \x00-\x1F and \x7F; \x0A (newline) is kept
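Wrapped in a function, the same idea might look like the sketch below. The function name strip_control_codes is made up here, and the character class runs through \x1F so that the whole C0 control range is covered. Note that tabs and carriage returns are removed too; only \n is kept.

```php
<?php
// Hypothetical helper around the control-code filter.
function strip_control_codes(string $text): string
{
    // Remove C0 controls (0x00-0x1F) except newline (0x0A), plus DEL (0x7F).
    // None of these byte values occur inside multi-byte UTF-8 sequences,
    // so a byte-wise match is safe on UTF-8 input.
    return preg_replace('/[\x00-\x09\x0B-\x1F\x7F]/', '', $text);
}

$dirty = "line one\x00\nline\x1B two\x7F";
echo strip_control_codes($dirty), "\n";
```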

bobince 2009-02-27 20:55:58

This approach, in addition to some str_replace calls (for special cases), has worked pretty well. It's not flawless but definitely better. Thank you.

Nicholas Kreidberg 2009-03-10 17:03:42

Answer 3

A:

Zend Framework's Zend_Filter and Zend_Filter_Input have very good tools for this.

raspi 2009-02-28 08:34:17

Answer 4

A:

I finally came up with a routine for replacing these characters. It took parsing some of the problematic strings one byte at a time and printing the ordinal (decimal) value of each byte. In doing so I learned that smart quote characters come back as sets of 3 byte values. Here is the routine I used to parse the string:

$str = "string_with_smart_quote_chars";

$ilen = strlen($str); // length in bytes, not characters
$sords = '';          // collects the decimal value of each byte

echo "$str\n\n";

for ($i = 0; $i < $ilen; $i++)
{
    $sords .= ord($str[$i])."  ";
}

echo "$sords\n\n";

Here are the str_replace() calls to "fix" the string:

$str = str_replace(chr(226).chr(128).chr(156), '"', $str); // start quote
$str = str_replace(chr(226).chr(128).chr(157), '"', $str); // end quote
$str = str_replace(chr(226).chr(128).chr(153), "'", $str); // for single quote

I am going to continue building up an array of these search/replacements which I am sure will continue to grow with the increasing use of these types of characters.
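One way such a growing list could be kept in a single place is a lookup array passed to strtr(). This is only a sketch: the function name normalize_smart_quotes is invented for the example, and the entry for the opening single quote (U+2018) is an addition that is not in the three calls above.

```php
<?php
// Hypothetical helper: map UTF-8 smart quotes to plain ASCII quotes.
function normalize_smart_quotes(string $text): string
{
    static $map = [
        "\xE2\x80\x9C" => '"',   // U+201C left double quote  (226 128 156)
        "\xE2\x80\x9D" => '"',   // U+201D right double quote (226 128 157)
        "\xE2\x80\x98" => "'",   // U+2018 left single quote  (added for this sketch)
        "\xE2\x80\x99" => "'",   // U+2019 right single quote (226 128 153)
    ];

    // strtr() replaces every key it finds in a single pass over the string.
    return strtr($text, $map);
}

echo normalize_smart_quotes("\xE2\x80\x9CIt\xE2\x80\x99s fine\xE2\x80\x9D"), "\n";
```

New replacements then only need one more line in the array.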

I know that there are some canned routines for replacing these but I had no luck with any of them on the Solaris 10 platform that my scripts are running on.

-- Nicholas

Nicholas Kreidberg 2009-04-03 20:15:05
