I am creating a web-based application using PHP and MySQL. I want it to be able to save any characters a user enters, both English and non-English (such as Arabic or Japanese), at the same time.

What should I do to achieve that?

+1  A: 

For starters, make sure that you read up on SQL injection. You would need to take strong precautions so that you safely encode the input. Usually, you'd be filtering/discarding unsafe content. So if you really need to allow it, then you need to be careful that you don't make it easy to hack yourself.

Essentially, you need the same sort of protection that sites like this one use, which still allow "dangerous" content such as source code examples. The same applies to commonly targeted systems such as PHPBB2, WordPress, wikis, etc.

I think your task is harder if the data needs to be searchable.

If you are using PHP, the mysql_real_escape_string() function looks good: http://www.tizag.com/mysqlTutorial/mysql-php-sql-injection.php Otherwise, use something similar.
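
As a rough sketch of the same idea in newer PHP: the old mysql_* extension (including mysql_real_escape_string()) was removed in PHP 7, so this swaps in a PDO prepared statement, which is a different technique from escaping. The helper name save_comment(), the table comments, and the column body are made up for illustration, as are the connection details in the commented-out usage.

```php
<?php
// Hypothetical helper: `comments` and `body` are illustrative names,
// not taken from the original question.
function save_comment(PDO $pdo, string $body): void
{
    // The placeholder keeps user input out of the SQL text entirely,
    // so quotes and non-English characters need no manual escaping.
    $stmt = $pdo->prepare('INSERT INTO comments (body) VALUES (:body)');
    $stmt->execute([':body' => $body]);
}

// Usage against MySQL (assumed host, database, and credentials):
// $pdo = new PDO('mysql:host=localhost;dbname=app;charset=utf8mb4', 'user', 'secret');
// save_comment($pdo, "مرحبا'; DROP TABLE comments; --");
```

With a bound parameter the text is stored exactly as typed, including the quote and the would-be injection, rather than being filtered or discarded.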

Chris Thornton
Thanks for the heads up, Chris. What about the character problem? Currently it displays ??? when a user enters non-English characters.
poer
@poer - you definitely need to use Unicode. The other responses are spot-on, especially the one linking to Joel's article.
Chris Thornton
+2  A: 

You need to use Unicode. Read the MySQL manual section on Unicode and Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

You'll likely want to set the character set (encoding) of the table/columns in question to utf8. You'll also need to set the encoding of your HTML/PHP files to UTF-8. You can do this with a meta tag in <head>:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

You can also set the Content-Type: header that Apache/PHP sends out.
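
A minimal sketch of sending that header from PHP, assuming the script has not produced any output yet. The helper h() is a hypothetical name, added only to show that output escaping can be told the text is UTF-8 as well.

```php
<?php
// Must run before any output is sent; PHP cannot change headers afterwards.
header('Content-Type: text/html; charset=utf-8');

// Hypothetical helper: escape user text for HTML, treating it as UTF-8.
function h(string $text): string
{
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

echo '<p>' . h('こんにちは <b>') . '</p>';
```

The header and the meta tag should name the same charset; if they disagree, browsers generally trust the HTTP header.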

Even after setting this, you may still run into browser-specific issues. For example, Internet Explorer may not always use UTF-8, so Rails 3 had to put in a workaround.

Paul Schreiber
thank you paul.
poer
+2  A: 

For MySQL, you first need to define your data with the UTF8 character set:

CREATE DATABASE xx [...] DEFAULT CHARACTER SET 'utf8' DEFAULT COLLATE utf8_general_ci

And when creating database connections from PHP, you just need to run a quick command after opening it:

SET NAMES 'utf8'
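
Here is one way that could look from PHP, as a sketch rather than the only way to do it. It assumes PDO with the MySQL driver; the function names, host, database name, and credentials are placeholders. It also uses utf8mb4 instead of utf8, because MySQL's utf8 stores at most three bytes per character while utf8mb4 (MySQL 5.5.3 and later) covers all of Unicode; swap in 'utf8' if your server is older.

```php
<?php
// Hypothetical helper: builds a PDO DSN that asks for a UTF-8 connection.
function mysql_utf8_dsn(string $host, string $dbname, string $charset = 'utf8mb4'): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=%s', $host, $dbname, $charset);
}

// Hypothetical helper: opens the connection and runs SET NAMES right away.
function connect_utf8(string $host, string $dbname, string $user, string $pass): PDO
{
    $pdo = new PDO(mysql_utf8_dsn($host, $dbname), $user, $pass, [
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
    ]);
    // Same effect as the SET NAMES statement from the answer. It is
    // redundant when the DSN already carries charset=..., but harmless.
    $pdo->exec("SET NAMES 'utf8mb4'");
    return $pdo;
}

// Usage (placeholder credentials):
// $pdo = connect_utf8('localhost', 'app', 'user', 'secret');
```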

Alternatively, if you have access to MySQL's my.ini, you can just add this to the config and forget the above:

skip-character-set-client-handshake
collation_server=utf8_unicode_ci
character_set_server=utf8

(note that's not php.ini, but MySQL's ini)


For PHP, if you need to manipulate multibyte strings: make sure you have the mbstring library active, and then change your string & regexp function calls to use the mb_* equivalent.
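
A small illustration of why the mb_* versions matter, assuming the mbstring extension is loaded and the source file is saved as UTF-8. The sample string is arbitrary.

```php
<?php
// Tell mbstring that strings are UTF-8 unless stated otherwise.
mb_internal_encoding('UTF-8');

$text = '日本語テキスト'; // 7 characters, 3 bytes each in UTF-8

echo strlen($text), "\n";          // 21: strlen() counts bytes
echo mb_strlen($text), "\n";       // 7: mb_strlen() counts characters
echo mb_substr($text, 0, 3), "\n"; // 日本語: cuts on character boundaries
```

Plain substr() on the same string would cut by bytes and can leave half a character behind, which is one common source of garbled output.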

Also, make sure your editor is saving in UTF8 so everything's consistent. Eclipse/PDT makes it easy, at least (project -> properties -> text file encoding).


Finally, handling cultural differences: that's the hard part. Sometimes it's as easy as setting p { direction: rtl; } in CSS, and other times you'll be tearing your hair out trying to decipher what alphabet(s) a user just posted with. It depends on what you're doing with the different languages.
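
For the "what alphabet did they just post" part, one possible starting point is PCRE's Unicode script properties, which preg_match() supports with the /u modifier. This is only a sketch: detect_scripts() is a hypothetical helper, and the list of scripts is an arbitrary selection that you would extend for your own needs.

```php
<?php
// Hypothetical helper: reports which of a few scripts appear in a UTF-8 string.
function detect_scripts(string $text): array
{
    $scripts = [
        'Latin'    => '/\p{Latin}/u',
        'Arabic'   => '/\p{Arabic}/u',
        'Han'      => '/\p{Han}/u',
        'Hiragana' => '/\p{Hiragana}/u',
        'Katakana' => '/\p{Katakana}/u',
    ];

    $found = [];
    foreach ($scripts as $name => $pattern) {
        // preg_match() returns 1 when at least one character matches.
        if (preg_match($pattern, $text) === 1) {
            $found[] = $name;
        }
    }
    return $found;
}

print_r(detect_scripts('Hello مرحبا'));
```

A result containing 'Arabic' could then be used to decide whether to add direction: rtl to the element that displays the text.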

tadamson
thanks tadamson.
poer
poer
Generally the data will be converted to UTF8. If you were switching to say, US-ASCII, conversion might be an issue, but UTF8 can handle most anything you throw at it.
tadamson