My next web application project will make extensive use of Unicode. I usually use PHP and CodeIgniter; however, Unicode is not one of PHP's strong points.

Is there a PHP tool out there that can help me get Unicode working well in PHP?

Or should I take the opportunity to look into alternatives such as Python?

+4  A: 

PHP can handle Unicode fine once you make sure to encode and decode on entry and exit. If you are storing in a database, ensure that the encodings and charset settings match up between the HTML pages, the web server, your editor, and the database.

If the whole application uses UTF-8 everywhere, decoding is not necessary. The only time you need to decode is when you are outputting data to a non-web destination that expects another charset. When outputting HTML, you can use

htmlentities($var, ENT_QUOTES, 'UTF-8');

to get the correct output. The plain call, without the charset argument, will destroy the string in most cases (before PHP 5.4 it assumed ISO-8859-1). The same goes for the mail functions.
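As a minimal sketch of the call above (the sample string and variable names are made up for illustration, and the bytes are written as escapes so the result does not depend on how the file is saved):

```php
<?php
// Hypothetical user input: "Café <b>"naïve"</b>" written as explicit UTF-8 bytes.
$var = "Caf\xC3\xA9 <b>\"na\xC3\xAFve\"</b>";

// Passing the charset explicitly tells htmlentities() how to read the bytes,
// so multibyte characters become single entities instead of garbage.
$safe = htmlentities($var, ENT_QUOTES, 'UTF-8');

echo $safe, "\n";
// Caf&eacute; &lt;b&gt;&quot;na&iuml;ve&quot;&lt;/b&gt;
```

If you only need to neutralise markup and the page is already served as UTF-8, htmlspecialchars() with the same arguments leaves the accented characters as they are.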

http://developer.loftdigital.com/blog/php-utf-8-cheatsheet is a very good resource for working in UTF-8

Ryaner
Actually you shouldn't encode/decode anything. It's much better to use the same charset throughout the application. You just have to pick one that can represent Unicode (such as UTF-8).
troelskn
@troelskn: this still means that you have to check whether all incoming data is actually in this encoding (and usually you can't guarantee that).
Joachim Sauer
It's safe to assume that browsers send data back in the same encoding as the page was served in. I don't think that's what Ryaner meant though (e.g. there is no reason to decode anything on output). In any case it's orthogonal to which encoding you use (e.g. it's also true if you use a single-byte encoding).
troelskn
+1  A: 

One of the major features of PHP 6 will be tightly integrated Unicode support.

Implementing UTF-8 in PHP 5: since PHP strings are byte-oriented, the only practical encoding scheme for Unicode text is UTF-8. The tricks (taken from PHP Architect magazine) are:

  • Present HTML pages in UTF-8
  • Convert PHP scripts to UTF-8
  • Convert the site content, back-end databases and the like to UTF-8
  • Ensure that no PHP functions corrupt the UTF-8 text
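A rough sketch of what the first and last points can look like in code, assuming the mbstring extension is loaded. The helper name is_utf8 is hypothetical, not something from the magazine article:

```php
<?php
// Declare UTF-8 to the browser. Headers must go out before any page output.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}

// Make the mb_* functions default to UTF-8.
mb_internal_encoding('UTF-8');

// Hypothetical helper: reject input that is not valid UTF-8 before it
// reaches string functions or the database.
function is_utf8(string $value): bool
{
    return mb_check_encoding($value, 'UTF-8');
}

$ok  = is_utf8("caf\xC3\xA9"); // valid two-byte sequence for "é"
$bad = is_utf8("caf\xE9");     // a lone ISO-8859-1 "é" byte is not valid UTF-8
```

The database connection needs the same treatment. With MySQL through mysqli, for example, that is normally a call to set_charset() right after connecting.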

Check out http://www.gravitonic.com/talks/
PHP UTF 8 Cheat Sheet

Webrsk
A: 

PHP is mostly unaware of charsets and treats strings as byte streams. That's not much of a problem really, but you'll have to do a bit of work yourself.

The general rule of thumb is that you should use the same charset everywhere. If you use UTF-8 everywhere, then you're 99% there. Just make sure that you don't mix charsets, because then it gets really complicated. The only thing that won't work correctly with UTF-8 is string manipulation that needs to operate at the character level, e.g. strlen, substr etc. You should use UTF-8-aware versions in place of those. The multibyte-string extension (mbstring) gives you just that.

For a checklist of places where you need to make sure the charset is set correctly, look at:
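To illustrate the difference, here is a small sketch that assumes the mbstring extension is loaded. The sample string is written with byte escapes so it does not depend on how the file is saved:

```php
<?php
// "héllo": 5 characters, but 6 bytes in UTF-8 because "é" takes two bytes.
$s = "h\xC3\xA9llo";

$bytes = strlen($s);             // 6, counts bytes
$chars = mb_strlen($s, 'UTF-8'); // 5, counts characters

// The byte-based substr() cuts through the middle of "é"...
$broken = substr($s, 0, 2);            // "h" plus the first byte of "é"
// ...while mb_substr() cuts on character boundaries.
$first2 = mb_substr($s, 0, 2, 'UTF-8'); // "hé"

var_dump(mb_check_encoding($broken, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($first2, 'UTF-8')); // bool(true)
```

Calling mb_internal_encoding('UTF-8') once at startup lets you drop the explicit 'UTF-8' argument from each call.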

http://developer.loftdigital.com/blog/php-utf-8-cheatsheet

For more information, look at:

http://www.phpwact.org/php/i18n/utf-8

troelskn