I have been building on this system for quite some time. It currently outputs Latin1 (ISO-8859-1) to the web browser, and these are the components:

MySQL - all data is stored with the Latin1 character set

PHP - All PHP text files are stored on disk with Latin1 encoding

HTML - The output has the http-equiv="content-type" content="text/html; charset=iso-8859-1" meta tag

So, I'm trying to understand how the encoding of the different parts comes into play in my workflow. If I open a PHP script, change its encoding to UTF-8 in the text editor, save it back to disk and reload the web browser, the text is all messed up, unless the text comes from the DB. If I change the encoding of the DB to UTF-8 and keep the PHP files in Latin1, I have to use utf8_decode() for the data to display correctly. And if I change the charset in the HTML meta tag, the browser will read the page incorrectly.

So yeah, I realise that if I want to "upgrade" to UTF8, I have to update all three parts of this setup for it to work correctly, but since it's a huge system with some 180k lines of PHP code and millions of posts in a lot of databases/tables, I don't want to start something like this without understanding everything correctly.

What haven't I thought about? What could mess this up beyond fixing? What are the procedures for changing the encoding of an entire MySQL installation and what's the easiest way to change the encoding of hundreds or thousands of PHP files on disk?

The META tag is luckily added dynamically, so I'll change that in one place only :)

Let me hear about your experiences with this.

+1  A: 

It's tricky.

You have to:

  • change the DB and every table character set/encoding – I don't know much about MySQL, but see here
  • set the client encoding to UTF-8 in PHP (SET NAMES UTF8) before the first query
  • change the meta tag and possibly the Content-type header (note that the Content-type header takes precedence)
  • convert all the PHP files to UTF-8 w/out BOM – you can easily do that with a loop and iconv.
  • the trickiest of all: you have to change most of your string function calls. That means mb_strlen instead of strlen, mb_substr instead of substr and $str[index], etc.
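
The file-conversion step above can be sketched in PHP itself. This is only a rough sketch, assuming PHP 7+ with the iconv and mbstring extensions; the function name `convert_file_to_utf8` is made up for illustration. It skips files that already validate as UTF-8, which is a heuristic to avoid double-encoding, not a guarantee.

```php
<?php
// Hypothetical helper, not part of the original answer.
// Rewrites one file from ISO-8859-1 to UTF-8 (iconv emits no BOM).
// Returns true if the file was rewritten, false if it was skipped.
function convert_file_to_utf8(string $path): bool
{
    $bytes = file_get_contents($path);
    if ($bytes === false) {
        throw new RuntimeException("Cannot read $path");
    }

    // Heuristic: pure ASCII and already-converted files are valid UTF-8,
    // so leave them alone instead of encoding them a second time.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return false;
    }

    $utf8 = iconv('ISO-8859-1', 'UTF-8', $bytes);
    if ($utf8 === false) {
        throw new RuntimeException("Conversion failed for $path");
    }

    return file_put_contents($path, $utf8) !== false;
}
```

Calling this for every `*.php` path in the project (after taking a backup) gives the loop mentioned above; running it twice is harmless because converted files are skipped.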
Artefacto
DB - check, client encoding - you mean when interfacing with the MySQL server through PHP? meta tag - check, PHP files - check, PHP functions... Uh, ok. While I don't use strlen and substr all that much - what about that $str[index]? Do you mean that while writing in a UTF8-encoded PHP file, I can't write <? print $foo["Översrift"] ?> Presumably, the string is sent to the PHP interpreter as UTF8 data and the saved indexed data should be identical, no?
Sandman
As long as no data comes from elsewhere, $foo["Översrift"] would indeed keep working, provided all files are converted to UTF-8.
Wrikken
@Sandman yes, I mean when interfacing with the MySQL server through PHP. What I mean by `$str[index]` is stuff like `$str[0]` (index is an integer). For instance, you cannot use `$str[0]` to get the first character because UTF-8 is a multi-byte encoding; if the first character takes more than 1 byte (which is the case for all non-ASCII characters), `$str[0]` will get only the first byte of the character. There are many other cases: the majority of functions that operate on strings will have to be modified.
Artefacto
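
To make the `$str[0]` point concrete, here is a small sketch, assuming the mbstring extension is loaded. The two function names are made up for illustration, and the test string is written with byte escapes so the example does not depend on how this snippet itself is saved.

```php
<?php
// Hypothetical helpers for illustration only.

// Byte-wise access: returns the first BYTE of the string.
function first_byte(string $s): string
{
    return $s[0];
}

// Character-wise access: returns the first CHARACTER of a UTF-8 string.
function first_char(string $s): string
{
    return mb_substr($s, 0, 1, 'UTF-8');
}

// "Över" encoded as UTF-8: "Ö" takes two bytes (0xC3 0x96).
$word = "\xC3\x96ver";
```

With `$word` as above, `first_byte($word)` yields the lone byte `0xC3`, which is not a valid character on its own, while `first_char($word)` yields the full two-byte "Ö".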
Right, then I'm with you. I'd never use $str[index] that way :)
Sandman
A: 

Don't convert to UTF8 if you don't have to. It's not worth the trouble.
UTF8 is (becoming) the new standard, so for new projects I can recommend it.

Functions
Certain function calls don't work anymore. For latin1 it's:

 echo htmlentities($string);

For UTF8 it's:

 echo htmlentities($string, ENT_COMPAT, 'UTF-8');

strlen(), substr(), etc. aren't aware of multibyte characters.
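
A short sketch of what "not aware" means in practice, assuming the mbstring extension is available. The helper names `utf8_lengths` and `utf8_truncate` are invented for this example.

```php
<?php
// Hypothetical helpers for illustration only.

// Compare the byte count (strlen) with the character count (mb_strlen).
function utf8_lengths(string $utf8): array
{
    return [
        'bytes' => strlen($utf8),
        'chars' => mb_strlen($utf8, 'UTF-8'),
    ];
}

// Truncate by characters, so a multibyte character is never cut in half.
function utf8_truncate(string $utf8, int $maxChars): string
{
    return mb_substr($utf8, 0, $maxChars, 'UTF-8');
}
```

For the UTF-8 string `"\xC3\xABl"` ("ël"), `strlen` counts three bytes while `mb_strlen` counts two characters, and a plain `substr($s, 0, 1)` would leave half of the "ë" behind.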

MySQL
mysql_set_charset('UTF8') or mysql_query('SET NAMES UTF8') will convert all text coming from the database (SELECTs) to UTF8. It will also convert incoming strings (INSERT, UPDATE) from UTF8 to the encoding of the table.

So for reading from a latin1 table it's not necessary to convert the table encoding.
But certain characters are only available in unicode (like the snowman ☃, iPhone emoticons, etc) and can't be converted to latin1. (The data will be truncated)
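
One way to find out ahead of time whether a UTF-8 string would survive being stored in a latin1 column is a round-trip check. This is a sketch only, assuming the mbstring extension with its default substitution character; `fits_in_latin1` is a made-up name, and it checks against ISO-8859-1, which is close to but not identical to MySQL's latin1 (cp1252).

```php
<?php
// Hypothetical helper for illustration only.
// Converts UTF-8 -> ISO-8859-1 -> UTF-8 and checks that nothing was lost.
function fits_in_latin1(string $utf8): bool
{
    // Characters without a latin1 equivalent are replaced by a
    // substitute character during this step...
    $latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

    // ...so converting back no longer reproduces the original string.
    $back = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

    return $back === $utf8;
}
```

Here "ë" (`"\xC3\xAB"`) passes the check, while the snowman (`"\xE2\x98\x83"`) does not.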

Scripts
I try to avoid special characters in my PHP scripts / templates.
I use the &euml; notation instead of ë etc. This way it doesn't matter if the file is saved in latin1 or utf8.

Bob Fanger
MySQL tables would not have to be converted as long as what you're saving is available in their current character set. However, if it's not (and that's no small possibility when going latin1 => utf8), they should be converted (ALTER TABLE foo CONVERT TO CHARACTER SET utf8), possibly columns by themselves if they have been separately set.
Wrikken
No, if you change the encoding for the connection the mysql server/client will convert it on-the-fly.
Bob Fanger
I use it if I need to generate an MS Excel CSV file. Tables are in UTF8 and after a `SET NAMES latin1` I can write to the CSV file without a single utf8_decode()
Bob Fanger
@Bob Fanger: think about writes to table, not reads. Yes, conversion is attempted, but putting utf-8 in latin1 is simply not always possible, or am I mistaken? If the character sets overlap 100%, why use the one over the other?
Wrikken
@Wrikken You're not mistaken. Obviously you cannot put in a latin1 column characters that are not in latin1 like ى.
Artefacto
@Artefacto @Wrikken Valid point, I updated the answer.
Bob Fanger