About 2 years ago I made the mistake of starting a large website using ISO-8859-1. I am now having issues with some characters, especially when sending data to the server using Ajax. Because of this, I would like to switch to UTF-8.

What issues do you see coming from this? I know I would have to search the site for characters that now show up as ? and replace them with the real characters. But are there any other risks in doing this? Has anyone done this before?

+3  A: 

The main difficulty is making sure all the data paths are UTF-8 clean:

  1. Is your site DB-backed? If so, you'll need to convert all the tables to UTF-8 or some other Unicode encoding, so sorting and text searching work correctly.

  2. Is your site using some programming language for dynamic content? (PHP, mod_perl, ASP...?) If so, you'll have to make sure the particular language interpreter you're using fully understands some form of Unicode, work out the conversions if it isn't using UTF-8 natively -- UCS-2 is next most common -- and check that it's configured to use UTF-8 on its output to the web server.

  3. Does your site have some kind of back-end app server? Does it use UTF-8?

  4. EDIT: There are at least three different places you can declare the charset for a web document. Be sure you change them all (there's a short sketch after this list):

    • the HTTP Content-Type header
    • the <meta http-equiv="Content-Type"> tag in your documents' <head>
    • the <?xml?> declaration at the top of the document, if using XHTML Strict
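
For points 2 and 4, here's a minimal PHP sketch of a page that declares UTF-8 everywhere it can. It assumes a hand-written page rather than whatever your CMS emits, so treat it as an illustration, not a drop-in fix:

    <?php
    // Point 2: make sure PHP itself emits UTF-8 (older installs often
    // default to ISO-8859-1).
    ini_set('default_charset', 'UTF-8');
    mb_internal_encoding('UTF-8');

    // Point 4: the HTTP Content-Type header...
    header('Content-Type: text/html; charset=utf-8');

    // Point 4: ...the XML declaration, only needed if you serve XHTML.
    // Echoed from PHP so it can't collide with short_open_tag.
    echo '<?xml version="1.0" encoding="utf-8"?>' . "\n";
    ?>
    <html>
    <head>
      <!-- Point 4: ...and the meta tag, which should agree with the header -->
      <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
      <title>UTF-8 test page</title>
    </head>
    <body>
      <p>Héllo wörld: if this renders as typed, the output path is clean.</p>
    </body>
    </html>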

All this comes from my experiences a few years ago when I traced some Unicode data through a moderately complex N-tier app, and found conversion chains like:

Latin-1 -> UTF-8 -> Latin-1 -> UTF-8

Much of this was due to the less mature Unicode support at the time, but you can still find yourself messing with ugliness like this if you're not careful to make the pipeline UTF-8 clean.
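
To make that concrete, here's a small PHP illustration (a demonstration only, not part of the original answer) of what happens when one hop in such a chain guesses wrong and re-converts a string that is already UTF-8 as if it were Latin-1:

    <?php
    // "é" in UTF-8 is the two bytes 0xC3 0xA9.
    $already_utf8 = "café";

    // A careless layer assumes its input is Latin-1 and converts it again.
    $double_encoded = mb_convert_encoding($already_utf8, 'UTF-8', 'ISO-8859-1');

    echo $already_utf8, "\n";    // café
    echo $double_encoded, "\n";  // cafÃ© -- the classic double-encoding signature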

As for your comments about searching out Latin-1 characters and converting files one by one, I wouldn't do that. I'd build a script around the iconv utility found on every modern Linux system, feeding in every text file in your system, explicitly converting it from Latin-1 to UTF-8. Leave no stone unturned.
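
If it helps, here's a rough sketch of that idea using PHP's iconv() binding instead of the command-line utility. The path and extension list are invented, and a file that's already UTF-8 would get double-encoded, so run it on a copy of the tree and keep track of what you've converted:

    <?php
    // Recursively convert every template/source file under $root from
    // Latin-1 to UTF-8. Work on a copy, not the live site.
    $root = '/var/www/mysite';                       // hypothetical path
    $extensions = array('php', 'html', 'tpl', 'css', 'js');

    $it = new RecursiveIteratorIterator(new RecursiveDirectoryIterator($root));
    foreach ($it as $file) {
        if (!$file->isFile()) {
            continue;
        }
        $ext = strtolower(pathinfo($file->getPathname(), PATHINFO_EXTENSION));
        if (!in_array($ext, $extensions, true)) {
            continue;
        }
        $latin1 = file_get_contents($file->getPathname());
        $utf8   = iconv('ISO-8859-1', 'UTF-8', $latin1);
        if ($utf8 === false) {
            fwrite(STDERR, "Conversion failed: " . $file->getPathname() . "\n");
            continue;
        }
        file_put_contents($file->getPathname(), $utf8);
        echo "Converted: " . $file->getPathname() . "\n";
    }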

Warren Young
We are using a CMS, written in PHP, that handles the encoding. It is running on PostgreSQL. In the CMS, I can just switch the encoding, which will then change the content type headers in all pages...
Nic Hubbard
I'd bet that just changes the charset the CMS declares it's using to mod_php, which controls what Apache reports to the browser. Certainly I wouldn't expect it to magically migrate all the data in your DB. It probably won't convert existing templates the CMS uses to build pages. Bottom line: test, test, test. Put some characters in the DB that come from outside the Latin-1 set, and see if they survive to the browser. If so, then check to be sure you don't have any redundant conversions like I showed above. If not, something is still smashing UTF-8 to Latin-1.
Warren Young
Thought of another risk area. Added it to the numbered list above.
Warren Young
Looks like my DB is encoded as SQL_ASCII. Do I need to change this to UTF-8 as well, or can I leave this?
Nic Hubbard
It's not going to matter from a pure data storage and retrieval standpoint. But, if the CMS relies on the database to do sorting and text searching, it *does* matter that the DB knows the character encoding. Maybe flipping this switch in the CMS updates all the tables for you. Don't count on it: check.
Warren Young
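
To run the round-trip test Warren describes in the comments above, a sketch along these lines could help. It assumes a PDO connection to the PostgreSQL database behind the CMS; the table, column, and credentials are invented:

    <?php
    // Write a string containing characters outside Latin-1, read it back,
    // and compare byte-for-byte.
    $pdo = new PDO('pgsql:host=localhost;dbname=cms', 'cms_user', 'secret');  // hypothetical
    $pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    // If needed, force the session encoding: $pdo->exec("SET client_encoding TO 'UTF8'");

    $probe = "Greek: αβγ, Chinese: 中文, Euro: €";   // none of these fit in Latin-1

    $stmt = $pdo->prepare('INSERT INTO encoding_test (body) VALUES (?)');
    $stmt->execute(array($probe));

    $fetched = $pdo->query('SELECT body FROM encoding_test ORDER BY id DESC LIMIT 1')
                   ->fetchColumn();

    echo ($fetched === $probe)
        ? "Round trip OK\n"
        : "Mangled: got " . bin2hex($fetched) . "\n";

The same probe string is worth checking in the rendered page too, since the browser is the last hop in the chain.
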
+2  A: 

Such a change touches (nearly) every part of your system. You need to go through everything, from the database to the PHP to the HTML to the web browser.

Start a test site and subject it to some serious testing (various browsers on various platforms doing various things).

IMO it's important to actually get familiar with UTF-8 and what it means for software. A few quick points:

  • PHP is mostly byte-oriented. Learn the difference between characters and code points and bytes, and between UTF-8 and Unicode (a short sketch below illustrates the byte/character split).
  • UTF-8 is well-designed. For instance, given two UTF-8 strings, a byte-oriented strstr() will still function correctly.
  • The most common problem is treating a UTF-8 string as ISO-8859-1, or vice versa. Documenting what encoding each of your functions expects makes these errors less likely, and a naming convention for your string variables (indicating their encoding) may also help.
Artelius
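
A small PHP sketch of the byte/character distinction described above (the strings are just examples):

    <?php
    // Byte functions vs. character-aware mb_* functions on a UTF-8 string.
    $s = "naïve";                        // 5 characters, 6 bytes ("ï" is 0xC3 0xAF)

    echo strlen($s), "\n";               // 6 -- strlen() counts bytes
    echo mb_strlen($s, 'UTF-8'), "\n";   // 5 -- mb_strlen() counts characters

    // Byte-oriented searching still works, because the UTF-8 bytes of one
    // character never occur inside the encoding of a different character.
    var_dump(strstr($s, "ïve"));         // string(4) "ïve"

    // But byte-oriented slicing can cut a character in half:
    echo substr($s, 0, 3), "\n";             // "na" plus a lone 0xC3 byte: mojibake
    echo mb_substr($s, 0, 3, 'UTF-8'), "\n"; // "naï"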