About 2 years ago I made the mistake of starting a large website using ISO-8859-1. I am now having issues with some characters, especially when sending data to the server using Ajax. Because of this, I would like to switch to UTF-8.

What issues do you see coming from this? I know I would have to search the site for characters that now show up as ? and replace them with the real characters. But are there any other risks in doing this? Has anyone done this before?

+3  A: 

The main difficulty is making sure all the data paths are UTF-8 clean:

  1. Is your site DB-backed? If so, you'll need to convert all the tables to UTF-8 or some other Unicode encoding, so sorting and text searching work correctly.

  2. Is your site using some programming language for dynamic content? (PHP, mod_perl, ASP...?) If so, you'll have to make sure the particular language interpreter you're using fully understands some form of Unicode, work out the conversions if it isn't using UTF-8 natively -- UCS-2 is next most common -- and check that it's configured to use UTF-8 on its output to the web server.

  3. Does your site have some kind of back-end app server? Does it use UTF-8?

  4. EDIT: There are at least three different places you can declare the charset for a web document. Be sure you change them all (there's a short sketch after this list):

    • the HTTP Content-Type header
    • the <meta http-equiv="Content-Type"> tag in your documents' <head>
    • the <?xml?> declaration at the top of the document, if using XHTML Strict
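
For points 2 and 4, here's a minimal PHP sketch of a page that declares UTF-8 everywhere it can. It assumes a hand-written page rather than whatever your CMS emits, so treat it as an illustration, not a drop-in fix:

    <?php
    // Point 2: make sure PHP itself emits UTF-8 (older installs often
    // default to ISO-8859-1).
    ini_set('default_charset', 'UTF-8');
    mb_internal_encoding('UTF-8');

    // Point 4: the HTTP Content-Type header...
    header('Content-Type: text/html; charset=utf-8');

    // Point 4: ...the XML declaration, only needed if you serve XHTML.
    // Echoed from PHP so it can't collide with short_open_tag.
    echo '<?xml version="1.0" encoding="utf-8"?>' . "\n";
    ?>
    <html>
    <head>
      <!-- Point 4: ...and the meta tag, which should agree with the header -->
      <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
      <title>UTF-8 test page</title>
    </head>
    <body>
      <p>Héllo wörld: if this renders as typed, the output path is clean.</p>
    </body>
    </html>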

All this comes from my experiences a few years ago when I traced some Unicode data through a moderately complex N-tier app, and found conversion chains like:

Latin-1 -> UTF-8 -> Latin-1 -> UTF-8

Much of this was due to the less mature Unicode support at the time, but you can still find yourself messing with ugliness like this if you're not careful to make the pipeline UTF-8 clean.
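
To make that concrete, here's a small PHP illustration (a demonstration only, not part of the original answer) of what happens when one hop in such a chain guesses wrong and re-converts a string that is already UTF-8 as if it were Latin-1:

    <?php
    // "é" in UTF-8 is the two bytes 0xC3 0xA9.
    $already_utf8 = "café";

    // A careless layer assumes its input is Latin-1 and converts it again.
    $double_encoded = mb_convert_encoding($already_utf8, 'UTF-8', 'ISO-8859-1');

    echo $already_utf8, "\n";    // café
    echo $double_encoded, "\n";  // cafÃ© -- the classic double-encoding signature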

As for your comments about searching out Latin-1 characters and converting files one by one, I wouldn't do that. I'd build a script around the iconv utility found on every modern Linux system, feeding in every text file in your system, explicitly converting it from Latin-1 to UTF-8. Leave no stone unturned.
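
If it helps, here's a rough sketch of that idea using PHP's iconv() binding instead of the command-line utility. The path and extension list are invented, and a file that's already UTF-8 would get double-encoded, so run it on a copy of the tree and keep track of what you've converted:

    <?php
    // Recursively convert every template/source file under $root from
    // Latin-1 to UTF-8. Work on a copy, not the live site.
    $root = '/var/www/mysite';                       // hypothetical path
    $extensions = array('php', 'html', 'tpl', 'css', 'js');

    $it = new RecursiveIteratorIterator(new RecursiveDirectoryIterator($root));
    foreach ($it as $file) {
        if (!$file->isFile()) {
            continue;
        }
        $ext = strtolower(pathinfo($file->getPathname(), PATHINFO_EXTENSION));
        if (!in_array($ext, $extensions, true)) {
            continue;
        }
        $latin1 = file_get_contents($file->getPathname());
        $utf8   = iconv('ISO-8859-1', 'UTF-8', $latin1);
        if ($utf8 === false) {
            fwrite(STDERR, "Conversion failed: " . $file->getPathname() . "\n");
            continue;
        }
        file_put_contents($file->getPathname(), $utf8);
        echo "Converted: " . $file->getPathname() . "\n";
    }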

Warren Young
We are using a CMS, written in PHP, that handles the encoding. It is running on PostgreSQL. In the CMS, I can just switch the encoding, which will then change the content type headers in all pages...
Nic Hubbard
I'd bet that just changes the charset the CMS declares it's using to mod_php, which controls what Apache reports to the browser. Certainly I wouldn't expect it to magically migrate all the data in your DB. It probably won't convert existing templates the CMS uses to build pages. Bottom line: test, test, test. Put some characters in the DB that come from outside the Latin-1 set, and see if they survive to the browser. If so, then check to be sure you don't have any redundant conversions like I showed above. If not, something is still smashing UTF-8 to Latin-1.
Warren Young
Thought of another risk area. Added it to the numbered list above.
Warren Young
Looks like my DB is encoded as SQL_ASCII. Do I need to change this to UTF-8 as well, or can I leave this?
Nic Hubbard
It's not going to matter from a pure data storage and retrieval standpoint. But, if the CMS relies on the database to do sorting and text searching, it *does* matter that the DB knows the character encoding. Maybe flipping this switch in the CMS updates all the tables for you. Don't count on it: check.
Warren Young
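
To run the round-trip test Warren describes in the comments above, a sketch along these lines could help. It assumes a PDO connection to the PostgreSQL database behind the CMS; the table, column, and credentials are invented:

    <?php
    // Write a string containing characters outside Latin-1, read it back,
    // and compare byte-for-byte.
    $pdo = new PDO('pgsql:host=localhost;dbname=cms', 'cms_user', 'secret');  // hypothetical
    $pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    // If needed, force the session encoding: $pdo->exec("SET client_encoding TO 'UTF8'");

    $probe = "Greek: αβγ, Chinese: 中文, Euro: €";   // none of these fit in Latin-1

    $stmt = $pdo->prepare('INSERT INTO encoding_test (body) VALUES (?)');
    $stmt->execute(array($probe));

    $fetched = $pdo->query('SELECT body FROM encoding_test ORDER BY id DESC LIMIT 1')
                   ->fetchColumn();

    echo ($fetched === $probe)
        ? "Round trip OK\n"
        : "Mangled: got " . bin2hex($fetched) . "\n";

The same probe string is worth checking in the rendered page too, since the browser is the last hop in the chain.
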
+2  A: 

Such a change touches (nearly) every part of your system. You need to go through everything, from the database to the PHP to the HTML to the web browser.

Start a test site and subject it to some serious testing (various browsers on various platforms doing various things).

IMO it's important to actually get familiar with UTF-8 and what it means for software. A few quick points:

  • PHP is mostly byte-oriented. Learn the difference between characters and code points and bytes, and between UTF-8 and Unicode (a short sketch below illustrates the byte/character split).
  • UTF-8 is well-designed. For instance, given two UTF-8 strings, a byte-oriented strstr() will still function correctly.
  • The most common problem is treating a UTF-8 string as ISO-8859-1, or vice versa. Documenting what encoding each of your functions expects makes these errors less likely, and a naming convention for your string variables (indicating their encoding) may also help.
Artelius
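
A small PHP sketch of the byte/character distinction described above (the strings are just examples):

    <?php
    // Byte functions vs. character-aware mb_* functions on a UTF-8 string.
    $s = "naïve";                        // 5 characters, 6 bytes ("ï" is 0xC3 0xAF)

    echo strlen($s), "\n";               // 6 -- strlen() counts bytes
    echo mb_strlen($s, 'UTF-8'), "\n";   // 5 -- mb_strlen() counts characters

    // Byte-oriented searching still works, because the UTF-8 bytes of one
    // character never occur inside the encoding of a different character.
    var_dump(strstr($s, "ïve"));         // string(4) "ïve"

    // But byte-oriented slicing can cut a character in half:
    echo substr($s, 0, 3), "\n";             // "na" plus a lone 0xC3 byte: mojibake
    echo mb_substr($s, 0, 3, 'UTF-8'), "\n"; // "naï"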