How do you troubleshoot character encoding problems?

Firstly, "ugly no-char boxes" might not be an encoding problem, they might just be a sign you don't have a font installed that can display the glyphs in the page.

Most character encoding problems happen when strings are being passed from one system to another. For webapps, this is usually between the browser and the application, between the application and the filesystem and between the application and the database.

So you need to check where the mis-encoded data is coming from, what character encoding it has at the source, and what encoding it is being received as. The best way is to send through characters you know the system is having problems with, and examine them at each level of the app. What do they look like inside the app? In the database? When you get them back from the database? When they're displayed in the browser?

Sorry to be so general, but the question doesn't give much more to work with.

If the data you send to the browser becomes mangled (moji-bake) you will get trash characters. Also, if you specify the wrong character set in your META headers, your browser will render the page incorrectly, causing moji-bake again, sometimes in random places on the page.

When handling CJK character sets, you must be sure to use UTF8 character encoding throughout the lifetime of your program (data storage, retrieval, data manipulation in your code, displaying in the browsser etc...)

What is UTF8? UTF8 handles binary streams of data, not strings. This means the bit combinations can have variable length. ASCII characters have a fixed length of 8 bits representing 1 byte, however UTF8 characters can be composed of 6bits, 8bits, 12bits, etc... As such, UTF8 is prone to what Japanese call "mojibake".

As a coder, from database to codebase to browser, you should try and use UTF8 completely. For email you can use UTF8, but you will probably find most mail servers and clients are still old and use a mishmash of different character sets (e.g. ISO9022X).

Database Settings If you are a mysql user, then make sure you have to ensure all connections to the DB use UTF8, and that all tables/fields use UTF8. By default mysql uses Latin (Swedish) character sets. Those kooky swedes love their sense of humour!!

Checking your Codebase In my experience editors like Notepad++, Notepad2, UltraEdit, e, etc... all have UTF8 support problems. They mostly work, but since their developers don't use CJK languages themselves, they are not perfected. Issues like turning off BOM (Byte Order Mark), mangled tabs, poor character set conversion, etc ... all present problems.

I highly recommend using a proven UTF8 editor like Maruo. This is made by a Japanese company, but there is an English version (and a trial version) at http://hide.maruo.co.jp/software/bin/maruo614_signed.exe

Lastly, you may need to convert your source files into UTF8. Especially if the codebase itself has CJK language strings contained therein.

Manipulating Strings Any string function need to multibyte safe. Notice I didn't say double-byte. UTF8 is not a double byte but multibyte, depending on the total number of bits used to represent a character. In PHP you need to call the MB string functions specifically. Ruby and other languages have more transparent support, but you need to check the docs for your flavour of application server!

META Tags Check out google.co.jp or yahoo.co.jp for their META headers. These are sites that know how to to it properly. Basically include the following META tag the doucment <HEAD>

It is usually safe to mix English HTML document type attributes with the above character too. So adding the META tag above seems to work in a HTML document that has:

Email This is a wholly different can of worms. UTF8 works a lot, but many older Japanese clients use ISO2022X more. This is not worth covering here.

Debugging UTF8 Issues Once you have a reliable UTF8 editor like Maruo, you can create static pages and resolve your issues.

Hope that helps

ansaurus

tags:

views:

answers:

How do you troubleshoot character encoding problems?

related questions