ansaurus

Question

How to test an application for correct encoding (e.g. UTF-8)

Answer 1

+2 A:

Localization is pretty tough.

I think you are really asking two questions. One of them, how to get everybody to work correctly on an i18n application, is not a technical issue but a project management one, in my opinion. If you want people to use a common standard, like UTF-8, then you will simply have to enforce it. Tools will help, but people will first need to be told to do so.

Besides saying that UTF-8 is in my opinion the way to go, it is hard to give an answer to the questions about tools. It really depends on the kind of project you are doing. If it is a Java project, for example, then it is a simple matter of configuring the IDE to encode files in UTF-8 and making sure your UTF-8 localizations are in external resource files.

One thing you can certainly do is to make unit tests that check compliance. If your localized messages/labels are in resource files then it is faily easy to check if they are properly UTF-8 encoded I think.
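A minimal sketch of such a compliance check, written in Python purely for illustration. The helper name find_invalid_utf8 and the sample resource names are hypothetical; the check relies only on strict decoding, so it tells you whether the bytes are well-formed UTF-8, not whether they contain the text you intended.

```python
def find_invalid_utf8(resources):
    """Return {name: problem description} for each resource that is not valid UTF-8.

    `resources` maps a (hypothetical) resource name to its raw bytes,
    e.g. the result of reading each resource file in binary mode.
    """
    problems = {}
    for name, raw in resources.items():
        try:
            raw.decode("utf-8", errors="strict")
        except UnicodeDecodeError as exc:
            problems[name] = f"byte offset {exc.start}: {exc.reason}"
    return problems


# Stand-in data: one resource saved as UTF-8, one saved as ISO-8859-1 by mistake.
sample = {
    "labels_de.txt": "Gr\u00fc\u00dfe".encode("utf-8"),
    "labels_fr.txt": "caf\u00e9".encode("iso-8859-1"),
}

for name, problem in find_invalid_utf8(sample).items():
    print(f"{name}: {problem}")
```

In a unit test you would assert that the returned dictionary is empty for the real resource files.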

St3fan 2009-01-25 20:35:16

You're right - it's multiple questions at once. Mainly because I haven't found out how to really tackle the problem (other than just "making no mistakes"...) I'm looking for any tools for my toolbox to help in current and future projects.

Olaf 2009-01-25 21:27:38

plus - your one typo describes the situations I've experienced best: "it's *faily* easy to check..." I like that, it really has some truth in it ;-)

Olaf 2009-01-25 21:30:00

Answer 2

+1 A:

There is a regular expression to test if a string is valid UTF-8:

$field =~
  m/\A(
     [\x09\x0A\x0D\x20-\x7E]            # ASCII
   | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
   |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
   | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
   |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
   |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
   | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
   |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
  )*\z/x;

But this doesn’t ensure that the text actually is UTF-8.

An example: the letter ö (U+00F6) is encoded in UTF-8 as the byte sequence 0xC3 0xB6.
So when you get 0xC3B6 as input, you can say that it is valid UTF-8. But you cannot say for sure that the letter ö was submitted.
Imagine that ISO 8859-1 was used instead of UTF-8. There, the sequence 0xC3B6 represents the characters Ã (0xC3) and ¶ (0xB6), respectively.
So the sequence 0xC3B6 can represent either ö in UTF-8 or Ã¶ in ISO 8859-1 (although the latter is rather unusual).

So in the end it’s only guessing.
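The ambiguity can be reproduced in a few lines. This is a small Python illustration (not the Perl from above), using only the standard codecs: the same two bytes decode without error under both encodings, so validity alone cannot tell you which one the sender meant.

```python
raw = bytes([0xC3, 0xB6])

as_utf8 = raw.decode("utf-8")         # one character: U+00F6
as_latin1 = raw.decode("iso-8859-1")  # two characters: U+00C3 and U+00B6

# Both decodes succeed; only the interpretation differs.
print(len(as_utf8), [f"U+{ord(c):04X}" for c in as_utf8])
print(len(as_latin1), [f"U+{ord(c):04X}" for c in as_latin1])
```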

Gumbo 2009-01-25 20:48:19

Wow - this is the least expected angle to tackle the problem. I'm impressed. Also, ¶ is among the characters most easily recognized as an encoding error.

Olaf 2009-01-25 21:33:09

Answer 3

+1 A:

The real troublemaker with character encoding is quite often that there are multiple encoding-related bugs and that some incorrect behavior has been introduced because of other bugs. I have no count of how many times I have seen this happen.

The goal, as always, is to handle it correctly in every single place. So most of the time simple unit tests can do the trick; they don't even need very complex character sets. I find all our bugs just by testing with our national character "ø", because it maps differently in UTF-8 and most of the other character sets.

The aggregate works fine when all the pieces do it correctly. I know this sounds trivial, but when it comes to character set issues it's always worked for me ;)
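To illustrate why a single character like "ø" is enough to expose a mismatch, here is a small Python sketch. The helper survives_round_trip is a hypothetical name standing in for "write with one encoding, read with another", which is what a mismatched pair of components effectively does.

```python
probe = "\u00f8"  # ø

# The same character has a different byte representation in each encoding.
for name in ("utf-8", "iso-8859-1", "utf-16-le"):
    raw = probe.encode(name)
    print(f"{name:12}", " ".join(f"{b:02x}" for b in raw))


def survives_round_trip(text, write_encoding, read_encoding):
    """True if text written in one encoding reads back unchanged in another."""
    try:
        return text.encode(write_encoding).decode(read_encoding) == text
    except UnicodeError:
        return False


print(survives_round_trip(probe, "utf-8", "utf-8"))
print(survives_round_trip(probe, "utf-8", "iso-8859-1"))
print(survives_round_trip("a", "utf-8", "iso-8859-1"))
```

The last call is the reason ASCII-only test data finds nothing: plain ASCII reads back unchanged even when the two sides disagree.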

krosenvold 2009-01-25 21:06:02

This is our company talk - "As soon as you're doing it right - problems go away". :) How do you make sure that a test for "ö" in UTF-8 doesn't still pass when it actually tests for, say, "Ã¶" in ISO-8859-1? I.e. assertEquals("ö","ö") becomes assertEquals("Ã¶","Ã¶"), figuratively.

Olaf 2009-01-25 21:39:15

You assert with the \u escape sequence vs the non-escaped character
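A rough sketch of that idea, in Python rather than Java. The function to_wire is a hypothetical stand-in for whatever code is under test. The escape is plain ASCII in the source file, so it names the code point no matter how the file itself was saved or read, while the literal depends on the file's encoding.

```python
def to_wire(text):
    """Hypothetical stand-in for the code under test: encode text for output."""
    return text.encode("utf-8")


expected = "\u00f6"  # escape sequence: independent of the source file's encoding

# Compare against explicit bytes, not against another literal.
assert to_wire(expected) == b"\xc3\xb6"

# If the source file is mis-decoded, the literal becomes two characters
# and this comparison fails instead of silently passing.
assert "ö" == expected
assert len("ö") == 1
```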

krosenvold 2009-01-26 04:40:58

Answer 4

+1 A:

In PHP we use the mb_ functions such as mb_detect_encoding() and mb_convert_encoding(). They aren't perfect, but they get us 99.9% of the way there. Then we have a few regular expressions to strip out funky characters that somehow make their way in at times.

If you are going international, you definitely want to use UTF-8. We have yet to find the perfect solution for getting all of our data into UTF-8, and I'm not sure one exists. You just have to keep tinkering with it.
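For readers outside PHP, a loosely comparable heuristic can be sketched in Python. This is not how the mb_ functions work internally; guess_decode and its fallback order are assumptions made for illustration. It tries strict UTF-8 first and only then falls back to single-byte encodings, so it remains a guess in the same way.

```python
def guess_decode(raw, fallbacks=("cp1252", "iso-8859-1")):
    """Decode bytes with strict UTF-8 first, then each fallback in order.

    Returns (text, encoding_used). The result is a guess: bytes that happen
    to be valid UTF-8 are always reported as UTF-8.
    """
    for encoding in ("utf-8",) + tuple(fallbacks):
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the input")


print(guess_decode(b"\xc3\xb6"))  # valid UTF-8, reported as such
print(guess_decode(b"\xf6"))      # not valid UTF-8, falls back
```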

jjriv 2009-01-30 04:10:21

mb_detect_encoding seems to take a similar approach to the regexp provided by Gumbo, though it is more readable. It looks similarly heuristic, in that the Ã¶ problem would still exist, right? Thanks for your input.

Olaf 2009-01-30 08:15:51

Answer 5

+1 A:

Thank you for fliptitle!

I, too, am trying to lay out a proper test plan to make sure that an application supports Unicode strings throughout the system.

I am bilingual, but in two languages that only use ISO-8859-1. Therefore, I have been struggling to determine what is a "real-life," "meaningful" way to test the full range of Unicode possibilities.

I just came across this:

International Testing Basics - Testing non-English and non-ASCII support

I will post further highlights as I discover them.

Follow-Up Post:

After devising some tests for my application, I realized that I had put together a small list of encoded values that might be helpful to others.

I am using the following international strings in my test:

(NOTE: here comes some UTF-8 encoded text... hopefully you can see this in your browser)

ユーザー別サイト
简体中文
크로스 플랫폼으로
מדורים מבוקשים
أفضل البحوث
Σὲ γνωρίζω ἀπὸ
Десятую Международную
แผ่นดินฮั่นเสื่อมโทรมแสนสังเวช
∮ E⋅da = Q, n → ∞, ∑ f(i) = ∏ g(i)
français langue étrangère
mañana olé

(End of UTF-8 foreign/non-English text)

However, at various points during testing, I realized that it was insufficient to know only how the strings were supposed to look when rendered in their respective alphabets. I also needed to know the correct Unicode code point numbers, as well as the correct hexadecimal values for these strings in at least two encodings (UCS-2 and UTF-8).

So... for anyone who would like to use the same strings that I list above, here is the equivalent code-point numbering and hex values -- already translated for you!

str = L"\u30E6\u30FC\u30B6\u30FC\u5225\u30B5\u30A4\u30C8"; // JAPAN 
// Little endian UTF-16/UCS-2: e6 30 fc 30 b6 30 fc 30 25 52 b5 30 a4 30 c8 30 00 00
// Hex of UTF-8: e3 83 a6 e3 83 bc e3 82 b6 e3 83 bc e5 88 a5 e3 82 b5 e3 82 a4 e3 83 88 00 

str = L"\u7B80\u4F53\u4E2D\u6587"; // CHINA 
// Little endian UTF-16/UCS-2: 80 7b 53 4f 2d 4e 87 65 00 00 
// Hex of UTF-8: e7 ae 80 e4 bd 93 e4 b8 ad e6 96 87 00

str = L"\uD06C\uB85C\uC2A4 \uD50C\uB7AB\uD3FC\uC73C\uB85C"; // KOREA 
// Little endian UTF-16/UCS-2: 6c d0 5c b8 a4 c2 20 00 0c d5 ab b7 fc d3 3c c7 5c b8 00 00
// Hex of UTF-8: ed 81 ac eb a1 9c ec 8a a4 20 ed 94 8c eb 9e ab ed 8f bc ec 9c bc eb a1 9c 00 

str = L"\u05DE\u05D3\u05D5\u05E8\u05D9\u05DD \u05DE\u05D1\u05D5\u05E7\u05E9\u05D9\u05DD"; // ISRAEL 
// Little endian UTF-16/UCS-2: de 05 d3 05 d5 05 e8 05 d9 05 dd 05 20 00 de 05 d1 05 d5 05 e7 05 e9 05 d9 05 dd 05 00 00
// Hex of UTF-8: d7 9e d7 93 d7 95 d7 a8 d7 99 d7 9d 20 d7 9e d7 91 d7 95 d7 a7 d7 a9 d7 99 d7 9d 00

str = L"\u0623\u0641\u0636\u0644 \u0627\u0644\u0628\u062D\u0648\u062B"; // EGYPT 
// Little endian UTF-16/UCS-2: 23 06 41 06 36 06 44 06 20 00 27 06 44 06 28 06 2d 06 48 06 2b 06 00 00
// Hex of UTF-8: d8 a3 d9 81 d8 b6 d9 84 20 d8 a7 d9 84 d8 a8 d8 ad d9 88 d8 ab 00 

str = L"\u03A3\u1F72 \u03B3\u03BD\u03C9\u03C1\u03AF\u03B6\u03C9 \u1F00\u03C0\u1F78"; // GREECE 
// Little endian UTF-16/UCS-2: a3 03 72 1f 20 00 b3 03 bd 03 c9 03 c1 03 af 03 b6 03 c9 03 20 00 00 1f c0 03 78 1f 00 00
// Hex of UTF-8: ce a3 e1 bd b2 20 ce b3 ce bd cf 89 cf 81 ce af ce b6 cf 89 20 e1 bc 80 cf 80 e1 bd b8 00 

str = L"\u0414\u0435\u0441\u044F\u0442\u0443\u044E \u041C\u0435\u0436\u0434\u0443\u043D\u0430\u0440\u043E\u0434\u043D\u0443\u044E"; // RUSSIA 
// Little endian UTF-16/UCS-2: 14 04 35 04 41 04 4f 04 42 04 43 04 4e 04 20 00 1c 04 35 04 36 04 34 04 43 04 3d 04 30 04 40 04 3e 04 34 04 3d 04 43 04 4e 04 00 00
// Hex of UTF-8: d0 94 d0 b5 d1 81 d1 8f d1 82 d1 83 d1 8e 20 d0 9c d0 b5 d0 b6 d0 b4 d1 83 d0 bd d0 b0 d1 80 d0 be d0 b4 d0 bd d1 83 d1 8e 00

str = L"\u0E41\u0E1C\u0E48\u0E19\u0E14\u0E34\u0E19\u0E2E\u0E31\u0E48\u0E19\u0E40\u0E2A\u0E37\u0E48\u0E2D\u0E21\u0E42\u0E17\u0E23\u0E21\u0E41\u0E2A\u0E19\u0E2A\u0E31\u0E07\u0E40\u0E27\u0E0A"; // THAILAND
// Little endian UTF-16/UCS-2: 41 0e 1c 0e 48 0e 19 0e 14 0e 34 0e 19 0e 2e 0e 31 0e 48 0e 19 0e 40 0e 2a 0e 37 0e 48 0e 2d 0e 21 0e 42 0e 17 0e 23 0e 21 0e 41 0e 2a 0e 19 0e 2a 0e 31 0e 07 0e 40 0e 27 0e 0a 0e 00 00
// Hex of UTF-8: e0 b9 81 e0 b8 9c e0 b9 88 e0 b8 99 e0 b8 94 e0 b8 b4 e0 b8 99 e0 b8 ae e0 b8 b1 e0 b9 88 e0 b8 99 e0 b9 80 e0 b8 aa e0 b8 b7 e0 b9 88 e0 b8 ad e0 b8 a1 e0 b9 82 e0 b8 97 e0 b8 a3 e0 b8 a1 e0 b9 81 e0 b8 aa e0 b8 99 e0 b8 aa e0 b8 b1 e0 b8 87 e0 b9 80 e0 b8 a7 e0 b8 8a 00

str = L"\u222E E\u22C5da = Q,  n \u2192 \u221E, \u2211 f(i) = \u220F g(i)"; // MATHEMATICS 
// Little endian UTF-16/UCS-2: 2e 22 20 00 45 00 c5 22 64 00 61 00 20 00 3d 00 20 00 51 00 2c 00 20 00 20 00 6e 00 20 00 92 21 20 00 1e 22 2c 00 20 00 11 22 20 00 66 00 28 00 69 00 29 00 20 00 3d 00 20 00 0f 22 20 00 67 00 28 00 69 00 29 00 00 00
// Hex of UTF-8: e2 88 ae 20 45 e2 8b 85 64 61 20 3d 20 51 2c 20 20 6e 20 e2 86 92 20 e2 88 9e 2c 20 e2 88 91 20 66 28 69 29 20 3d 20 e2 88 8f 20 67 28 69 29 00 

str = L"fran\u00E7ais langue \u00E9trang\u00E8re"; // FRANCE
// Little endian UTF-16/UCS-2: 66 00 72 00 61 00 6e 00 e7 00 61 00 69 00 73 00 20 00 6c 00 61 00 6e 00 67 00 75 00 65 00 20 00 e9 00 74 00 72 00 61 00 6e 00 67 00 e8 00 72 00 65 00 00 00
// Hex of UTF-8: 66 72 61 6e c3 a7 61 69 73 20 6c 61 6e 67 75 65 20 c3 a9 74 72 61 6e 67 c3 a8 72 65 00

str = L"ma\u00F1ana ol\u00E9"; // SPAIN
// Little endian UTF-16/UCS-2: 6d 00 61 00 f1 00 61 00 6e 00 61 00 20 00 6f 00 6c 00 e9 00 00 00
// Hex of UTF-8: 6d 61 c3 b1 61 6e 61 20 6f 6c c3 a9 00
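If you want to extend the list with your own strings, the hex values can be generated rather than typed by hand. A small Python sketch follows; hex_bytes is a hypothetical helper name. Note that it prints only the encoded characters, without the trailing 00 / 00 00 terminator shown in the listings above.

```python
def hex_bytes(text, encoding):
    """Return the encoded bytes of `text` as space-separated lowercase hex."""
    return " ".join(f"{b:02x}" for b in text.encode(encoding))


samples = {
    "CHINA": "\u7B80\u4F53\u4E2D\u6587",
    "SPAIN": "ma\u00F1ana ol\u00E9",
}

for label, text in samples.items():
    print(label)
    print("  Little endian UTF-16:", hex_bytes(text, "utf-16-le"))
    print("  UTF-8:               ", hex_bytes(text, "utf-8"))
```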

Also, here are a couple of images that show some common "mis-renderings" that can happen in various editors, even though the underlying bytes are well-formed UTF-8. If you see any of these renderings, it probably means that you correctly produced a UTF-8 string, but that your editor/viewer is trying to interpret the bytes under some encoding other than UTF-8.

Sample Renderings Num. 1

Sample Renderings Num. 2
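One such mis-rendering can also be reproduced and reversed in code. The sketch below is in Python and assumes the viewer used Windows-1252 (cp1252); undo_mojibake is a hypothetical helper, and the check is only a heuristic, since it cannot know what the viewer really did.

```python
def undo_mojibake(shown, wrong_encoding="cp1252"):
    """Return the intended text if `shown` looks like UTF-8 read as `wrong_encoding`.

    Returns None when the text does not fit that pattern.
    """
    try:
        repaired = shown.encode(wrong_encoding).decode("utf-8")
    except UnicodeError:
        return None
    return repaired if repaired != shown else None


intended = "ma\u00F1ana ol\u00E9"

# What a viewer shows when it reads the UTF-8 bytes as cp1252.
shown = intended.encode("utf-8").decode("cp1252")
print(shown)
print(undo_mojibake(shown))
print(undo_mojibake(intended))
```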

que que 2010-04-01 18:48:00
