ansaurus

Question

Really Good, Bad UTF-8 example test data

Answer 1

+4 A:

Wikipedia’s UTF-8 article has a good summary of what byte sequences are valid/invalid. Another article that’s worth reading is W3C I18N FAQ: Multilingual Forms.

Gumbo 2009-08-23 17:09:07

Answer 2

A:

Automate it: write something that generates a sequence of random bits and runs it through your code.

It's pretty hard to think of all the corner cases manually.
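
A minimal sketch of that idea in Python, assuming the interpreter's built-in strict UTF-8 decoder is a trustworthy reference. The names `reference_is_valid` and `fuzz` are made up for this illustration, and `is_valid` stands for whatever validator you are testing. It draws random bytes rather than bits so that the failing inputs stay short enough to inspect by hand.

```python
import random


def reference_is_valid(data: bytes) -> bool:
    """Reference verdict: does Python's strict decoder accept the bytes?"""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def fuzz(is_valid, rounds=10_000, max_len=6, seed=0):
    """Return every random input on which is_valid disagrees with the reference."""
    rng = random.Random(seed)  # fixed seed so a failure can be reproduced
    mismatches = []
    for _ in range(rounds):
        length = rng.randint(1, max_len)
        data = bytes(rng.randrange(256) for _ in range(length))
        if is_valid(data) != reference_is_valid(data):
            mismatches.append(data)
    return mismatches
```

Calling `fuzz(my_validator)` returns the disagreeing inputs; an empty list means no disagreement was found in that run, not that the validator is correct.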

Alexander Kjäll 2009-08-23 17:10:49

Answer 3

+1 A:

Off the top of my head:

0xff and 0xfe

Single high-bit bytes

Multi-byte representation of low-byte characters - A good way of smuggling nulls past early checks

Byte-order marks - Are you going to ignore them?

NFC vs. NFD
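
To illustrate the last point: both normalization forms are valid UTF-8, but they produce different bytes for the same visible text, so byte-wise comparisons can fail. A small sketch using Python's standard `unicodedata` module (the helper name `utf8_forms` is made up here):

```python
import unicodedata


def utf8_forms(text: str):
    """Return the UTF-8 bytes of text in NFC form and in NFD form."""
    nfc = unicodedata.normalize("NFC", text).encode("utf-8")
    nfd = unicodedata.normalize("NFD", text).encode("utf-8")
    return nfc, nfd


# U+00E9 (e with acute accent): one code point in NFC, two in NFD.
nfc, nfd = utf8_forms("\u00e9")
print(nfc.hex(" "), "|", nfd.hex(" "))
```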

Douglas Leeder 2009-08-23 17:22:16

Answer 4

A:

Byte-order marks - Are you going to ignore them?

No I have some Neosporin for that. Next time I'll remember to sleep inside the hexdec so I don't get bitten. That and I need to debug_zval_dump around the camp...

Xeoncross 2009-08-23 17:38:53

Answer 5

+3 A:

See also How does a file with Chinese characters know how many bytes to use per character? - no doubt, there are other SO questions that would also help.

In UTF-8, you get the following types of bytes:

Binary    Hex          Comments
0xxxxxxx  0x00..0x7F   Only byte of a 1-byte character encoding
10xxxxxx  0x80..0xBF   Continuation byte (1-3 of these follow a first byte)
110xxxxx  0xC0..0xDF   First byte of a 2-byte character encoding
1110xxxx  0xE0..0xEF   First byte of a 3-byte character encoding
11110xxx  0xF0..0xF4   First byte of a 4-byte character encoding

(The last line looks as if it should read 0xF0..0xF7; however, Unicode code points stop at U+10FFFF, so the largest valid first byte is 0xF4, and it may only be followed by a continuation byte in the range 0x80..0x8F. The values 0xF5..0xF7 cannot occur in valid UTF-8.)
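
The table can be turned into a small classifier. This is only a sketch; the function name `classify_byte` and the labels it returns are invented for illustration. It reports 0xC0 and 0xC1 as invalid even though they match the 110xxxxx pattern, because they can only start non-minimal encodings.

```python
def classify_byte(b: int) -> str:
    """Classify a single byte value by the role it can play in UTF-8."""
    if not 0 <= b <= 0xFF:
        raise ValueError("not a byte value")
    if b <= 0x7F:
        return "single"        # 0xxxxxxx: complete 1-byte character
    if b <= 0xBF:
        return "continuation"  # 10xxxxxx
    if b <= 0xC1:
        return "invalid"       # 0xC0, 0xC1: only non-minimal encodings
    if b <= 0xDF:
        return "lead2"         # 110xxxxx
    if b <= 0xEF:
        return "lead3"         # 1110xxxx
    if b <= 0xF4:
        return "lead4"         # 11110xxx, capped by U+10FFFF
    return "invalid"           # 0xF5..0xFF never occur
```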

Looking at whether a particular sequence of bytes is valid UTF-8 means you need to think about:

Continuation characters appearing where not expected
Non-continuation characters appearing where a continuation character expected
Incomplete characters at end of string (variation of 'continuation character expected')
Non-minimal sequences
UTF-16 surrogates

In valid UTF-8, the bytes 0xF5..0xFF cannot occur.

Non-minimal sequences

There are multiple possible representations for some characters. For example, the Unicode character U+0000 (ASCII NUL) could be represented by:

0x00
0xC0 0x80
0xE0 0x80 0x80
0xF0 0x80 0x80 0x80

However, the Unicode standard clearly states that the last three alternatives are not acceptable because they are not minimal. It so happens that the bytes 0xC0 and 0xC1 can never appear in valid UTF-8 because the only characters that could be encoded by those are minimally encoded as single byte characters in the range 0x00..0x7F.
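
Non-minimal test data is easy to generate. The sketch below (the helper name `encode_with_length` is hypothetical) packs a code point into exactly the number of bytes you ask for, whether or not that is the minimal length. Continuation bytes always carry the 10xxxxxx prefix, so the two-byte form of U+0000 comes out as 0xC0 0x80.

```python
def encode_with_length(cp: int, length: int) -> bytes:
    """Encode cp in exactly `length` bytes (2, 3 or 4), minimal or not."""
    limits = {2: 0x7FF, 3: 0xFFFF, 4: 0x1FFFFF}
    markers = {2: 0xC0, 3: 0xE0, 4: 0xF0}
    if length not in limits:
        raise ValueError("length must be 2, 3 or 4")
    if not 0 <= cp <= limits[length]:
        raise ValueError("code point does not fit in that many bytes")
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (cp & 0x3F))  # low 6 bits go into a continuation byte
        cp >>= 6
    return bytes([markers[length] | cp] + tail[::-1])


def is_valid(data: bytes) -> bool:
    """Verdict of Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

For example, `encode_with_length(0x2F, 2)` yields 0xC0 0xAF, a non-minimal '/', which `is_valid` rejects, whereas `encode_with_length(0xE9, 2)` is the ordinary minimal encoding of U+00E9 and is accepted.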

UTF-16 Surrogates

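Within the Basic Multilingual Plane (BMP), the Unicode values U+D800 - U+DFFF are reserved for UTF-16 surrogates and cannot appear in valid UTF-8. Encoded directly, they would show up as 0xED followed by a byte in the range 0xA0..0xBF and one more continuation byte.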
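
In Python, one way to produce such bytes for test data is the standard `surrogatepass` error handler, which lets the encoder emit a lone surrogate that it would otherwise refuse. The helper name `encoded_surrogate` is made up for this sketch.

```python
def encoded_surrogate(cp: int) -> bytes:
    """Return the (invalid) 3-byte UTF-8 form of a surrogate code point."""
    if not 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a surrogate code point")
    return chr(cp).encode("utf-8", "surrogatepass")


def is_valid(data: bytes) -> bool:
    """Verdict of Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


low_end = encoded_surrogate(0xD800)   # first surrogate
high_end = encoded_surrogate(0xDFFF)  # last surrogate
print(low_end.hex(" "), is_valid(low_end))
print(high_end.hex(" "), is_valid(high_end))
```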

Bad Data

So, your BAD data should contain samples violating these various prescriptions.

Continuation character not preceded by one of the initial byte values
Multi-byte initial bytes not followed by enough continuation characters
Non-minimal multi-byte characters
UTF-16 surrogates
Invalid bytes (0xC0, 0xC1, 0xF5..0xFF).
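
One possible starting set, with a sample or two per category in the list above. The byte strings and the names `BAD_SAMPLES`, `GOOD_SAMPLES` and `failures` are illustrative choices, not an exhaustive suite; the good samples are there so that a validator which rejects everything does not pass.

```python
BAD_SAMPLES = {
    "stray continuation byte": b"\x80",
    "first byte followed by a non-continuation byte": b"\xe2\x82A",
    "incomplete character at end of input": b"\xe2\x82",
    "lone first byte": b"\xc3",
    "non-minimal 2-byte '/'": b"\xc0\xaf",
    "non-minimal 3-byte NUL": b"\xe0\x80\x80",
    "encoded UTF-16 surrogate U+D800": b"\xed\xa0\x80",
    "invalid byte 0xF5": b"\xf5\x80\x80\x80",
    "invalid byte 0xFF": b"\xff",
    "beyond U+10FFFF": b"\xf4\x90\x80\x80",
}

GOOD_SAMPLES = {
    "1-byte 'A'": b"A",
    "2-byte U+00E9": b"\xc3\xa9",
    "3-byte U+20AC": b"\xe2\x82\xac",
    "4-byte U+1F600": b"\xf0\x9f\x98\x80",
    "largest code point U+10FFFF": b"\xf4\x8f\xbf\xbf",
}


def reference_is_valid(data: bytes) -> bool:
    """Verdict of Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def failures(is_valid):
    """Names of the samples that is_valid classifies wrongly."""
    wrong = [name for name, data in BAD_SAMPLES.items() if is_valid(data)]
    wrong += [name for name, data in GOOD_SAMPLES.items() if not is_valid(data)]
    return wrong
```

`failures(your_validator)` should come back empty; anything it lists is a sample your code accepts or rejects incorrectly.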

Note that a byte-order mark cannot appear unencoded in UTF-8 - the bytes 0xFF and 0xFE are not permitted in valid UTF-8. An encoded zero-width non-breaking space (U+FEFF) can appear in a UTF-8 file as 0xEF 0xBB 0xBF. But the BOM is completely superfluous in UTF-8.
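
The three bytes 0xEF 0xBB 0xBF are the UTF-8 encoding of U+FEFF, and Python exposes them as `codecs.BOM_UTF8`. A sketch of how a test might check BOM handling; `split_bom` is a made-up helper, while the `utf-8-sig` codec is part of the standard library and drops a leading BOM on decode.

```python
import codecs


def split_bom(data: bytes):
    """Return (had_bom, rest) for a UTF-8 byte string."""
    if data.startswith(codecs.BOM_UTF8):
        return True, data[len(codecs.BOM_UTF8):]
    return False, data


sample = codecs.BOM_UTF8 + b"hi"
print(split_bom(sample))
print(repr(sample.decode("utf-8")))      # plain utf-8 keeps U+FEFF in the text
print(repr(sample.decode("utf-8-sig")))  # utf-8-sig strips it
```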

There are also some non-characters. U+FFFF is one such - it is reserved for internal use and is not meant for open interchange, so it should not normally appear in UTF-8 encoded data either, although many decoders will accept it.

Jonathan Leffler 2009-08-23 18:39:05

Thanks for this great list. I plan on checking each of these out in more detail now.

Xeoncross 2009-08-25 02:54:05

Answer 6

+6 A:

Check out Markus Kuhn’s UTF-8 decoder stress test

zildjohn01 2009-08-23 19:33:16

While you didn't put any effort into this - that page is just what I was looking for. ;)

Xeoncross 2009-08-25 02:56:26

Don't forget, knowing where to find the answer is often as important as knowing the answer.

Jonathan Leffler 2009-08-25 03:27:25
