I'm testing how some of my code handles bad data, and I need a few byte sequences that are invalid UTF-8. Can you post some, ideally with an explanation of why they are invalid and where you got them?

Thanks!

+5  A: 

Take a look at Markus Kuhn's UTF-8 decoder capability and stress test file.

You'll find examples of many UTF-8 irregularities there, including lonely start bytes, missing continuation bytes, overlong sequences, etc.
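
To make those categories concrete, here is a minimal Python sketch with one or two hand-written byte strings per category. The specific bytes are my own illustrations, not copied from the stress test file, and the helper name `is_valid_utf8` is hypothetical. The last three entries (surrogate, out-of-range code point, and the 0xFF byte) are not named in the answer above; I am assuming they fall under its "etc.".

```python
# Hand-picked invalid sequences, one or two per category.
# These bytes are illustrative assumptions, not taken from the stress test file.
INVALID_SAMPLES = [
    ("lonely start byte", b"\xc3"),                  # 2-byte lead, nothing follows
    ("lonely start byte before ASCII", b"\xc3\x28"), # lead followed by '('
    ("unexpected continuation byte", b"\x80"),       # continuation with no lead
    ("missing continuation byte", b"\xe2\x82"),      # 3-byte sequence cut short
    ("overlong encoding of '/'", b"\xc0\xaf"),       # U+002F must be the single byte 0x2F
    ("overlong encoding of NUL", b"\xe0\x80\x80"),   # U+0000 must be the single byte 0x00
    ("UTF-16 surrogate U+D800", b"\xed\xa0\x80"),    # surrogates are not encodable
    ("code point above U+10FFFF", b"\xf4\x90\x80\x80"),
    ("byte never used in UTF-8", b"\xff"),
]


def is_valid_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")  # errors="strict" is the default
        return True
    except UnicodeDecodeError:
        return False


for label, raw in INVALID_SAMPLES:
    print(f"{label:35} {raw.hex(' '):15} valid={is_valid_utf8(raw)}")
```

Python 3's built-in decoder is used here only as a convenient strict reference; a decoder in another language may be more lenient, which is exactly what such samples help you find out.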

Nemanja Trifunovic
Awesome answer -- exactly what I needed. You rock!
twk
A: 

Fuzz testing: generate random sequences of octets. You will most likely hit some illegal sequences sooner rather than later.
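
A rough sketch of that idea in Python, assuming the built-in strict decoder is an acceptable oracle for "illegal". The function names, the sequence length of 8, and the fixed seed are all arbitrary choices for illustration.

```python
import random


def is_valid_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def random_invalid_sequences(count: int, length: int = 8, seed: int = 0) -> list:
    """Draw random byte strings and keep only those that are not valid UTF-8.

    A fixed seed keeps the generated test data reproducible between runs.
    """
    rng = random.Random(seed)
    found = []
    while len(found) < count:
        candidate = bytes(rng.randrange(256) for _ in range(length))
        if not is_valid_utf8(candidate):
            found.append(candidate)
    return found


for sample in random_invalid_sequences(5):
    print(sample.hex(" "))
```

Random data tells you that a sequence is invalid but not why, so it works best alongside hand-written cases rather than instead of them.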

shoosh
+1  A: 

In PHP:

$examples = array(
    'Valid ASCII' => "a",
    'Valid 2 Octet Sequence' => "\xc3\xb1",
    'Invalid 2 Octet Sequence' => "\xc3\x28",
    'Invalid Sequence Identifier' => "\xa0\xa1",
    'Valid 3 Octet Sequence' => "\xe2\x82\xa1",
    'Invalid 3 Octet Sequence (in 2nd Octet)' => "\xe2\x28\xa1",
    'Invalid 3 Octet Sequence (in 3rd Octet)' => "\xe2\x82\x28",
    'Valid 4 Octet Sequence' => "\xf0\x90\x8c\xbc",
    'Invalid 4 Octet Sequence (in 2nd Octet)' => "\xf0\x28\x8c\xbc",
    'Invalid 4 Octet Sequence (in 3rd Octet)' => "\xf0\x90\x28\xbc",
    'Invalid 4 Octet Sequence (in 4th Octet)' => "\xf0\x90\x8c\x28",
    'Valid 5 Octet Sequence (but not Unicode!)' => "\xf8\xa1\xa1\xa1\xa1",
    'Valid 6 Octet Sequence (but not Unicode!)' => "\xfc\xa1\xa1\xa1\xa1\xa1",
);

From http://www.php.net/manual/en/reference.pcre.pattern.modifiers.php#54805
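
If you want to check cases like these against a strict decoder, here is a hedged Python transcription. The `CASES` table and the `is_valid_utf8` helper are my own names, and the expected flags reflect Python 3's decoder, which follows the RFC 3629 limit of four octets. That is why the 5 and 6 octet forms, although well formed under the older definition of UTF-8, are expected to be rejected.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# label -> (bytes, expected result from a strict RFC 3629 decoder)
CASES = {
    "Valid ASCII": (b"a", True),
    "Valid 2 Octet Sequence": (b"\xc3\xb1", True),
    "Invalid 2 Octet Sequence": (b"\xc3\x28", False),
    "Invalid Sequence Identifier": (b"\xa0\xa1", False),
    "Valid 3 Octet Sequence": (b"\xe2\x82\xa1", True),
    "Invalid 3 Octet Sequence (in 2nd Octet)": (b"\xe2\x28\xa1", False),
    "Invalid 3 Octet Sequence (in 3rd Octet)": (b"\xe2\x82\x28", False),
    "Valid 4 Octet Sequence": (b"\xf0\x90\x8c\xbc", True),
    "Invalid 4 Octet Sequence (in 2nd Octet)": (b"\xf0\x28\x8c\xbc", False),
    "Invalid 4 Octet Sequence (in 3rd Octet)": (b"\xf0\x90\x28\xbc", False),
    "Invalid 4 Octet Sequence (in 4th Octet)": (b"\xf0\x90\x8c\x28", False),
    "5 Octet Sequence (not Unicode)": (b"\xf8\xa1\xa1\xa1\xa1", False),
    "6 Octet Sequence (not Unicode)": (b"\xfc\xa1\xa1\xa1\xa1\xa1", False),
}

for label, (raw, expected) in CASES.items():
    actual = is_valid_utf8(raw)
    status = "ok" if actual == expected else "MISMATCH"
    print(f"{status:8} {label}: {raw.hex(' ')}")
```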

philfreo