I'd like to test the Unicode handling of my code. Is there anything I can put in random.choice() to select from the entire Unicode range, preferably not an external module? Neither Google nor StackOverflow seems to have an answer.

Edit: It looks like this is more complex than expected, so I'll rephrase the question: is the following code sufficient to generate all valid non-control characters in Unicode?

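import unicodedata

# Python 2. unichr() accepts values above 0xFFFF only on a wide build.
# Keeps letters, marks, numbers, punctuation, symbols and separators.
unicode_glyphs = ''.join(
    unichr(char)
    for char in xrange(1114112)  # 0x10ffff + 1
    if unicodedata.category(unichr(char))[0] in 'LMNPSZ'
    )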
A: 

Since Unicode is just a range of - well - codes, what about using unichr() to get the unicode string corresponding to a random number between 0 and 0xFFFF?
(Of course that would give just one codepoint, so iterate as required)

Joril
Unfortunately, it's not so simple. Unicode contains many more than 0x10000 code points (the code space runs up to U+10FFFF), and the valid range is not contiguous. For example, the surrogate values must never appear as single code points. So the question of what forms a valid UTF-8 string is highly nontrivial. The details are described in definition D92 of Chapter 3 of the Unicode Standard. There is also a table (Table 3–7) that lists all valid possibilities for UTF-8 byte sequences.
Philipp
I see, thanks :)
Joril
Unicode runs from U+0000 to U+10FFFF; there are also numerous code points that are not valid, including (as it happens) U+FFFF. The Unicode standard says of it "<not a character> - the value FFFF is guaranteed not to be a Unicode character at all".
Jonathan Leffler
UTF-8 is a Unicode encoding.
ThomasH
A: 

You could download a website written in Greek or German that uses Unicode and feed that to your code.

voyager
+6  A: 

There is a UTF-8 stress test from Markus Kuhn you could use.

See also Really Good, Bad UTF-8 example test data.

Gumbo
That would be useful to ensure that the program doesn't break when given incorrect text, but it wouldn't help as a conformance test.
voyager
+1. l0b0: don't worry about generating random unicode. Borrowing someone else's wheel > reinventing it.
Matt Ball
+2  A: 

It depends on how thoroughly you want to do the testing and how accurately you want to do the generation. In full, Unicode is a 21-bit code set (U+0000 .. U+10FFFF). However, some quite large chunks of that range are set aside for private use (custom characters). Do you want to worry about generating combining characters at the start of a string (because they should only appear after another character)?

The basic approach I'd adopt is to randomly generate a Unicode code point (say U+2397 or U+31232), validate it in context (is it a legitimate character; can it appear here in the string) and encode valid code points in UTF-8.
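
A minimal Python 3 sketch of that generate, validate, encode loop might look like the following. The function names and the choice of rejected categories are assumptions made for illustration, not part of the answer: here "legitimate" is taken to mean "not a control, surrogate, private-use, or unassigned code point", and "can it appear here" is reduced to "no combining mark at the start of the string".

```python
import random
import unicodedata

# Assumption: these general categories count as "not a legitimate character".
# Cc = control, Cs = surrogate, Co = private use, Cn = unassigned/noncharacter.
REJECTED = {"Cc", "Cs", "Co", "Cn"}

def random_valid_char(at_start, rng=random):
    """Draw code points until one passes the checks, then return it."""
    while True:
        ch = chr(rng.randint(0, 0x10FFFF))
        category = unicodedata.category(ch)
        if category in REJECTED:
            continue
        # Combining marks (categories starting with "M") need a base character.
        if at_start and category.startswith("M"):
            continue
        return ch

def random_utf8_text(length, rng=random):
    """Build a string of `length` accepted characters and encode it as UTF-8."""
    chars = [random_valid_char(at_start=(i == 0), rng=rng) for i in range(length)]
    return "".join(chars).encode("utf-8")

print(random_utf8_text(10))
```

Which code points count as assigned depends on the Unicode version shipped with the interpreter's unicodedata module, so the accepted set will differ between Python releases.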

If you just want to check whether your code handles malformed UTF-8 correctly, you can use much simpler generation schemes.
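
As one possible simpler scheme, a handful of hand-picked malformed byte sequences can be fed to the code under test. The sample set and the helper name below are illustrative assumptions; the check uses Python 3's strict UTF-8 decoder as the reference for what counts as well-formed.

```python
# Hand-picked malformed sequences; the labels are descriptive, not official terms.
MALFORMED_SAMPLES = {
    "lone continuation byte": b"\x80",
    "truncated 3-byte sequence": b"\xe2\x82",
    "overlong encoding of '/'": b"\xc0\xaf",
    "encoded surrogate U+D800": b"\xed\xa0\x80",
    "beyond U+10FFFF": b"\xf4\x90\x80\x80",
}

def is_well_formed_utf8(data):
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

for label, sample in MALFORMED_SAMPLES.items():
    # Each sample is expected to be rejected by a strict decoder.
    print(label, is_well_formed_utf8(sample))
```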

Note that you need to know what to expect given the input - otherwise you are not testing; you are experimenting.

Jonathan Leffler
+1  A: 

Here is an example function that probably creates a random well-formed UTF-8 sequence, as defined in Table 3–7 of Unicode 5.0.0:

#!/usr/bin/env python3.1

# From Table 3–7 of the Unicode Standard 5.0.0

import random

def byte_range(first, last):
    return list(range(first, last+1))

first_values = byte_range(0x00, 0x7F) + byte_range(0xC2, 0xF4)
trailing_values = byte_range(0x80, 0xBF)

def random_utf8_seq():
    first = random.choice(first_values)
    if first <= 0x7F:
        return bytes([first])
    elif first <= 0xDF:
        return bytes([first, random.choice(trailing_values)])
    elif first == 0xE0:
        return bytes([first, random.choice(byte_range(0xA0, 0xBF)), random.choice(trailing_values)])
    elif first == 0xED:
        return bytes([first, random.choice(byte_range(0x80, 0x9F)), random.choice(trailing_values)])
    elif first <= 0xEF:
        return bytes([first, random.choice(trailing_values), random.choice(trailing_values)])
    elif first == 0xF0:
        return bytes([first, random.choice(byte_range(0x90, 0xBF)), random.choice(trailing_values), random.choice(trailing_values)])
    elif first <= 0xF3:
        return bytes([first, random.choice(trailing_values), random.choice(trailing_values), random.choice(trailing_values)])
    elif first == 0xF4:
        return bytes([first, random.choice(byte_range(0x80, 0x8F)), random.choice(trailing_values), random.choice(trailing_values)])

print("".join(str(random_utf8_seq(), "utf8") for i in range(10)))

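Because of the vastness of the Unicode standard I cannot test this thoroughly. Also note that the characters are not equally distributed: each byte is chosen uniformly from its allowed range, so the 128 one-byte characters come up far more often than any individual multi-byte character.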
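
If an even distribution over characters matters, one alternative (a different technique from the table-driven function above, sketched here with hypothetical names) is to pick a Unicode scalar value uniformly, skipping the surrogate block, and let Python's built-in codec do the encoding. Note that this still returns unassigned code points and noncharacters; it only guarantees a well-formed UTF-8 sequence.

```python
import random

SURROGATE_START = 0xD800
SURROGATE_COUNT = 0x800                        # U+D800 .. U+DFFF
SCALAR_COUNT = 0x110000 - SURROGATE_COUNT      # code points minus surrogates

def random_scalar_value(rng=random):
    """Pick a code point uniformly from all Unicode scalar values."""
    index = rng.randrange(SCALAR_COUNT)
    # Shift indexes at or above the gap past the surrogate block.
    if index >= SURROGATE_START:
        index += SURROGATE_COUNT
    return index

def random_uniform_utf8_seq(rng=random):
    """Encode one uniformly chosen scalar value as UTF-8."""
    return chr(random_scalar_value(rng)).encode("utf-8")

print(b"".join(random_uniform_utf8_seq() for _ in range(10)))
```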

Philipp
A: 

Answering revised question:

Yes, on a strict definition of "control characters" -- note that you won't include CR, LF, and TAB; is that what you want?
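
To see the effect on CR, LF, and TAB, the filter from the question can be applied to a few individual characters. This is a small Python 3 check; the helper name is made up for the example.

```python
import unicodedata

KEPT_MAJOR_CLASSES = "LMNPSZ"  # same letters as the filter in the question

def passes_filter(ch):
    """True if the character's general category starts with a kept letter."""
    return unicodedata.category(ch)[0] in KEPT_MAJOR_CLASSES

for ch in ("\t", "\n", "\r", " ", "a"):
    print(repr(ch), unicodedata.category(ch), passes_filter(ch))
```

TAB, LF, and CR all have category Cc and are dropped, while the space (Zs) and the letter (Ll) are kept.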

Please consider responding to my earlier invitation to tell us what you are really trying to do.

John Machin