tags:
views: 239
answers: 2

Hello, I am currently designing a font engine for an embedded display. The basic problem is the following:

I need to take a dynamically generated text string, look up the values from that string in a UTF-8 table, then use the table to point to the compressed bitmap array of all the supported characters. After that is complete, I call a bitcopy routine that moves the data from the bitmap array to the display.

I will not be supporting the full UTF-8 character set, as I have very limited system resources to work with (32K ROM, 8K RAM), but want to have the ability to add the needed glyphs later on for localization purposes. All development is being done in C and assembly.

The glyph size is a maximum of 16 pixels wide by 16 pixels tall. We will probably need to support the whole of the Basic Multilingual Plane (up to 3 bytes per character in UTF-8), as some of our larger customers are in Asia. However, we would not be including the whole table in any specific localization.

My question is this:
What is the best way to add this UTF-8 support and associated table?

A: 

You didn't specify the size of your characters or the size of your character set, so it is difficult to estimate the storage requirements.

I would store the bitmaps in a straight array format. Depending on the size of the characters, it might store fairly efficiently without the need to pack/unpack elements.

For example, if we take a 36-character alphabet with an 8x6 character cell, you need 216 bytes of storage for the array (6 bytes/character * 36 characters; each byte is a vertical slice of the character).

For the parsing, it is simply a matter of computing an offset into the table.
The old (char - 'A') and (char - '0') tricks work quite well.
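A minimal sketch of that layout (the array name, the empty glyph data and the digit/letter split are mine, for illustration), assuming 8x6 characters stored as one byte per vertical slice:

```c
/* Hypothetical straight-array font: 36 glyphs ('0'-'9' then 'A'-'Z'),
 * 8 pixels tall by 6 wide, one byte per vertical slice. */
#define GLYPH_BYTES 6
static const unsigned char font[36][GLYPH_BYTES] = { {0} }; /* bitmap data goes here */

/* Map a character to its row in the array using the offset tricks. */
static int glyph_index(char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';            /* digits occupy rows 0..9    */
    if (c >= 'A' && c <= 'Z')
        return 10 + (c - 'A');     /* letters occupy rows 10..35 */
    return -1;                     /* unsupported character      */
}

/* Return a pointer to one glyph's six slices, or NULL if unsupported. */
static const unsigned char *glyph_bits(char c)
{
    int i = glyph_index(c);
    return (i < 0) ? 0 : font[i];
}
```

The whole table can live in ROM as a single const array; sizeof(font) comes out to exactly the 216 bytes mentioned above.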

The other question is where to store the bitmap array. ROM is the obvious answer, but if you need to support other glyphs later, that might mean reprogramming the device; you don't say whether that's an issue.

If the glyphs must be programmed dynamically, then you don't have a choice but to put it in RAM.

Benoit
Thanks for the feedback, I've updated the problem description to include the character size and (vaguely) answer your question on the size of the character set.
RxScram
+1  A: 

The solution below assumes that the lower 16 bits of the Unicode space will be enough for you. If your bitmap table has, say, U+0020 through U+007E at positions 0x00 to 0x5E, U+00A0 through U+00FF at positions 0x5F to 0xBE, and U+1200 through U+1240 at 0xBF to 0xFF, you could do something like the code below (which isn't tested, not even compile-tested).

bitmapmap contains a series of pairs of values. The first value in the first pair is the Unicode code point which the bitmap at index 0 represents. The assumption is that the bitmap table contains a series of directly adjacent Unicode code points. So the second value says how long this series is.

The first part of the while loop iterates through UTF-8 input and builds up a Unicode code point in ucs2char. Once a complete character is found, the second part searches for that character in one of the ranges mentioned in bitmapmap. If it finds an appropriate bitmap index, it adds it to indexes. Characters for which no bitmap is present are silently dropped.

The function returns the number of bitmap indexes found.

This way of doing things should be memory-efficient in terms of the unicode->bitmap table, reasonably fast and reasonably flexible.

// Code below assumes C99, but is about three cut-and-pastes from C89
// Assuming an unsigned short is 16-bit

// Pairs of (first code point in a run, number of code points in the run),
// terminated by a zero entry.
unsigned short bitmapmap[]={0x0020, 0x005F,
                            0x00A0, 0x0060,
                            0x1200, 0x0041,
                            0x0000};

int utf8_to_bitmap_indexes(unsigned char *utf8, unsigned short *indexes)
{
    int bitmapsfound=0;
    int utf8numchars=0;
    unsigned char c;
    unsigned short ucs2char;
    while (*utf8)
    {
        c=*utf8++;
        if (c>=0xc0)
        {
            // Lead byte: count the leading 1 bits to get the sequence length
            utf8numchars=0;
            while (c&0x80)
            {
                utf8numchars++;
                c<<=1;
            }
            c>>=utf8numchars;
            ucs2char=0;
        }
        else if (utf8numchars && c<0x80)
        {
            // This is invalid UTF-8.  Do our best.
            utf8numchars=0;
        }

        if (utf8numchars)
        {
            c&=0x3f;
            ucs2char<<=6;
            ucs2char+=c;
            utf8numchars--;
            if (utf8numchars)
                continue; // Our work here is done - no char yet
        }
        else
            ucs2char=c;

        // At this point, we have a complete UCS-2 char in ucs2char

        unsigned short bmpsearch=0;
        unsigned short bmpix=0;
        while (bitmapmap[bmpsearch])
        {
            if (ucs2char>=bitmapmap[bmpsearch] && ucs2char<bitmapmap[bmpsearch]+bitmapmap[bmpsearch+1])
            {
                *indexes++ = bmpix+(ucs2char-bitmapmap[bmpsearch]);
                bitmapsfound++;
                break;
            }

            bmpix+=bitmapmap[bmpsearch+1];
            bmpsearch+=2;
        }
    }
    return bitmapsfound;
}

EDIT: You mentioned that you need more than the lower 16 bits. Apply s/unsigned short/unsigned int/; s/ucs2char/codepoint/; to the code above and it can then cover the whole Unicode space.
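As a sketch of what that widening buys you, here is the same bit-counting scheme applied to a single character with a 32-bit result (the function name and signature are mine, for illustration):

```c
/* Decode one UTF-8 sequence into a Unicode code point, counting the
 * lead byte's leading 1 bits exactly as the function above does.
 * Returns the number of bytes consumed, or 0 on a stray continuation byte. */
static int utf8_decode_one(const unsigned char *s, unsigned int *codepoint)
{
    unsigned char c = s[0];
    int nbytes, i;

    if (c < 0x80) {              /* plain ASCII */
        *codepoint = c;
        return 1;
    }
    if (c < 0xc0)                /* continuation byte with no lead byte */
        return 0;

    nbytes = 0;                  /* count the leading 1 bits */
    while (c & 0x80) {
        nbytes++;
        c <<= 1;
    }
    *codepoint = c >> nbytes;    /* payload bits of the lead byte */

    for (i = 1; i < nbytes; i++) /* fold in 6 bits per continuation byte */
        *codepoint = (*codepoint << 6) | (s[i] & 0x3f);
    return nbytes;
}
```

With an unsigned int accumulator, three-byte sequences (the rest of the BMP, e.g. U+1200 = E1 88 80) and even four-byte sequences decode with no further changes.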

Jon Bright
Great answer, thanks for your help.
RxScram