tags:
views: 239
answers: 2

Hello, I am currently designing a font engine for an embedded display. The basic problem is the following:

I need to take a dynamically generated text string, look up the values from that string in a UTF-8 table, then use the table to point to the compressed bitmap array of all the supported characters. After that is complete, I call a bitcopy routine that moves the data from the bitmap array to the display.

I will not be supporting the full UTF-8 character set, as I have very limited system resources to work with (32K ROM, 8K RAM), but want to have the ability to add the needed glyphs later on for localization purposes. All development is being done in C and assembly.

The glyph size is a maximum of 16 pixels wide by 16 pixels tall. We will probably need to support the whole of the Basic Multilingual Plane (up to 3 bytes per character in UTF-8), as some of our larger customers are in Asia. However, we would not be including the whole table in any specific localization.

My question is this:
What is the best way to add this UTF-8 support and associated table?

A: 

You didn't specify the size of your characters or the size of your character set, so it is difficult to estimate the storage requirements.

I would store the bitmaps in a straight array format. Depending on the size of the characters, it might store fairly efficiently without the need to pack/unpack elements.

For example, if we take a 36-character alphabet with an 8x6 character cell, you need 216 bytes of storage for the array (6 bytes/character * 36 characters; each byte is a vertical slice of the character).

For the parsing, it is simply a matter of computing an offset into the table.
The old (char - 'A') and (char - '0') tricks work quite well.
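A minimal sketch of that layout (the array name, the empty glyph data and the digit/letter split are mine, for illustration), assuming 8x6 characters stored as one byte per vertical slice:

```c
/* Hypothetical straight-array font: 36 glyphs ('0'-'9' then 'A'-'Z'),
 * 8 pixels tall by 6 wide, one byte per vertical slice. */
#define GLYPH_BYTES 6
static const unsigned char font[36][GLYPH_BYTES] = { {0} }; /* bitmap data goes here */

/* Map a character to its row in the array using the offset tricks. */
static int glyph_index(char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';            /* digits occupy rows 0..9    */
    if (c >= 'A' && c <= 'Z')
        return 10 + (c - 'A');     /* letters occupy rows 10..35 */
    return -1;                     /* unsupported character      */
}

/* Return a pointer to one glyph's six slices, or NULL if unsupported. */
static const unsigned char *glyph_bits(char c)
{
    int i = glyph_index(c);
    return (i < 0) ? 0 : font[i];
}
```

The whole table can live in ROM as a single const array; sizeof(font) comes out to exactly the 216 bytes mentioned above.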

The other question is where to store the bitmap array. ROM is the obvious answer, but if you need to support other glyphs later, that might mean reprogramming the device; you don't say whether that's an issue.

If the glyphs must be programmed dynamically, then you don't have a choice but to put it in RAM.

Benoit
Thanks for the feedback, I've updated the problem description to include the character size and (vaguely) answer your question on the size of the character set.
RxScram
+1  A: 

The solution below assumes that the lower 16 bits of the Unicode space will be enough for you. If your bitmap table has, say, U+0020 through U+007E at positions 0x00 to 0x5E, U+00A0 through U+00FF at positions 0x5F to 0xBE, and U+1200 through U+1240 at 0xBF to 0xFF, you could do something like the code below (which isn't tested, not even compile-tested).

bitmapmap contains a series of pairs of values. The first value in the first pair is the Unicode code point which the bitmap at index 0 represents. The assumption is that the bitmap table contains a series of directly adjacent Unicode code points. So the second value says how long this series is.

The first part of the while loop iterates through UTF-8 input and builds up a Unicode code point in ucs2char. Once a complete character is found, the second part searches for that character in one of the ranges mentioned in bitmapmap. If it finds an appropriate bitmap index, it adds it to indexes. Characters for which no bitmap is present are silently dropped.

The function returns the number of bitmap indexes found.

This way of doing things should be memory-efficient in terms of the unicode->bitmap table, reasonably fast and reasonably flexible.

// Code below assumes C99, but is about three cut-and-pastes from C89
// Assuming an unsigned short is 16-bit

// Pairs of (first code point in a run, number of code points in the run),
// terminated by a zero entry.
unsigned short bitmapmap[]={0x0020, 0x005F,
                            0x00A0, 0x0060,
                            0x1200, 0x0041,
                            0x0000};

int utf8_to_bitmap_indexes(unsigned char *utf8, unsigned short *indexes)
{
    int bitmapsfound=0;
    int utf8numchars=0;
    unsigned char c;
    unsigned short ucs2char;
    while (*utf8)
    {
        c=*utf8++;
        if (c>=0xc0)
        {
            // Lead byte: count the leading 1 bits to get the sequence length
            utf8numchars=0;
            while (c&0x80)
            {
                utf8numchars++;
                c<<=1;
            }
            c>>=utf8numchars;
            ucs2char=0;
        }
        else if (utf8numchars && c<0x80)
        {
            // This is invalid UTF-8.  Do our best.
            utf8numchars=0;
        }

        if (utf8numchars)
        {
            c&=0x3f;
            ucs2char<<=6;
            ucs2char+=c;
            utf8numchars--;
            if (utf8numchars)
                continue; // Our work here is done - no char yet
        }
        else
            ucs2char=c;

        // At this point, we have a complete UCS-2 char in ucs2char

        unsigned short bmpsearch=0;
        unsigned short bmpix=0;
        while (bitmapmap[bmpsearch])
        {
            if (ucs2char>=bitmapmap[bmpsearch] && ucs2char<bitmapmap[bmpsearch]+bitmapmap[bmpsearch+1])
            {
                *indexes++ = bmpix+(ucs2char-bitmapmap[bmpsearch]);
                bitmapsfound++;
                break;
            }

            bmpix+=bitmapmap[bmpsearch+1];
            bmpsearch+=2;
        }
    }
    return bitmapsfound;
}

EDIT: You mentioned that you need more than the lower 16 bits. Apply s/unsigned short/unsigned int/; s/ucs2char/codepoint/; to the code above and it can then cover the whole Unicode space.
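As a sketch of what that widening buys you, here is the same bit-counting scheme applied to a single character with a 32-bit result (the function name and signature are mine, for illustration):

```c
/* Decode one UTF-8 sequence into a Unicode code point, counting the
 * lead byte's leading 1 bits exactly as the function above does.
 * Returns the number of bytes consumed, or 0 on a stray continuation byte. */
static int utf8_decode_one(const unsigned char *s, unsigned int *codepoint)
{
    unsigned char c = s[0];
    int nbytes, i;

    if (c < 0x80) {              /* plain ASCII */
        *codepoint = c;
        return 1;
    }
    if (c < 0xc0)                /* continuation byte with no lead byte */
        return 0;

    nbytes = 0;                  /* count the leading 1 bits */
    while (c & 0x80) {
        nbytes++;
        c <<= 1;
    }
    *codepoint = c >> nbytes;    /* payload bits of the lead byte */

    for (i = 1; i < nbytes; i++) /* fold in 6 bits per continuation byte */
        *codepoint = (*codepoint << 6) | (s[i] & 0x3f);
    return nbytes;
}
```

With an unsigned int accumulator, three-byte sequences (the rest of the BMP, e.g. U+1200 = E1 88 80) and even four-byte sequences decode with no further changes.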

Jon Bright
Great answer, thanks for your help.
RxScram