ansaurus

Question

C++ iterate or split UTF-8 string into array of symbols?

Answer 1

+1 A:

Kirill V. Lyadvinsky 2010-05-17 21:26:38

"3rd-party-library-independent way"

topright 2010-05-17 21:54:44

+1 ICU is a cross-platform library released under a permissive open source license. If the OP wishes to avoid dependency on a 3rd party library, the source code of ICU is freely available but contains over 1e+06 lines of code.

cj 2010-05-17 22:30:34

It is open source. You could try using just the source of its string iterator rather than all 1e+06 lines of code.

Kirill V. Lyadvinsky 2010-05-18 04:15:14

Answer 2

+2 A:

If I understand correctly, you want to find the start of each UTF-8 character. If so, parsing them is fairly straightforward (interpreting them is a different matter), because the number of octets in each sequence is well-defined by the RFC (RFC 3629):

Char. number range  |        UTF-8 octet sequence
   (hexadecimal)    |              (binary)
--------------------+---------------------------------------------
0000 0000-0000 007F | 0xxxxxxx
0000 0080-0000 07FF | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

For example, if lb has the first octet of a UTF-8 character, I think the following would determine the number of octets involved.

unsigned char lb;

if (( lb & 0x80 ) == 0 )          // lead bit is zero, must be a single ascii
   printf( "1 octet\n" );
else if (( lb & 0xE0 ) == 0xC0 )  // 110x xxxx
   printf( "2 octets\n" );
else if (( lb & 0xF0 ) == 0xE0 ) // 1110 xxxx
   printf( "3 octets\n" );
else if (( lb & 0xF8 ) == 0xF0 ) // 1111 0xxx
   printf( "4 octets\n" );
else
   printf( "Unrecognized lead byte (%02x)\n", lb );

Ultimately, though, you are going to be much better off using an existing library as suggested in another post. The above code might categorize the characters according to octets, but it doesn't help "do" anything with them once that is finished.
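That said, the lead-byte test can be turned into the split the question asks about. The following is only a minimal sketch: the names `utf8_sequence_length` and `split_utf8` are made up for illustration, and the code checks only the lead byte, the sequence length and the `10xxxxxx` shape of continuation bytes. It does not reject overlong encodings or surrogates.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: number of octets announced by a lead byte,
// or 0 if the byte cannot start a UTF-8 sequence.
std::size_t utf8_sequence_length(unsigned char lb) {
    if ((lb & 0x80) == 0x00) return 1;  // 0xxxxxxx
    if ((lb & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lb & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lb & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;                           // continuation byte or invalid lead byte
}

// Hypothetical helper: append one std::string per UTF-8 symbol of s to out.
// Returns false on a bad lead byte, a truncated sequence or a malformed
// continuation byte; symbols found before the error stay in out.
bool split_utf8(const std::string& s, std::vector<std::string>& out) {
    std::size_t i = 0;
    while (i < s.size()) {
        std::size_t len = utf8_sequence_length(static_cast<unsigned char>(s[i]));
        if (len == 0 || len > s.size() - i) return false;
        for (std::size_t k = 1; k < len; ++k) {
            // Every octet after the lead byte must look like 10xxxxxx.
            if ((static_cast<unsigned char>(s[i + k]) & 0xC0) != 0x80) return false;
        }
        out.push_back(s.substr(i, len));
        i += len;
    }
    return true;
}
```

Each element of `out` is then one complete symbol, still encoded as UTF-8, so it can be printed or compared as an ordinary `std::string`.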

Mark Wilkins 2010-05-17 21:34:14

Thank you, useful answer, voted for it.

topright 2010-05-18 21:55:46

Answer 3

A:

Off the cuff:

#include <string>
#include <vector>

// Decode the UTF-8 string s into code points appended to u.
// Returns the number of bytes consumed. On success this equals s.length().
// On error it is the index of the byte where decoding failed.
// Always check the success flag: a sequence truncated at the end of the
// string also returns s.length(), but with success == false.
// Note: overlong encodings and surrogates are not rejected.
int convert(std::vector<int>& u, const std::string& s, bool& success) {
    success = false;
    int cp = 0;      // code point being assembled
    int runlen = 0;  // continuation bytes still expected
    for (std::string::const_iterator it = s.begin(), end = s.end(); it != end; ++it) {
        int ch = static_cast<unsigned char>(*it);
        if (runlen > 0) {
            // Continuation bytes must look like 10xxxxxx. The parentheses
            // matter: != binds tighter than &.
            if ((ch & 0xc0) != 0x80) return static_cast<int>(it - s.begin());
            cp = (cp << 6) + (ch & 0x3f);
            if (--runlen == 0) {
                u.push_back(cp);
                cp = 0;
            }
        }
        else if (ch < 0x80)  { u.push_back(ch); }                      // 0xxxxxxx
        else if (ch >= 0xf8) return static_cast<int>(it - s.begin());  // invalid lead byte
        else if (ch >= 0xf0) { cp = ch & 0x07; runlen = 3; }           // 11110xxx
        else if (ch >= 0xe0) { cp = ch & 0x0f; runlen = 2; }           // 1110xxxx
        else if (ch >= 0xc0) { cp = ch & 0x1f; runlen = 1; }           // 110xxxxx
        else return static_cast<int>(it - s.begin());                  // stray continuation byte
    }
    success = runlen == 0; // verify we ended between code points
    return static_cast<int>(s.length());
}

jmucchiello 2010-05-17 22:22:18

Thanks. Does endianness matter for this function?

topright 2010-05-17 22:27:16

"if (*it < 0x80) { u.push_back(*it); }" => "comparison is always true due to limited range of data type"

topright 2010-05-17 22:30:58

invalid conversion from `const char* const' to `char*'

topright 2010-05-17 22:33:32

Ok, I fixed the bugs. UTF8 is strictly byte level so endianness cannot matter.

jmucchiello 2010-05-18 14:56:07

What? "Endianness is the ordering of individually addressable sub-units (words, bytes, or even bits)" (http://en.wikipedia.org/wiki/Endianness). Multi-byte encoding depends on endianness.

topright 2010-05-19 19:48:33

I'll say it again. UTF8 is a byte level encoding. That means you read each byte sequentially. It doesn't matter what order the bits are transmitted in between IP ports or from main memory to the microprocessor's registers. When those bits are put back together they are interpreted the same way on all processors (19 == 19). Endianness is not an issue.
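To illustrate: the octets of a UTF-8 sequence are computed from the code point's numeric value with shifts and masks, and their order is fixed by the encoding itself. Shifts and masks operate on values, not on memory layout, so the result is the same on every processor. A minimal sketch follows; the name `encode_utf8` is made up for illustration.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encode one code point as UTF-8 octets.
// Returns an empty string for surrogates and for values above U+10FFFF.
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {                                   // 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                           // 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {                         // 1110xxxx 10xxxxxx 10xxxxxx
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogates are not encodable
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {                       // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Byte order only becomes a question for encodings whose code units are wider than one byte, such as UTF-16 or UTF-32.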

jmucchiello 2010-05-20 15:02:59

Answer 4

+1 A:

UTF8 CPP is exactly what you want

Nemanja Trifunovic 2010-05-17 23:47:06

I had already found this library myself. I needed code, but thanks anyway.

topright 2010-05-18 10:11:24

Answer 5

+1 A:

Solved using the tiny, platform-independent UTF8 CPP library:

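    // Requires the UTF8 CPP header ("utf8.h") and <cstring> for strlen.
    const char* str = text.c_str();           // utf-8 string
    const char* str_i = str;                  // string iterator
    const char* end = str + strlen(str) + 1;  // end iterator (includes the terminating '\0')

    do
    {
        uint32_t code = utf8::next(str_i, end); // get 32 bit code of a utf-8 symbol
        if (code == 0)
            continue;   // terminator reached; the loop condition then ends the loop

        unsigned char symbol[5] = {0,0,0,0,0};  // reset for every symbol
        utf8::append(code, symbol); // encode `code` back into `symbol` as UTF-8
    }
    while ( str_i < end );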

topright 2010-05-18 10:10:22
