ansaurus

Question

Getting the actual length of a UTF-8 encoded std::string?

Answer 1

+1 A:

Try an encoding library such as iconv; it probably has the API you want.

An alternative is to implement your own utf8strlen, which determines the length of each code point and iterates over code points instead of bytes.

Omry 2010-10-31 12:56:53

Answer 2

+5 A:

One of the projects I contribute to has a small function that does that:

http://openlierox.git.sourceforge.net/git/gitweb.cgi?p=openlierox/openlierox;a=blob;f=include/Unicode.h;h=a523b464fc65a7ad875e683cd830b41c9a01934a;hb=HEAD

Look for Utf8StringSize. It depends on another tiny function in the same header file.

dark_charlie 2010-10-31 12:58:21

Can I use a few of these functions for my project?

Milo 2010-10-31 13:11:47

Sure, that's why the project is open source :) Some more useful functions are in include/StringUtils.h, src/common/StringUtils.cpp, src/common/Unicode.cpp.

dark_charlie 2010-10-31 13:13:02

Great, thanks a lot!

Milo 2010-10-31 13:15:58

Careful: according to the header, it's licensed under the LGPL, which means you're only good using it if your project is also open source (and in the restricted GPL sense, not the MIT/BSD really open sense). If your project isn't L/GPL, you may have an issue (legally/ethically/etc.); just be aware.

Nick 2010-10-31 17:00:31

@Nick: Well, I can assure you we won't sue anyone.

dark_charlie 2010-10-31 18:38:32

Answer 3

+2 A:

You should probably take the advice of Omry and look into a specialized library for this. That said, if you just want to understand the algorithm to do this, I'll post it below.

Basically, you can convert your string into a wider-element format, such as wchar_t. Note that wchar_t has a few portability issues, because its size varies by platform. On Windows, wchar_t is 2 bytes, and therefore suited to representing UTF-16. But on UNIX/Linux, it is four bytes and is therefore used to represent UTF-32. Therefore, on Windows this will only work if you don't include any Unicode code points above 0xFFFF. On Linux you can include the entire range of code points in a wchar_t. (Fortunately, this issue will be mitigated with the C++0x Unicode character types.)
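
To make the wchar_t caveat concrete, here is a minimal sketch of how many 16-bit units a code point needs in UTF-16. The helper name utf16_units is made up for illustration and is not part of any library; it also assumes the input is a valid code point (it does not reject the surrogate range itself).

```cpp
#include <cstddef>

// Hypothetical helper: number of 16-bit code units needed to store one
// Unicode code point in UTF-16. Code points up to 0xFFFF fit in one unit;
// anything above needs a surrogate pair (two units).
std::size_t utf16_units(char32_t cp) {
    return cp > 0xFFFF ? 2 : 1;
}
```

So with a 2-byte wchar_t, a code point such as U+1F600 would occupy two elements, and the wide string's length() would no longer equal the number of code points.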

With that caveat noted, you can create a conversion function using the following algorithm:

#include <iostream>
#include <iterator>
#include <string>

template <class OutputIterator>
inline OutputIterator convert(const unsigned char* it, const unsigned char* end, OutputIterator out)
{
    while (it != end)
    {
        if (*it < 128) *out++ = *it++; // single-byte (ASCII) character
        else if (*it < 192) ++it; // stray continuation byte (10xxxxxx): invalid, skip it
        else if (*it < 224 && it + 1 < end && *(it+1) > 127) {
            // double byte character
            *out++ = ((*it & 0x1F) << 6) | (*(it+1) & 0x3F);
            it += 2;
        }
        else if (*it < 240 && it + 2 < end && *(it+1) > 127 && *(it+2) > 127) {
            // triple byte character
            *out++ = ((*it & 0x0F) << 12) | ((*(it+1) & 0x3F) << 6) | (*(it+2) & 0x3F);
            it += 3;
        }
        else if (*it < 248 && it + 3 < end && *(it+1) > 127 && *(it+2) > 127 && *(it+3) > 127) {
            // 4-byte character
            *out++ = ((*it & 0x07) << 18) | ((*(it+1) & 0x3F) << 12) |
                ((*(it+2) & 0x3F) << 6) | (*(it+3) & 0x3F);
            it += 4;
        }
        else ++it; // Invalid byte sequence (throw an exception here if you want)
    }

    return out;
}

int main()
{
    std::string s = "\u00EAtre";
    std::cout << s.length() << std::endl; // length in bytes

    std::wstring output;
    convert(reinterpret_cast<const unsigned char*>(s.c_str()),
        reinterpret_cast<const unsigned char*>(s.c_str()) + s.length(), std::back_inserter(output));

    std::cout << output.length() << std::endl; // Actual length
}

The algorithm isn't fully generic, because the input must be a range of unsigned char, so that each byte can be interpreted as a value between 0 and 0xFF. The OutputIterator is generic (just so you can use a std::back_inserter and not worry about memory allocation), but its usefulness as a generic parameter is limited: it has to write to a container whose elements are large enough to hold a UTF-16 or UTF-32 code unit, such as wchar_t, uint32_t or the C++0x char32_t type. Also, I didn't include code for byte sequences longer than 4 bytes (the original UTF-8 design allowed up to 6, but RFC 3629 restricts valid UTF-8 to at most 4), but you should get the point of how the algorithm works from what's posted.

Also, if you just want to count the number of characters, rather than output to a new wide-character buffer, you can modify the algorithm to include a counter rather than an OutputIterator. Or better yet, just use Marcelo Cantos' answer to count the first-bytes.

Charles Salvia 2010-10-31 13:12:21

+1 for a very detailed answer

dark_charlie 2010-10-31 13:26:39

On the nitpicking front, what makes you think the "être" string will use UTF-8 encoding? I believe it's non-standard in C/C++ to use non-ASCII characters in source code (and indeed, some compilers will choose another encoding).

Bahbar 2010-10-31 16:15:34

@Bahbar, good point. It should in fact use `\u` hex notation.

Charles Salvia 2010-10-31 16:46:11
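
One way to sidestep the source- and execution-charset question raised in these comments is to spell out the UTF-8 bytes with hex escapes. A minimal sketch (the function name etre_utf8 is invented for this example):

```cpp
#include <string>

// "être" written as explicit UTF-8 bytes: U+00EA is encoded as 0xC3 0xAA.
// Splitting the literal after the escape is a defensive habit: had the
// next letter been a hex digit (a-f), it would have been absorbed into
// the \x escape. Adjacent string literals are concatenated by the compiler.
std::string etre_utf8() {
    return "\xC3\xAA" "tre";
}
```

This string always holds 5 bytes, whatever encoding the compiler assumes for the source file, while it represents 4 code points.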

Answer 4

+8 A:

Count all first-bytes (the ones that don't match 10xxxxxx).

// s is a const char* pointing at a NUL-terminated UTF-8 string
int len = 0;
while (*s) len += (*s++ & 0xc0) != 0x80;
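
The same idea can be applied to a std::string directly, which also avoids stopping at an embedded NUL byte. This is only a sketch: the name utf8_count is hypothetical, and it assumes the input is already valid UTF-8 (it does no validation).

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Counts code points by counting every byte that is NOT a continuation
// byte (bit pattern 10xxxxxx). Assumes the input is valid UTF-8.
std::size_t utf8_count(const std::string& s) {
    return static_cast<std::size_t>(
        std::count_if(s.begin(), s.end(), [](char c) {
            // Cast first so the mask works the same whether char is signed or not.
            return (static_cast<unsigned char>(c) & 0xC0) != 0x80;
        }));
}
```

For example, utf8_count("\xC3\xAA" "tre") counts 4 code points in a 5-byte string.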

Marcelo Cantos 2010-10-31 13:13:37

sth 2010-10-31 13:21:36

@sth: Thanks for the tip. Amended.

Marcelo Cantos 2010-10-31 13:23:10

Answer 5

+1 A:

This is a naive implementation, but it should be helpful for you to see how this is done:

#include <cstddef>
#include <stdexcept>
#include <string>

std::size_t utf8_length(std::string const &s) {
  std::size_t len = 0;
  std::string::const_iterator begin = s.begin(), end = s.end();
  while (begin != end) {
    unsigned char c = *begin;
    int n;
    if      ((c & 0x80) == 0)    n = 1;
    else if ((c & 0xE0) == 0xC0) n = 2;
    else if ((c & 0xF0) == 0xE0) n = 3;
    else if ((c & 0xF8) == 0xF0) n = 4;
    else throw std::runtime_error("utf8_length: invalid UTF-8");

    if (end - begin < n) {
      throw std::runtime_error("utf8_length: string too short");
    }
    for (int i = 1; i < n; ++i) {
      if ((begin[i] & 0xC0) != 0x80) {
        throw std::runtime_error("utf8_length: expected continuation byte");
      }
    }
    ++len;  // one code point per sequence (adding n would count bytes)
    begin += n;
  }
  return len;
}

Roger Pate 2010-10-31 13:22:55

Answer 6

A:

The UTF-8 CPP library has a function that does just that. You can either include the library in your project (it is small) or just look at the function. http://utfcpp.sourceforge.net/

const char* twochars = "\xe6\x97\xa5\xd1\x88";
size_t dist = utf8::distance(twochars, twochars + 5);
assert (dist == 2);

Nemanja Trifunovic 2010-10-31 14:02:35

Answer 7

+1 A:

I recommend you use UTF8-CPP. It's a header-only library for working with UTF-8 in C++. With this lib, it would look something like this:

int LengthOfUtf8String( const std::string &utf8_string )
{
    return utf8::distance( utf8_string.begin(), utf8_string.end() ); 
}

(Code is off the top of my head.)

Lucas 2010-10-31 15:43:17
