My std::string is UTF-8 encoded, so str.length() returns the number of bytes rather than the number of characters.

I found this information but I'm not sure how I can use it to do this:

The following byte sequences are used to represent a character. The sequence to be used depends on the UCS code number of the character:

   0x00000000 - 0x0000007F:
       0xxxxxxx

   0x00000080 - 0x000007FF:
       110xxxxx 10xxxxxx

   0x00000800 - 0x0000FFFF:
       1110xxxx 10xxxxxx 10xxxxxx

   0x00010000 - 0x001FFFFF:
       11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

How can I find the actual length of a UTF-8 encoded std::string? Thanks

+1  A: 

Try an encoding library such as iconv; it probably has the API you want.

An alternative is to implement your own utf8strlen, which determines the byte length of each code point and iterates over code points instead of bytes.

Omry
+5  A: 

One of the projects I contribute to has a small function that does that:

http://openlierox.git.sourceforge.net/git/gitweb.cgi?p=openlierox/openlierox;a=blob;f=include/Unicode.h;h=a523b464fc65a7ad875e683cd830b41c9a01934a;hb=HEAD

Look for Utf8StringSize. It depends on another tiny function in the same header file.

dark_charlie
Can I use a few of these functions for my project?
Milo
Sure, that's why the project is opensource :) Some more useful functions are in include/StringUtils.h, src/common/StringUtils.cpp, src/common/Unicode.cpp.
dark_charlie
Great thanks a lot!
Milo
Careful: according to the header, it's licensed under the LGPL, which means you're only good using it if your project is also open source (and in the restricted GPL sense, not the MIT/BSD really open sense). If your project isn't L/GPL, you may have an issue (legally/ethically/etc.); just be aware.
Nick
@Nick: Well, I can assure you we won't sue anyone.
dark_charlie
+2  A: 

You should probably take the advice of Omry and look into a specialized library for this. That said, if you just want to understand the algorithm to do this, I'll post it below.

Basically, you can convert your string into a wider-element format, such as wchar_t. Note that wchar_t has a few portability issues, because its size varies by platform. On Windows, wchar_t is 2 bytes, and therefore suited to representing UTF-16. On UNIX/Linux it is four bytes and is therefore used to represent UTF-32. So on Windows this will only work if you don't include any Unicode code points above 0xFFFF, while on Linux a wchar_t can hold the entire range of code points. (Fortunately, this issue will be mitigated by the C++0x Unicode character types.)
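
To make the portability point concrete, here is a minimal sketch of how many wchar_t elements a single code point needs on a given platform. The name wchar_units_needed is hypothetical, and the sketch assumes 8-bit bytes and that a 2-byte wchar_t holds UTF-16, as described above.

```cpp
#include <cstddef>

// Hypothetical helper: number of wchar_t elements needed for one code point.
// Where wchar_t is at least 4 bytes, every code point fits in one element.
// Where it is 2 bytes (assumed to hold UTF-16), code points above 0xFFFF
// need two elements (a surrogate pair).
constexpr std::size_t wchar_units_needed(char32_t cp) {
    return (sizeof(wchar_t) >= 4 || cp <= 0xFFFF) ? 1 : 2;
}
```

On a platform with a 2-byte wchar_t, the length of a converted std::wstring therefore only matches the number of code points while every code point stays at or below 0xFFFF.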

With that caveat noted, you can create a conversion function using the following algorithm:

#include <iostream>
#include <iterator>
#include <string>

template <class OutputIterator>
inline OutputIterator convert(const unsigned char* it, const unsigned char* end, OutputIterator out) 
{
    while (it != end) 
    {
        if (*it < 192) *out++ = *it++; // single byte character
        else if (*it < 224 && it + 1 < end && *(it+1) > 127) { 
            // double byte character
            *out++ = ((*it & 0x1F) << 6) | (*(it+1) & 0x3F);
            it += 2;
        }
        else if (*it < 240 && it + 2 < end && *(it+1) > 127 && *(it+2) > 127) { 
            // triple byte character
            *out++ = ((*it & 0x0F) << 12) | ((*(it+1) & 0x3F) << 6) | (*(it+2) & 0x3F);
            it += 3;
        }
        else if (*it < 248 && it + 3 < end && *(it+1) > 127 && *(it+2) > 127 && *(it+3) > 127) { 
            // 4-byte character
            *out++ = ((*it & 0x07) << 18) | ((*(it+1) & 0x3F) << 12) |
                ((*(it+2) & 0x3F) << 6) | (*(it+3) & 0x3F);
            it += 4;
        }
        else ++it; // Invalid byte sequence (throw an exception here if you want)
    }

    return out;
}

int main()
{
    std::string s = "\u00EAtre";
    std::cout << s.length() << std::endl; // byte count (5 if the literal is encoded as UTF-8)

    std::wstring output;
    convert(reinterpret_cast<const unsigned char*>(s.c_str()),
        reinterpret_cast<const unsigned char*>(s.c_str()) + s.length(), std::back_inserter(output));

    std::cout << output.length() << std::endl; // Actual length
}

The algorithm isn't fully generic, because the input iterator needs to point to unsigned char, so that you can interpret each byte as having a value between 0 and 0xFF. The OutputIterator is generic (just so you can use a std::back_inserter and not worry about memory allocation), but its use as a generic parameter is limited: it has to output to a sequence of elements large enough to represent a UTF-16 or UTF-32 character, such as wchar_t, uint32_t or the C++0x char32_t type. Also, I didn't include code to convert character byte sequences greater than 4 bytes, but you should get the point of how the algorithm works from what's posted.

Also, if you just want to count the number of characters, rather than output to a new wide-character buffer, you can modify the algorithm to include a counter rather than an OutputIterator. Or better yet, just use Marcelo Cantos' answer to count the first-bytes.
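
A minimal sketch of that counter-based modification, under some assumptions: the names count_code_points and is_continuation are hypothetical, the function takes a std::string rather than a pointer pair, and it checks continuation bytes as 10xxxxxx, which is slightly stricter than the `> 127` test in convert. Like convert, it skips bytes that do not start a well-formed sequence instead of throwing.

```cpp
#include <cstddef>
#include <string>

// True for continuation bytes of the form 10xxxxxx.
static bool is_continuation(unsigned char b) { return (b & 0xC0) == 0x80; }

// Hypothetical counting variant: increments a counter once per sequence
// instead of writing the decoded code point to an OutputIterator.
std::size_t count_code_points(const std::string& s) {
    const unsigned char* it = reinterpret_cast<const unsigned char*>(s.data());
    const unsigned char* end = it + s.size();
    std::size_t count = 0;
    while (it != end) {
        const std::size_t left = static_cast<std::size_t>(end - it);
        std::size_t n = 0;  // sequence length implied by the lead byte, 0 if invalid
        if (*it < 0x80) n = 1;
        else if ((*it & 0xE0) == 0xC0) n = 2;
        else if ((*it & 0xF0) == 0xE0) n = 3;
        else if ((*it & 0xF8) == 0xF0) n = 4;

        // The whole sequence must be present and use continuation bytes.
        bool ok = n != 0 && n <= left;
        for (std::size_t i = 1; ok && i < n; ++i) ok = is_continuation(it[i]);

        if (ok) { ++count; it += n; }
        else ++it;  // invalid or truncated sequence: skip one byte
    }
    return count;
}
```

Skipping invalid bytes keeps the behaviour close to convert; replacing the final `++it` with a throw would make it strict instead.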

Charles Salvia
+1 for a very detailed answer
dark_charlie
On the nitpicking front, what makes you think the "être" string will use UTF-8 encoding? I believe it's non-standard in C/C++ to use non-ASCII characters in source code (and indeed, some compilers will choose another encoding).
Bahbar
@Bahbar, good point. It should in fact use `\u` hex notation.
Charles Salvia
+8  A: 

Count all first-bytes (the ones that don't match 10xxxxxx).

int len = 0;
while (*s) len += (*s++ & 0xc0) != 0x80;
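
The same idea as a self-contained function, as a minimal sketch: it takes a std::string instead of a NUL-terminated pointer, so embedded '\0' bytes are counted too. The name utf8_count is hypothetical, the range-based for loop needs C++11, and the function assumes the input is already valid UTF-8; it performs no validation.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points by counting every byte that is
// not a continuation byte (10xxxxxx). Assumes valid UTF-8 input.
std::size_t utf8_count(const std::string& s) {
    std::size_t len = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++len;
    }
    return len;
}
```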
Marcelo Cantos
sth
@sth: Thanks for the tip. Amended.
Marcelo Cantos
+1  A: 

This is a naive implementation, but it should be helpful for you to see how this is done:

std::size_t utf8_length(std::string const &s) {
  std::size_t len = 0;
  std::string::const_iterator begin = s.begin(), end = s.end();
  while (begin != end) {
    unsigned char c = *begin;
    int n;
    if      ((c & 0x80) == 0)    n = 1;
    else if ((c & 0xE0) == 0xC0) n = 2;
    else if ((c & 0xF0) == 0xE0) n = 3;
    else if ((c & 0xF8) == 0xF0) n = 4;
    else throw std::runtime_error("utf8_length: invalid UTF-8");

    if (end - begin < n) {
      throw std::runtime_error("utf8_length: string too short");
    }
    for (int i = 1; i < n; ++i) {
      if ((begin[i] & 0xC0) != 0x80) {
        throw std::runtime_error("utf8_length: expected continuation byte");
      }
    }
    ++len;  // one code point, regardless of how many bytes it occupies
    begin += n;
  }
  return len;
}
Roger Pate
A: 

The UTF-8 CPP library has a function that does just that. You can either include the library in your project (it is small) or just look at the function. http://utfcpp.sourceforge.net/

const char* twochars = "\xe6\x97\xa5\xd1\x88";
size_t dist = utf8::distance(twochars, twochars + 5);
assert (dist == 2);
Nemanja Trifunovic
+1  A: 

I recommend you use UTF8-CPP. It's a header-only library for working with UTF-8 in C++. With this lib, it would look something like this:

int LengthOfUtf8String( const std::string &utf8_string ) 
{
    return utf8::distance( utf8_string.begin(), utf8_string.end() ); 
}

(Code is off the top of my head.)

Lucas