I have a UTF-8-encoded char*.

Is there a standard function to calculate the number of visible characters represented by the byte array?

I'm on Red Hat (RHEL 5).

+1  A: 

Yes, GLib (glib.h) has g_utf8_strlen().

Check out this page for more information (including three implementations of an algorithm to do this).
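
A minimal sketch of the call, assuming GLib 2.x (compile with: gcc count.c $(pkg-config --cflags --libs glib-2.0)). Note that it counts code points, not grapheme clusters, as discussed in the comments below:

    #include <glib.h>
    #include <stdio.h>

    int main(void)
    {
        /* "héllo": 6 bytes, 5 code points (é is 2 bytes in UTF-8) */
        const char *s = "h\xc3\xa9llo";

        /* -1 means the string is NUL-terminated */
        glong n = g_utf8_strlen(s, -1);

        printf("%ld code points\n", n);  /* prints: 5 code points */
        return 0;
    }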

Evan Carroll
The documentation you pointed to doesn't indicate whether it counts non-independent characters like combining characters or diacritics. Do you know whether it does?
Bill
I think the UTF-8 parlance for what you're describing is "graphemes"; without the use of that term, I would not take it to mean combining characters or diacritics.
Evan Carroll
g_utf8_strlen() only counts code points within a UTF-8 string. Thus, if you had "e\u0301" (U+0301 is the combining acute accent), you'd get 2, while the string itself displays as é.
Thanatos
That whole second link seems to be a lesson in premature optimization.
Thanatos
I disagree with this latter remark. Library code is wholly different from application code: it ought to be as fast as possible, or it will be shunned from performance-critical code. Furthermore, I would argue it is not premature in the sense that it has been thought through and measured, and is not the result of a misconception.
Matthieu M.
+1  A: 

Check out the iconv library (man iconv_open). You can convert the UTF-8 string into, say, UCS-2 or UCS-4, where all characters are the same size. iconv is also (relatively) portable and not Linux- or GNU-specific.
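
A sketch of that approach, assuming glibc's iconv and the "UCS-4LE" encoding name; error handling is kept minimal, and a production version should loop on E2BIG with a growing buffer:

    #include <iconv.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Returns the number of code points in a NUL-terminated UTF-8
     * string, or (size_t)-1 on error. */
    size_t utf8_count(const char *utf8)
    {
        iconv_t cd = iconv_open("UCS-4LE", "UTF-8");
        if (cd == (iconv_t)-1)
            return (size_t)-1;

        size_t inleft = strlen(utf8);
        size_t outsize = inleft * 4;  /* UCS-4 uses 4 bytes per code point */
        char *outbuf = malloc(outsize + 4);
        char *inp = (char *)utf8, *outp = outbuf;
        size_t outleft = outsize, count = (size_t)-1;

        if (outbuf && iconv(cd, &inp, &inleft, &outp, &outleft) != (size_t)-1)
            count = (outsize - outleft) / 4;  /* bytes written / 4 = characters */

        free(outbuf);
        iconv_close(cd);
        return count;
    }

    int main(void)
    {
        printf("%zu\n", utf8_count("h\xc3\xa9llo"));  /* prints 5 */
        return 0;
    }

Note that this still counts code points: like g_utf8_strlen(), it will count a base character and a combining accent as two.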

If GLib, suggested in the other answer, is available to you (beware: it is GPLed), then use it instead, as it is the better way.

Dummy00001
glib is LGPL, not GPL.
Thanatos
Oh. Mea culpa. Then forget the iconv stuff.
Dummy00001