C++ UTF-8 lightweight & permissive code?

std::string is fine for UTF-8 storage.
If you need to analyze the text itself, UTF-8 awareness alone will not help you much, because too many things in Unicode do not work on a per-code-point basis.

Take a look at the Boost.Locale library (it uses ICU under the hood):

Reference http://cppcms.sourceforge.net/boost_locale/html/
Tutorial http://cppcms.sourceforge.net/boost_locale/html/tutorial.html
Download https://sourceforge.net/projects/cppcms/files/

It is not lightweight, but it lets you handle Unicode correctly, and it uses std::string as storage.

If you expect to find a lightweight Unicode-aware library for dealing with strings, you will not find one, because Unicode is not lightweight. Even relatively "simple" things like upper-case/lower-case conversion or Unicode normalization require complex algorithms and access to the Unicode database.

If you need the ability to iterate over code points (which, BTW, are not characters), take a look at http://utfcpp.sourceforge.net/

Answer to comment:

1) Find file formats for files included by me

std::string::find is perfectly fine for this.
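
A minimal sketch of why plain byte search is enough here: in UTF-8, bytes below 0x80 only ever encode ASCII characters, so searching for '.' or '/' can never match inside a multi-byte sequence. The helper name `file_extension` is made up for this illustration, and it only treats '/' as a path separator.

```cpp
#include <string>

// Hypothetical helper: returns the extension including the dot
// (e.g. ".png"), or an empty string if the file name has none.
// Works on UTF-8 input because '.' and '/' are ASCII bytes that
// never appear inside a multi-byte UTF-8 sequence.
std::string file_extension(const std::string& path) {
    const std::string::size_type slash = path.find_last_of('/');
    const std::string::size_type dot = path.rfind('.');
    if (dot == std::string::npos) return "";
    // A dot that belongs to a directory name is not an extension.
    if (slash != std::string::npos && dot < slash) return "";
    return path.substr(dot);
}
```

Comparing the result against ".png" or ".jpg" with `==` is then a plain byte comparison, which is fine as long as the extensions are shipped in a known case.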

2) Line break detection

This is not a simple issue. Have you ever tried to find a line break in Chinese/Japanese text? Probably not, as spaces do not separate words there. So line-break detection is a hard job. (I don't think even glib does this correctly; I think only pango has something like that.)

And of course Boost.Locale does this, and does it correctly.

And if you need to do this for European languages only, just search for spaces or punctuation marks, so std::string::find is more than fine.
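
For the space-separated case, something along these lines might do. It is only a sketch: `wrap_at_spaces` is a made-up name, the limit is counted in bytes rather than display columns, and only the ASCII space is treated as a break opportunity. Because it only ever cuts at a space byte, it never splits a multi-byte UTF-8 sequence.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: splits `text` into lines of at most `max_bytes`
// bytes, breaking only at ASCII spaces. A word longer than the limit is
// kept whole and overflows the line. Assumes max_bytes > 0.
std::vector<std::string> wrap_at_spaces(const std::string& text,
                                        std::string::size_type max_bytes) {
    std::vector<std::string> lines;
    std::string::size_type start = 0;
    while (text.size() - start > max_bytes) {
        // Last space at or before the byte limit of the current line.
        std::string::size_type space = text.rfind(' ', start + max_bytes);
        if (space == std::string::npos || space <= start) {
            // No usable space inside the limit: break at the next one.
            space = text.find(' ', start + max_bytes);
            if (space == std::string::npos) break;
        }
        lines.push_back(text.substr(start, space - start));
        start = space + 1;  // skip the space itself
    }
    lines.push_back(text.substr(start));
    return lines;
}
```

This is nowhere near what Boost.Locale's boundary analysis does; it is just the "search for space" approach spelled out.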

3) Character (or now, code point) counting. Looking at utfcpp, thx

Characters are not code points. For example, the Hebrew word Shalom, "שָלוֹם", consists of 4 characters and 6 Unicode code points, where two of the code points are used for vowels. The same holds for European languages, where a single character can be represented with two code points: for example, "ü" can be represented as "u" followed by a combining "¨", which is two code points.

So if you are aware of these issues then utfcpp will be fine, otherwise you will not find anything simpler.
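
To make the code point vs. character difference concrete, here is a small hand-rolled counter (not utfcpp's API, just an illustration; the name `count_code_points` is made up). It assumes the input is already valid UTF-8 and simply counts every byte that is not a continuation byte (`10xxxxxx`).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a UTF-8 string by
// counting the bytes that start a sequence, i.e. everything except
// continuation bytes of the form 10xxxxxx.
// Assumes valid UTF-8; it does no validation at all.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

With this, the precomposed "ü" (bytes `C3 BC`) counts as one code point, while "u" followed by the combining diaeresis U+0308 (bytes `75 CC 88`) counts as two, although both display as a single character. Counting what the user sees as characters needs grapheme boundary analysis, which is what the heavier libraries provide.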

std::wstring is easier if and only if you work on Windows; on the rest of the platforms, std::string with UTF-8 is the correct way.

Artyom 2010-06-08 11:27:11

Yeah UTF-16 would give me more problems than solutions. Looking at flexible UTF-8 decoder though, thanks :)

xenthral 2010-06-08 11:32:41

@Artyom except that storing UTF-8 bytes in `std::string` doesn't make the class count the characters correctly for instance whereas `std::wstring` works

Gregory Pakosz 2010-06-08 11:41:38

@Gregory Pakosz - no, it does not. `wchar_t` != `code-point`, and of course it is not a character. Some code points are represented as two `wchar_t`. UTF-16 is a variable-length encoding, just as UTF-8 is. So UTF-16 will just make your code more buggy.

Artyom 2010-06-08 11:44:53

@Artyom: It's easier to get UTF-16 correct than UTF-8, if you care about security. It's also faster to process and smaller for anything not Latin or Arabic. Of course, go with the OS if you can save conversion, but there are very good reasons to pick UTF-16 over UTF-8 for internal use.

Simon Buchan 2010-06-08 12:01:55

@Simon Buchan Read this carefully, including accepted answer before you suggest to use UTF-16: http://stackoverflow.com/questions/1049947/should-utf-16-be-considered-harmful

Artyom 2010-06-08 12:48:35

@Artyom: ... yes? You have a question stating your position, and you accepted an answer agreeing with you. Notepad on XP not correctly handling astral characters is hardly a reason to never ever use UTF-16. My point is, try to write a UTF-8 decoder that doesn't accept overlong sequences, and compare with the equivalent UTF-16 decoder.

Simon Buchan 2010-06-09 07:39:08

@Simon Buchan, the problem is not Windows XP notepad, the problem is that **every second** application that uses UTF-16 fails on surrogate pairs. UTF-16 is just a big mistake that exists because once upon a time Unicode was indeed 16-bit. And read my answer carefully: you rarely need to use code points as-is, for many reasons. So it is best to either work with UTF-8 and use existing code for code-point iteration, or use UTF-32. And to be honest? Stop thinking that Unicode support is about knowing how to separate text into code points.

Artyom 2010-06-09 07:52:11

@Artyom: If it's not about codepoints, why are you complaining about how codepoints are stored in bytes?

Simon Buchan 2010-06-09 09:19:59

@Artyom > you're right about wchar_t, my bad, just don't know why I didn't keep in mind UTF-16 is variable length as well

Gregory Pakosz 2010-07-13 12:40:36

"and if you expect to find Unicode-aware lightweight library to deal with strings, you'll not find such things"

Except I did - as linked, but, alas its GPL'ed...

"...And even "simple" stuff like upper-case, lower-case conversion or normalization..."

Nah I definitely don't need things like that

I think I only look inside my std::strings to:

1) Find file formats for files included by me (so, its definitely '.png' '.jpg', lowercase as I shipped it)
2) Line break detection
3) Character (or now, code point) counting

Looking at utfcpp thx!

xenthral 2010-06-08 11:29:03

@xenthral See my answer.

Artyom 2010-06-08 11:42:31

on 2) - yeah I was aware that line break detection can fail, so I made sure it would at worst case just break anywhere it could, so thats not a problem.

on 3) oops, I got lost on the jargon there a bit, thanks for the correction. Think utfcpp will do fine then.

xenthral 2010-06-08 11:47:31

Very good analysis, too many people (and I did not so long ago) don't understand the difference between a grapheme (the character drawn) and Unicode code points... all because of those terrible diacritics (I am French).

Matthieu M. 2010-06-08 12:50:21
