- std::string is fine for UTF-8 storage.
- If you need to analyze the text itself, UTF-8 awareness alone will not help you much, because many Unicode operations do not work on a per-code-point basis.
Take a look at the Boost.Locale library (it uses ICU under the hood).
It is not lightweight, but it lets you handle Unicode correctly, and it uses std::string
as storage.
If you expect to find a lightweight Unicode-aware library for dealing with strings, you will not find one, because Unicode is not lightweight. Even relatively "simple" operations like upper-/lower-case conversion or Unicode normalization require complex algorithms and access to the Unicode character database.
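For illustration, a minimal sketch of case conversion with Boost.Locale (the locale name and the sample string are my own choices, and you need to link with the Boost.Locale library):

```cpp
#include <boost/locale.hpp>
#include <iostream>
#include <string>

int main() {
    boost::locale::generator gen;          // locale generator backed by ICU
    std::locale loc = gen("en_US.UTF-8");  // assumed locale name

    std::string s = "grüße";
    // Full Unicode case mapping: "ß" upper-cases to "SS", changing the
    // string's length -- something a per-character lookup table cannot do.
    std::cout << boost::locale::to_upper(s, loc) << "\n"; // prints "GRÜSSE"
}
```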
If you need the ability to iterate over code points (which, BTW, are not characters),
take a look at http://utfcpp.sourceforge.net/
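A minimal sketch of code-point iteration with utfcpp, assuming the library's utf8.h header is on your include path (the sample string is my own):

```cpp
#include <cstdint>
#include <iostream>
#include <string>
#include "utf8.h" // from http://utfcpp.sourceforge.net/

int main() {
    std::string text = "שָלוֹם"; // UTF-8 encoded
    std::string::iterator it = text.begin();
    while (it != text.end()) {
        // Decode one code point and advance past its bytes
        // (throws on malformed UTF-8)
        uint32_t cp = utf8::next(it, text.end());
        std::cout << "U+" << std::hex << cp << "\n";
    }
}
```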
Answer to comment:
1) Find file formats for files included by me
std::string::find is perfectly fine for this.
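Byte-wise search is safe on UTF-8 because the encoding is self-synchronizing: a valid pattern can never match starting in the middle of another character's byte sequence. A tiny sketch (contains_extension is a hypothetical helper, not from the question):

```cpp
#include <string>

// Plain byte-wise search works on UTF-8 text because a match can never
// begin inside another character's byte sequence.
bool contains_extension(const std::string& utf8_text, const std::string& ext) {
    return utf8_text.find(ext) != std::string::npos;
}
```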
2) Line break detection
This is not a simple issue. Have you ever tried to find a line-break point in Chinese or Japanese text? Probably not, because spaces do not separate words there. So line-break detection is a hard job. (I don't think even glib does this correctly; I think only Pango has something like that.)
And of course Boost.Locale does this, and does it correctly; see the sketch below.
If you need to do this for European languages only, just search for spaces or punctuation marks, so std::string::find
is more than fine.
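A rough sketch of Unicode-aware line-break detection with Boost.Locale's boundary analysis (the ja_JP.UTF-8 locale and the sample sentence are my own assumptions; this needs the ICU backend):

```cpp
#include <boost/locale.hpp>
#include <iostream>
#include <string>

int main() {
    namespace bl = boost::locale::boundary;
    boost::locale::generator gen;
    std::locale loc = gen("ja_JP.UTF-8"); // assumed locale name

    const std::string text = "私は日本語を話します。"; // no spaces between words
    // Index every position where a line break is allowed, following the
    // Unicode line-breaking rules rather than just looking for spaces
    bl::ssegment_index lines(bl::line, text.begin(), text.end(), loc);
    for (bl::ssegment_index::iterator it = lines.begin(); it != lines.end(); ++it)
        std::cout << "[" << *it << "]\n";
}
```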
3) Character (or now, code point) counting -- looking at utfcpp, thx
Characters are not code points. For example, the Hebrew word shalom, "שָלוֹם", consists of 4 characters but 6 code points, where two of the code points represent the vowels. The same happens in European languages, where a single character can be represented by two code points; for example, "ü" can be represented as "u" followed by a combining "¨" -- two code points.
So if you are aware of these issues, utfcpp will be fine for you; otherwise, you will not
find anything simpler.
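To make the difference concrete, here is a sketch that counts the same word both ways: code points with utfcpp, and user-perceived characters (grapheme clusters) with Boost.Locale/ICU. The he_IL.UTF-8 locale name is my assumption:

```cpp
#include <boost/locale.hpp>
#include <iostream>
#include <iterator>
#include <string>
#include "utf8.h" // from http://utfcpp.sourceforge.net/

int main() {
    const std::string word = "שָלוֹם"; // 4 perceived characters, 6 code points

    // Code points, counted with utfcpp
    std::cout << utf8::distance(word.begin(), word.end())
              << " code points\n"; // 6

    // User-perceived characters (grapheme clusters), via Boost.Locale/ICU
    namespace bl = boost::locale::boundary;
    boost::locale::generator gen;
    std::locale loc = gen("he_IL.UTF-8"); // assumed locale name
    bl::ssegment_index chars(bl::character, word.begin(), word.end(), loc);
    std::cout << std::distance(chars.begin(), chars.end())
              << " characters\n"; // 4
}
```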