Does anyone know of a more permissively licensed (MIT / public domain) version of this:

http://library.gnome.org/devel/glibmm/unstable/classGlib_1_1ustring.html

(a 'drop-in' replacement for std::string that's UTF-8 aware)

It is lightweight and does everything I need and more (I doubt I'll even use the UTF-XX conversions).

I really don't want to be carrying ICU around with me.

+1  A: 

You might be interested in the Flexible and Economical UTF-8 Decoder by Björn Höhrmann, though it is by no means a drop-in replacement for std::string.

Gregory Pakosz
std::wstring is easier if and only if you work on Windows; on the rest of the platforms, std::string with UTF-8 is the correct way.
Artyom
Yeah, UTF-16 would give me more problems than solutions. Looking at the flexible UTF-8 decoder though, thanks :)
xenthral
@Artyom except that storing UTF-8 bytes in `std::string` doesn't make the class count the characters correctly, for instance, whereas `std::wstring` works
Gregory Pakosz
@Gregory Pakosz - no, it does not. A `wchar_t` is not a code point, and of course not a character. Some code points are represented as two `wchar_t`s. UTF-16 is a variable-length encoding, just as UTF-8 is. So UTF-16 will just make your code more buggy.
Artyom
@Artyom: It's easier to get UTF-16 correct than UTF-8, if you care about security. It's also faster to process and smaller for anything not Latin or Arabic. Of course, go with the OS if you can save conversion, but there are very good reasons to pick UTF-16 over UTF-8 for internal use.
Simon Buchan
@Simon Buchan Read this carefully, including the accepted answer, before you suggest using UTF-16: http://stackoverflow.com/questions/1049947/should-utf-16-be-considered-harmful
Artyom
@Artyom: ... yes? You have a question stating your position, and you accepted an answer agreeing with you. Notepad on XP not correctly handling astral characters is hardly a reason to never ever use UTF-16. My point is, try to write a UTF-8 decoder that doesn't accept overlong sequences, and compare with the equivalent UTF-16 decoder.
Simon Buchan
@Simon Buchan, the problem is not Windows XP Notepad; the problem is that **every second** application that uses UTF-16 fails on surrogate pairs. UTF-16 is just a big mistake that exists because, once upon a time, Unicode was indeed 16-bit. And read my answer carefully: you rarely need to work with code points directly, for many reasons. So it is best to either work with UTF-8 and use existing code for code-point iteration, or use UTF-32. And to be honest? Stop thinking that Unicode support is about knowing how to split text into code points.
Artyom
@Artyom: If it's not about codepoints, why are you complaining about how codepoints are stored in bytes?
Simon Buchan
@Artyom > you're right about wchar_t, my bad, just don't know why I didn't keep in mind UTF-16 is variable length as well
Gregory Pakosz
+5  A: 
  1. std::string is fine for UTF-8 storage.
  2. If you need to analyze the text itself, UTF-8 awareness will not help you much, as too many things in Unicode do not work on a per-code-point basis.

Take a look at the Boost.Locale library (it uses ICU under the hood).

It is not lightweight, but it allows you to handle Unicode correctly and it uses std::string as storage.

If you expect to find a lightweight Unicode-aware library for dealing with strings, you will not find one, because Unicode is not lightweight. Even relatively "simple" operations like upper-case/lower-case conversion or Unicode normalization require complex algorithms and access to the Unicode database.

If you need the ability to iterate over code points (which, BTW, are not characters), take a look at http://utfcpp.sourceforge.net/
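
To make "iterating over code points" concrete, here is a minimal hand-rolled sketch. It is not utfcpp's API; the function name `decode_code_points` is made up for illustration, and it assumes the input is already valid UTF-8 (it performs no validation at all).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of utfcpp): decodes a std::string that is
// ASSUMED to hold valid UTF-8 into a sequence of code points.
// It does not reject overlong, truncated, or otherwise malformed input.
std::vector<char32_t> decode_code_points(const std::string& s) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len;  // number of bytes in this sequence
        char32_t cp;      // payload bits taken from the lead byte
        if (lead < 0x80)      { len = 1; cp = lead; }         // 0xxxxxxx
        else if (lead < 0xE0) { len = 2; cp = lead & 0x1F; }  // 110xxxxx
        else if (lead < 0xF0) { len = 3; cp = lead & 0x0F; }  // 1110xxxx
        else                  { len = 4; cp = lead & 0x07; }  // 11110xxx
        // Each continuation byte (10xxxxxx) contributes six more bits.
        for (std::size_t k = 1; k < len && i + k < s.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

For the two bytes `"\xC3\xBC"` (the precomposed "ü") this yields the single code point U+00FC. For untrusted input a validating library such as utfcpp is the safer choice.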

Answer to comment:

1) Find file formats for files included by me

std::string::find is perfectly fine for this.
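
As a rough sketch of why plain byte search is enough here: in UTF-8, an ASCII byte such as '.' never occurs inside a multi-byte sequence, so searching for it cannot produce a false match. The helper name `extension_of` is hypothetical, and it uses `rfind` (the reverse counterpart of `find`) to locate the last dot.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the extension (including the dot) of a
// UTF-8 encoded file name, or an empty string if there is no dot.
// Works on raw bytes: the byte 0x2E ('.') cannot appear inside a
// multi-byte UTF-8 sequence, so no decoding is needed.
std::string extension_of(const std::string& filename) {
    std::size_t dot = filename.rfind('.');
    if (dot == std::string::npos) {
        return std::string();
    }
    return filename.substr(dot);
}
```

A check like `extension_of(name) == ".png"` then behaves the same whether the rest of the name is ASCII or not.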

2) Line break detection

This is not a simple issue. Have you ever tried to find a line break in Chinese or Japanese text? Probably not, as spaces do not separate words there. So line-break detection is a hard job. (I don't think even glib does this correctly; I think only pango has something like that.)

And of course Boost.Locale does this, and does it correctly.

And if you need to do this for European languages only, just search for spaces or punctuation marks, so std::string::find is more than fine.

3) Character (or now, code point) counting Looking at utfcpp thx

Characters are not code points. For example, the Hebrew word Shalom, "שָלוֹם", consists of 4 characters and 6 Unicode code points, where two of the code points are used for vowels. The same holds for European languages, where a single character can be represented with two code points; for example, "ü" can be represented as "u" followed by a combining diaeresis "¨" (U+0308): two code points.
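
The difference can be seen with a small sketch that counts code points in a UTF-8 `std::string` by counting every byte that is not a continuation byte. The name `count_code_points` is made up for illustration, and the input is assumed to be valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a string ASSUMED to be valid
// UTF-8. Continuation bytes have the bit pattern 10xxxxxx, so every other
// byte starts a new code point.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

With this, the precomposed "ü" (`"\xC3\xBC"`, U+00FC) counts as 1 code point, while the decomposed form (`"u\xCC\x88"`, U+0075 followed by U+0308) counts as 2, even though both display as the same single character. Neither number matches `size()`, which reports bytes (2 and 3 respectively).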

So if you are aware of these issues then utfcpp will be fine, otherwise you will not find anything simpler.

Artyom
"and if you expect to find Unicode-aware lightweight library to deal with strings, you'll not find such things"Except I did - as linked, but, alas its GPL'ed..."...And even "simple" stuff like upper-case, lower-case conversion or normalization..."Nah I definitely don't need things like thatI think I only look inside my std::strings to:1) Find file formats for files included by me (so, its definitely '.png' '.jpg', lowercase as I shipped it)2) Line break detection3) Character (or now, code point) countingLooking at utfcpp thx!
xenthral
@xenthral See my answer.
Artyom
On 2) - yeah, I was aware that line break detection can fail, so I made sure that in the worst case it would just break anywhere it could, so that's not a problem. On 3) oops, I got a bit lost in the jargon there, thanks for the correction. I think utfcpp will do fine then.
xenthral
Very good analysis, too many people (and I did not so long ago) don't understand the difference between a grapheme (the character drawn) and Unicode code points... all because of those terrible diacritics (I am French).
Matthieu M.
+2  A: 

I have never used it, but I stumbled upon this UTF-8 CPP library a while ago and liked it enough to bookmark it. It is released under a BSD-like license, IIUC.

It still relies on std::string for strings and provides lots of utility functions that help with checking that the string is really UTF-8, counting the number of characters, going back or forward by one character … It is really small and lives only in header files: looks really good!
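
To give an idea of what "checking that the string is really UTF-8" involves, here is a minimal hand-written sketch. It is not the library's code or API; the name `is_valid_utf8` is hypothetical. It rejects stray or missing continuation bytes, truncated sequences, overlong encodings, surrogates, and values above U+10FFFF.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not the library's implementation): returns true
// if `s` is well-formed UTF-8.
bool is_valid_utf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len;  // sequence length implied by the lead byte
        char32_t cp;      // code point being assembled
        char32_t min_cp;  // smallest code point allowed for this length
        if (lead < 0x80)                { len = 1; cp = lead;        min_cp = 0; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; min_cp = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; min_cp = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; min_cp = 0x10000; }
        else return false;  // stray continuation byte or invalid lead byte

        if (i + len > s.size()) return false;  // truncated sequence

        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < min_cp) return false;                   // overlong encoding
        if (cp > 0x10FFFF) return false;                 // beyond Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
        i += len;
    }
    return true;
}
```

For example, the two bytes `"\xC0\xAF"` (an overlong encoding of '/') are rejected, while `"\xC3\xBC"` ("ü") is accepted.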

Didier Trosset