I am looking for a method to compare and sort UTF-8 strings in C++ in a case-insensitive manner to use it in a custom collation function in SQLite.
- The method should ideally be locale-independent. However, I won't hold my breath: as far as I know, collation is very language-dependent, so anything that works for languages other than English will do, even if it means switching locales.
- Options include the standard C or C++ library, or a small (suitable for an embedded system), non-GPL (suitable for a proprietary system) third-party library.
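For context, this is roughly the shape of the comparison callback that SQLite expects for a custom collation. It is only a sketch: the names `utf8_nocase_collate`, `ascii_fold` and `"UTF8_NOCASE"` are made up for illustration, and the folding is a placeholder that handles ASCII letters only, which is exactly the part that needs a real solution. Note that SQLite passes byte lengths, and the buffers are not guaranteed to be NUL-terminated.

```cpp
#include <algorithm>

// Placeholder folding (assumption for illustration): lowers ASCII 'A'..'Z' only.
// Bytes >= 0x80, i.e. all non-ASCII UTF-8 sequences, pass through unchanged.
static unsigned char ascii_fold(unsigned char c) {
    return (c >= 'A' && c <= 'Z') ? static_cast<unsigned char>(c + ('a' - 'A')) : c;
}

// Has the signature of the xCompare callback taken by sqlite3_create_collation().
// len1/len2 are byte counts; the buffers may lack a terminating NUL.
int utf8_nocase_collate(void* /*ctx*/, int len1, const void* p1,
                        int len2, const void* p2) {
    const unsigned char* a = static_cast<const unsigned char*>(p1);
    const unsigned char* b = static_cast<const unsigned char*>(p2);
    const int n = std::min(len1, len2);
    for (int i = 0; i < n; ++i) {
        const int d = ascii_fold(a[i]) - ascii_fold(b[i]);
        if (d != 0) return d;
    }
    // One string is a prefix of the other: the shorter one sorts first.
    return len1 - len2;
}

// Registration would look like this (needs <sqlite3.h> and an open sqlite3* db):
//   sqlite3_create_collation(db, "UTF8_NOCASE", SQLITE_UTF8, nullptr,
//                            utf8_nocase_collate);
```

With this placeholder, "ABC" and "abc" compare equal, but "Ä" and "ä" do not, because their UTF-8 encodings differ in the second byte.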
What I have so far:
`strcoll` with C locales and `std::collate`/`std::collate_byname` are case-sensitive. (Are there case-insensitive versions of these?)

I tried to use the POSIX `strcasecmp`, but its behavior seems to be defined only for the "POSIX" locale:

> In the POSIX locale, strcasecmp() and strncasecmp() do upper to lower conversions, then a byte comparison. The results are unspecified in other locales.
And, indeed, the result of `strcasecmp` does not change between locales on Linux with glibc.

    #include <clocale>
    #include <cstdio>
    #include <cstring>    // strcoll
    #include <strings.h>  // strcasecmp (POSIX)

    static const char *s1 = "Äaa";
    static const char *s2 = "äaa";

    int main()
    {
        // Default "C" locale.
        printf("strcasecmp('%s', '%s') == %d\n", s1, s2, strcasecmp(s1, s2));
        printf("strcoll('%s', '%s') == %d\n", s1, s2, strcoll(s1, s2));

        // Check setlocale explicitly: a call inside assert() vanishes under NDEBUG.
        if (!setlocale(LC_ALL, "en_AU.UTF-8")) return 1;
        printf("strcasecmp('%s', '%s') == %d\n", s1, s2, strcasecmp(s1, s2));
        printf("strcoll('%s', '%s') == %d\n", s1, s2, strcoll(s1, s2));

        if (!setlocale(LC_ALL, "fi_FI.UTF-8")) return 1;
        printf("strcasecmp('%s', '%s') == %d\n", s1, s2, strcasecmp(s1, s2));
        printf("strcoll('%s', '%s') == %d\n", s1, s2, strcoll(s1, s2));
    }
This is printed:
    strcasecmp('Äaa', 'äaa') == -32
    strcoll('Äaa', 'äaa') == -32
    strcasecmp('Äaa', 'äaa') == -32
    strcoll('Äaa', 'äaa') == 7
    strcasecmp('Äaa', 'äaa') == -32
    strcoll('Äaa', 'äaa') == 7
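The -32 is consistent with a plain byte comparison: in UTF-8, "Ä" (U+00C4) is encoded as `C3 84` and "ä" (U+00E4) as `C3 A4`, so the first differing bytes are 0x84 and 0xA4, and neither is an ASCII letter that case folding would touch. A small sketch to confirm the arithmetic (`first_byte_difference` is a hypothetical helper, not a library function; the bytes are spelled out so the result does not depend on the source file encoding):

```cpp
// UTF-8 encodings written as explicit bytes.
static const unsigned char upper_a_umlaut[] = {0xC3, 0x84};  // "Ä", U+00C4
static const unsigned char lower_a_umlaut[] = {0xC3, 0xA4};  // "ä", U+00E4

// Returns the difference at the first differing byte, or 0 if the
// first n bytes are equal. Bytes are compared as unsigned values.
int first_byte_difference(const unsigned char* a, const unsigned char* b, int n) {
    for (int i = 0; i < n; ++i) {
        if (a[i] != b[i]) return static_cast<int>(a[i]) - static_cast<int>(b[i]);
    }
    return 0;
}
```

Calling `first_byte_difference(upper_a_umlaut, lower_a_umlaut, 2)` gives 0x84 - 0xA4 = -32, matching the value printed above.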
P.S. And yes, I am aware of ICU, but we can't use it on the embedded platform due to its enormous size.