views:

3421

answers:

8

I am looking for a method to compare and sort UTF-8 strings in C++ in a case-insensitive manner to use it in a custom collation function in SQLite.

  1. The method should ideally be locale-independent. However, I won't hold my breath: as far as I know, collation is very language-dependent, so anything that works for languages other than English will do, even if it means switching locales.
  2. Options include the standard C or C++ library, or a small (suitable for an embedded system) and non-GPL (suitable for a proprietary system) third-party library.

What I have so far:

  1. strcoll with C locales and std::collate/std::collate_byname are case-sensitive. (Are there case-insensitive versions of these?)
  2. I tried POSIX strcasecmp, but its behaviour seems to be unspecified for locales other than "POSIX":

    In the POSIX locale, strcasecmp() and strncasecmp() do upper to lower conversions, then a byte comparison. The results are unspecified in other locales.

    And, indeed, the result of strcasecmp does not change between locales on Linux with GLIBC.

    #include <clocale>
    #include <cstdio>
    #include <cassert>
    #include <cstring>
    #include <strings.h>  // POSIX declares strcasecmp here
    
    
    const static char *s1 = "Äaa";
    const static char *s2 = "äaa";
    
    
    int main() {
        printf("strcasecmp('%s', '%s') == %d\n", s1, s2, strcasecmp(s1, s2));
        printf("strcoll('%s', '%s') == %d\n", s1, s2, strcoll(s1, s2));
        assert(setlocale(LC_ALL, "en_AU.UTF-8"));
        printf("strcasecmp('%s', '%s') == %d\n", s1, s2, strcasecmp(s1, s2));
        printf("strcoll('%s', '%s') == %d\n", s1, s2, strcoll(s1, s2));
        assert(setlocale(LC_ALL, "fi_FI.UTF-8"));
        printf("strcasecmp('%s', '%s') == %d\n", s1, s2, strcasecmp(s1, s2));
        printf("strcoll('%s', '%s') == %d\n", s1, s2, strcoll(s1, s2));
    }
    

    This is printed:

    strcasecmp('Äaa', 'äaa') == -32
    strcoll('Äaa', 'äaa') == -32
    strcasecmp('Äaa', 'äaa') == -32
    strcoll('Äaa', 'äaa') == 7
    strcasecmp('Äaa', 'äaa') == -32
    strcoll('Äaa', 'äaa') == 7
    

P. S.

And yes, I am aware of ICU, but we can't use it on the embedded platform due to its enormous size.

A: 

I don't think there's a standard C/C++ library function you can use. You'll have to roll your own or use a 3rd-party library. The full Unicode specification for locale-specific collation can be found here: http://www.unicode.org/reports/tr10/ (warning: this is a long document).

Adam Rosenfield
A: 

I agree there's no standard C/C++ library function for this, but that might not be the main problem in this posting.

From the amount of code posted, it's not obvious if the calls to assert really did anything. Maybe add something like this:

assert(printf("Debugging is enabled.\n"));
Windows programmer
asserts are definitely enabled
Alex B
A: 

GotW #29 explains how you can create your own case-insensitive string class simply by writing your own traits for comparing characters. This way you get a string that behaves exactly like the STL one, except that comparisons are case-insensitive, in fewer than 20 lines of code!
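
For reference, here is a minimal sketch of that traits technique, reconstructed from memory rather than copied from the article. The names ci_char_traits and ci_string are illustrative. It folds one byte at a time with std::toupper, so it is only meaningful for ASCII; the bytes of a multi-byte UTF-8 character are left alone.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Character traits that compare single bytes case-insensitively.
// Assumption: ASCII text in the default "C" locale. Bytes that belong to
// multi-byte UTF-8 sequences are not folded.
struct ci_char_traits : std::char_traits<char> {
    static int fold(char c) {
        return std::toupper(static_cast<unsigned char>(c));
    }
    static bool eq(char a, char b) { return fold(a) == fold(b); }
    static bool lt(char a, char b) { return fold(a) < fold(b); }
    static int compare(const char* a, const char* b, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            if (lt(a[i], b[i])) return -1;
            if (lt(b[i], a[i])) return 1;
        }
        return 0;
    }
    static const char* find(const char* s, std::size_t n, char c) {
        for (std::size_t i = 0; i < n; ++i) {
            if (eq(s[i], c)) return s + i;
        }
        return nullptr;
    }
};

// Same interface as std::string, but comparisons ignore ASCII case.
using ci_string = std::basic_string<char, ci_char_traits>;
```

With this, ci_string("Hello") == ci_string("hELLO") holds, while the UTF-8 encodings of "Ä" and "ä" still compare as different.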

Luc Touraille
It is not the string abstraction, but rather the *implementation of the comparison function* that is the problem. The implementation in the link will only ever work with ASCII characters, since it uses the C library "toupper" and the non-standard "memicmp".
Alex B
A: 

On Windows you can fall back on the OS function CompareStringW and use the NORM_IGNORECASE flag. You'll have to convert your UTF-8 strings to UTF-16 first. Otherwise, take a look at IBM's International Components for Unicode.

Harold Ekstrom
A: 

I believe you will need to roll your own or use a third-party library. I recommend a third-party library because there are a lot of rules that need to be followed to get true international support; it is best to let someone who is an expert deal with them.

Ray
A: 

I have no definitive answer in the form of example code, but I should point out that a UTF-8 byte stream contains, in fact, Unicode characters, and you have to use the wchar_t versions of the C/C++ runtime library.

You have to convert those UTF-8 bytes into wchar_t strings first, though. This is not very hard, as the UTF-8 encoding standard is very well documented. I know this, because I've done it, but I can't share that code with you.
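
As a rough illustration of that conversion step (this is not the code mentioned above), a decoder can look like the sketch below. It is written under a few assumptions: the helper name decode_utf8 is made up, it produces char32_t code points instead of wchar_t because wchar_t is only 16 bits wide on some platforms, and it does not reject overlong encodings or surrogate code points.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decode a UTF-8 byte string into Unicode code points.
// Hypothetical helper for illustration; validation is deliberately minimal.
std::u32string decode_utf8(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t len = 0;
        // The lead byte tells how many bytes the sequence has.
        if (lead < 0x80)              { cp = lead;        len = 1; }
        else if ((lead >> 5) == 0x06) { cp = lead & 0x1F; len = 2; }
        else if ((lead >> 4) == 0x0E) { cp = lead & 0x0F; len = 3; }
        else if ((lead >> 3) == 0x1E) { cp = lead & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");
        if (len > in.size() - i) {
            throw std::invalid_argument("truncated UTF-8 sequence");
        }
        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) {
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            }
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

For the string from the question, the bytes C3 84 61 61 decode to the three code points U+00C4, U+0061, U+0061.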

Dave Van den Eynde
+2  A: 

What you really want is logically impossible. There is no locale-independent, case-insensitive way of sorting strings. The simple counter-example: is "i" equal to "I" when case is ignored? The naive answer is yes, but in Turkish these strings are unequal: "i" is uppercased to "İ" (U+0130, Latin Capital Letter I With Dot Above).

UTF-8 strings add extra complexity to the question. They're perfectly valid multi-byte char* strings, if you have an appropriate locale. But neither the C nor the C++ standard defines such a locale; check with your vendor (too many embedded vendors, sorry, no general answer here). So you HAVE to pick a locale whose multi-byte encoding is UTF-8 for the mbscmp function to work. This of course influences the sort order, which is locale-dependent. And if you have NO locale in which const char* is UTF-8, you can't use this trick at all. (As I understand it, Microsoft's CRT suffers from this. Their multi-byte code only handles characters up to 2 bytes; UTF-8 needs up to 4.)

wchar_t is not the standard solution either. It is supposedly wide enough that you don't have to deal with multi-byte encodings, but your collation will still depend on the locale (LC_COLLATE). However, using wchar_t means you can now choose locales that do not use UTF-8 for const char*.

With this done, you can basically write your own ordering by converting strings to lowercase and comparing them. It's not perfect. Do you expect L"ß" == L"ss" ? They're not even the same length. Yet, for a German you have to consider them equal. Can you live with that?
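
A minimal sketch of that lowercase-and-compare ordering, assuming the strings have already been converted to std::wstring. The function name compare_lowercased is made up. std::towlower follows the current LC_CTYPE locale and maps one character to one character, so "ß" and "ss" can never come out equal, and the result is ordered by character value, not by any language's collation rules.

```cpp
#include <cstddef>
#include <cwctype>
#include <string>

// Compare two wide strings after lowercasing each character.
// Returns a negative value, zero, or a positive value, like strcmp.
// Hypothetical helper: the case mapping depends on the current locale.
int compare_lowercased(const std::wstring& a, const std::wstring& b) {
    const std::size_t n = a.size() < b.size() ? a.size() : b.size();
    for (std::size_t i = 0; i < n; ++i) {
        const std::wint_t x = std::towlower(static_cast<std::wint_t>(a[i]));
        const std::wint_t y = std::towlower(static_cast<std::wint_t>(b[i]));
        if (x != y) return x < y ? -1 : 1;
    }
    // All shared positions match: the shorter string sorts first.
    if (a.size() == b.size()) return 0;
    return a.size() < b.size() ? -1 : 1;
}
```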

MSalters
About your example with the German "ß" character (and all such abundant cases): these must have been "solved" or otherwise dealt with thousands of times before, UTF-8 or not. MS Word has always had a "toggle case" feature. How did it work on that character in pre-Unicode versions? How did WordPerfect handle it? I am having the same problem as the OP, except that I work in Delphi. I've seen a number of Windows SQLite-based apps that perform a case-insensitive SELECT (and I guess ORDER BY), whether they are installed in an English, German or (in my case) Polish locale. Try Firefox :) How do they do that?
moodforaday
Usually incorrect :) Polish has IIRC no hard cases; all non-ASCII characters used in Polish are "based on" ASCII characters.
MSalters
A: 

If you are using it for searching and sorting in your locale only, I suggest that your function call a simple replace function that converts both multi-byte strings into one-byte-per-character strings, using a table like:

A -> a
à -> a
á -> a
ß -> ss
Ç -> c
and so on

Then simply call strcmp and return the results.
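
The table approach could be sketched roughly as follows. Everything here is an assumption made for illustration: the table is deliberately tiny, the names fold_key and compare_folded are made up, and ASCII capitals are lowercased directly instead of being listed in the table.

```cpp
#include <cstddef>
#include <cstring>
#include <map>
#include <string>

// Hypothetical folding table: UTF-8 sequence -> ASCII replacement.
// A real table would cover every character the application expects.
static const std::map<std::string, std::string>& fold_table() {
    static const std::map<std::string, std::string> table = {
        {"\xC3\xA0", "a"},   // à
        {"\xC3\xA1", "a"},   // á
        {"\xC3\x84", "a"},   // Ä
        {"\xC3\xA4", "a"},   // ä
        {"\xC3\x9F", "ss"},  // ß
        {"\xC3\x87", "c"},   // Ç
        {"\xC3\xA7", "c"},   // ç
    };
    return table;
}

// Number of bytes in the UTF-8 sequence that starts with this lead byte.
static std::size_t sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;
    if ((lead >> 5) == 0x06) return 2;
    if ((lead >> 4) == 0x0E) return 3;
    if ((lead >> 3) == 0x1E) return 4;
    return 1;  // invalid lead byte: pass it through as a single byte
}

// Build the comparison key: table hits are replaced, ASCII capitals are
// lowercased, everything else is copied unchanged.
std::string fold_key(const std::string& in) {
    std::string key;
    for (std::size_t i = 0; i < in.size();) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t len = sequence_length(lead);
        if (len > in.size() - i) len = in.size() - i;  // truncated tail
        const std::string seq = in.substr(i, len);
        const auto hit = fold_table().find(seq);
        if (hit != fold_table().end()) {
            key += hit->second;
        } else if (len == 1 && lead >= 'A' && lead <= 'Z') {
            key += static_cast<char>(lead - 'A' + 'a');
        } else {
            key += seq;
        }
        i += len;
    }
    return key;
}

// Fold both strings, then compare the keys with strcmp as suggested above.
int compare_folded(const std::string& a, const std::string& b) {
    return std::strcmp(fold_key(a).c_str(), fold_key(b).c_str());
}
```

With this table, "Äaa" and "äaa" from the question compare equal, and "Straße" folds to "strasse". If this is wired into a SQLite collation callback, keep in mind that the callback receives pointer-and-length pairs, not NUL-terminated strings, so the inputs would need to be copied into std::string with their lengths first.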