ansaurus

Question

Case-insensitive UTF-8 string collation for SQLite (C/C++)

Answer 1

A:

I don't think there's a standard C/C++ library function you can use. You'll have to roll your own or use a 3rd-party library. The full Unicode specification for locale-specific collation can be found here: http://www.unicode.org/reports/tr10/ (warning: this is a long document).

Adam Rosenfield 2008-10-08 02:15:37

Answer 2

A:

I agree there's no standard C/C++ library function for this, but that might not be the main problem in this posting.

From the amount of code posted, it's not obvious if the calls to assert really did anything. Maybe add something like this:

assert(printf("Debugging is enabled.\n"));

Windows programmer 2008-10-08 02:22:12

asserts are definitely enabled

Alex B 2008-10-08 02:28:27

Answer 3

A:

GotW #29 explains how you can create your own case-insensitive string class simply by writing your own traits for comparing characters. This way you get a string that behaves exactly like the STL one, except that comparisons are case-insensitive, and it takes fewer than 20 lines of code!
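A minimal sketch of that traits approach, written from memory rather than copied from the article; the names ci_char_traits, ci_string and ci_string_example are illustrative assumptions. It folds single bytes with std::toupper, so it is only case-insensitive for ASCII in the default "C" locale, and multi-byte UTF-8 sequences are still compared byte for byte.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Character traits that compare characters after upper-casing them.
// ASCII-only: std::toupper works on single bytes, so multi-byte UTF-8
// sequences are still compared case-sensitively.
struct ci_char_traits : public std::char_traits<char> {
    static unsigned char fold(char c) {
        return static_cast<unsigned char>(
            std::toupper(static_cast<unsigned char>(c)));
    }
    static bool eq(char a, char b) { return fold(a) == fold(b); }
    static bool lt(char a, char b) { return fold(a) < fold(b); }

    static int compare(const char* a, const char* b, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            if (lt(a[i], b[i])) return -1;
            if (lt(b[i], a[i])) return 1;
        }
        return 0;
    }

    static const char* find(const char* s, std::size_t n, char c) {
        for (std::size_t i = 0; i < n; ++i) {
            if (eq(s[i], c)) return s + i;
        }
        return nullptr;
    }
};

// Same interface as std::string, but with case-insensitive comparisons.
using ci_string = std::basic_string<char, ci_char_traits>;

// Usage example (hypothetical helper, assumes the default "C" locale).
void ci_string_example() {
    assert(ci_string("Hello") == ci_string("hELLO"));
    // "é" (C3 A9) and "É" (C3 89) are NOT treated as equal.
    assert(ci_string("\xC3\xA9") != ci_string("\xC3\x89"));
}
```

The second assertion shows the limitation: accented letters encoded as UTF-8 are not folded at all.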

Luc Touraille 2008-10-08 14:39:58

It is not the string abstraction, but rather the *implementation of the comparison function* that is the problem. The implementation in the link will only ever work with ASCII characters, since it uses the C library "toupper" and the non-standard "memicmp".

Alex B 2008-10-08 21:30:46

Answer 4

A:

On Windows you can fall back on the OS function CompareStringW and use the NORM_IGNORECASE flag. You'll have to convert your UTF-8 strings to UTF-16 first. Otherwise, take a look at IBM's International Components for Unicode (ICU).

Harold Ekstrom 2008-10-09 12:02:42

Answer 5

A:

I believe you will need to roll your own or use a third-party library. I recommend a third-party library because there are a lot of rules that need to be followed to get true international support. It is best to let someone who is an expert deal with them.

Ray 2008-10-09 13:00:33

Answer 6

A:

I have no definitive answer in the form of example code, but I should point out that a UTF-8 byte stream contains, in fact, Unicode characters, and you have to use the wchar_t versions of the C/C++ runtime library.

You have to convert those UTF-8 bytes into wchar_t strings first, though. This is not very hard, as the UTF-8 encoding standard is very well documented. I know this, because I've done it, but I can't share that code with you.
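For illustration only (this is not the code mentioned above), a decoder along those lines could look like the sketch below. It makes one assumption that differs from the text: it produces char32_t code points rather than wchar_t, because wchar_t is only 16 bits wide on Windows. The names decode_utf8 and decode_utf8_example are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Decodes UTF-8 into Unicode code points. Throws std::invalid_argument on
// malformed input: bad lead byte, truncated sequence, bad continuation
// byte, overlong form, surrogate, or a value above U+10FFFF.
std::u32string decode_utf8(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        char32_t lowest = 0;    // smallest value allowed for this length
        if (lead < 0x80) {
            cp = lead;
        } else if ((lead & 0xE0) == 0xC0) {
            cp = lead & 0x1F; extra = 1; lowest = 0x80;
        } else if ((lead & 0xF0) == 0xE0) {
            cp = lead & 0x0F; extra = 2; lowest = 0x800;
        } else if ((lead & 0xF8) == 0xF0) {
            cp = lead & 0x07; extra = 3; lowest = 0x10000;
        } else {
            throw std::invalid_argument("invalid UTF-8 lead byte");
        }
        if (in.size() - i <= extra) {
            throw std::invalid_argument("truncated UTF-8 sequence");
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) {
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            }
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < lowest || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw std::invalid_argument("overlong or out-of-range code point");
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}

// Usage example: "a", "ß" (C3 9F) and the euro sign (E2 82 AC).
void decode_utf8_example() {
    assert(decode_utf8("a\xC3\x9F\xE2\x82\xAC") == U"a\u00DF\u20AC");
}
```

Decoding only yields code points; case folding and collation still have to be applied on top of it.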

Dave Van den Eynde 2008-10-10 11:50:35

Answer 7

+2 A:

What you really want is logically impossible. There is no locale-independent, case-insensitive way of sorting strings. The simple counter-example: is "i" <> "I"? The naive answer is no, but in Turkish these strings are unequal: "i" is uppercased to "İ" (U+0130, Latin Capital Letter I with Dot Above).

UTF-8 strings add extra complexity to the question. They're perfectly valid multi-byte char* strings, if you have an appropriate locale. But neither the C nor the C++ standard defines such a locale; check with your vendor (there are too many embedded vendors, sorry, no general answer here). So you HAVE to pick a locale whose multi-byte encoding is UTF-8 for the mbscmp function to work. This of course influences the sort order, which is locale dependent. And if you have NO locale in which const char* is UTF-8, you can't use this trick at all. (As I understand it, Microsoft's CRT suffers from this. Their multi-byte code only handles characters up to 2 bytes; UTF-8 needs up to 4.)

wchar_t is not the standard solution either. It is supposedly so wide that you don't have to deal with multi-byte encodings, but your collation will still depend on the locale (LC_COLLATE). However, using wchar_t means you can now choose locales that do not use UTF-8 for const char*.

With this done, you can basically write your own ordering by converting strings to lowercase and comparing them. It's not perfect. Do you expect L"ß" == L"ss" ? They're not even the same length. Yet, for a German you have to consider them equal. Can you live with that?
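A rough sketch of that "lowercase, then compare" ordering, assuming wide strings and the standard std::towlower; the function names are made up for illustration. It maps one character to one character, which is exactly why "ß" and "ss" can never come out equal this way.

```cpp
#include <cassert>
#include <cwchar>
#include <cwctype>
#include <string>

// Lower-cases each wide character independently with std::towlower.
// The mapping is one-to-one and follows the current C locale (LC_CTYPE),
// so it can never turn "ß" into "ss".
std::wstring to_lower(std::wstring s) {
    for (wchar_t& c : s) {
        c = static_cast<wchar_t>(std::towlower(static_cast<std::wint_t>(c)));
    }
    return s;
}

// Returns <0, 0 or >0. The lower-cased strings are compared by character
// value, not by the locale's collation rules.
int compare_lowercased(const std::wstring& a, const std::wstring& b) {
    return to_lower(a).compare(to_lower(b));
}

// Usage example.
void compare_lowercased_example() {
    assert(compare_lowercased(L"Hello", L"hELLO") == 0);
    assert(compare_lowercased(L"\u00DF", L"ss") != 0);  // "ß" vs "ss"
}
```

How many non-ASCII letters std::towlower actually folds depends on the locale that is active at run time.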

MSalters 2008-10-10 13:28:08

About your example with the German "ß" character (and all such abundant cases): these must have been "solved" or otherwise dealt with thousands of times before, UTF-8 or no. MS Word has always had a "toggle case" feature - how did it work on that character in pre-Unicode versions? How did WordPerfect?

I am having the same problem as the OP, except I work in Delphi. I've seen a number of Windows sqlite-based apps that perform a case-insensitive SELECT (and I guess ORDER BY), whether they are installed in an English, German or (in my case) Polish locale. Try Firefox :) How do they do that?

moodforaday 2009-10-17 23:19:23

Usually incorrect :) Polish has IIRC no hard cases; all non-ASCII characters used in Polish are "based on" ASCII characters.

MSalters 2009-10-19 08:19:51

Answer 8

A:

If you only need searching and sorting for your own locale, I suggest that your function call a simple replace function that converts both multi-byte strings into one-byte-per-character strings, using a table like:

A -> a
Ã -> a
á -> a
ß -> ss
Ç -> c
and so on

Then simply call strcmp and return the results.
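A sketch of that idea, under the assumption that the input is valid UTF-8 and that a small hand-written table is enough for your locale. The table entries and the names fold_table, fold_for_compare and compare_folded are illustrative only; a real table has to cover every letter you care about.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstring>
#include <map>
#include <string>

// Hypothetical folding table: UTF-8 sequence -> ASCII replacement.
const std::map<std::string, std::string>& fold_table() {
    static const std::map<std::string, std::string> table = {
        {"\xC3\x83", "a"},   // Ã
        {"\xC3\xA1", "a"},   // á
        {"\xC3\x9F", "ss"},  // ß
        {"\xC3\x87", "c"},   // Ç
    };
    return table;
}

// Length of a UTF-8 sequence judged by its lead byte.
std::size_t sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 1;  // stray byte: copied through unchanged
}

// ASCII letters are lower-cased, sequences found in the table are
// replaced, everything else is copied unchanged.
std::string fold_for_compare(const std::string& utf8) {
    std::string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::size_t len = sequence_length(lead);
        if (len > utf8.size() - i) len = utf8.size() - i;  // truncated tail
        if (lead < 0x80) {
            out += static_cast<char>(std::tolower(lead));
        } else {
            const std::string seq = utf8.substr(i, len);
            const auto hit = fold_table().find(seq);
            out += (hit != fold_table().end()) ? hit->second : seq;
        }
        i += len;
    }
    return out;
}

// Folds both strings, then hands the result to strcmp.
int compare_folded(const std::string& a, const std::string& b) {
    return std::strcmp(fold_for_compare(a).c_str(),
                       fold_for_compare(b).c_str());
}

// Usage example: "STRASSE" and "Straße" compare equal after folding.
void compare_folded_example() {
    assert(compare_folded("STRASSE", "Stra\xC3\x9F" "e") == 0);
}
```

A function like compare_folded could then be wrapped in a callback registered with sqlite3_create_collation. Keep in mind that SQLite hands the callback pointer-plus-length buffers that are not NUL-terminated, so they need to be copied into strings first.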

2009-02-16 09:37:57
