Case insensitive search in Unicode in C++ on Windows

I asked a similar question yesterday, but I realize that I need to rephrase it.

In short: In C++ on Windows, how do I do a case-insensitive search for a string (inside another string) when the strings are in Unicode format (wide char, wchar_t) and I don't know the language of the strings? I just want to know whether the needle exists in the haystack. The location of the needle isn't relevant to me.

Background: I have a repository containing a lot of email bodies. The messages are in different languages (Japanese, German, Russian, Finnish; you name it). All the data is in Unicode format, and I load it into wide strings (wchar_t) in my C++ application (the bodies have been MIME decoded, so in my debugger I can see the actual Japanese and German characters). I don't know the language of the messages, since email messages don't contain that detail. A single email body may also contain characters from several languages.

I'm looking for something like wcsstr, but with the ability to do the search in a case-insensitive manner. I know that it's not possible to do a 100% proper conversion from upper case to lower case without knowing the language of the text. I want a solution which works in the 99% of cases where it's possible.

I'm using Visual Studio 2008 with C++, STL and Boost.

I don't want to bundle a new large library just to do this. I'm looking for a solution which is available in Boost or in the Windows APIs.

Nitramk 2009-10-24 12:50:11

I downloaded http://download.icu-project.org/files/icu4c/4.2.1/icu4c-4_2_1-Win32-msvc9.zip to check, and the .lib files add up to about 200K and the DLLs add up to about 20M. That's not a lot in this day and age, and you may not actually need all of them for what you are doing. In any case, ICU is the right way to do Unicode.

Michael Dillon 2009-10-24 13:52:28

Considering the scope of what I'm trying to do, what problem with Ferruccio's solution would ICU solve?

Nitramk 2009-10-24 16:52:06

The icontains documentation says that it handles case insensitive matches only within a single locale. Since you are dealing with messages in many languages, it might not work. Of course, if you have the language identity recorded along with the message, then you may be able to do it with icontains(). ICU is a full-blown solution to UNICODE text manipulation and using it pays off in the future when you can apply it to many other problems.

Michael Dillon 2009-10-24 20:36:28
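
To make the "single locale" limitation in the comment above concrete, here is a minimal sketch of a locale-bound containment check written with only the standard library. It is not Boost's code; the names `equal_in_locale` and `contains_in_locale` are hypothetical, and the set of non-ASCII letters that actually gets folded depends entirely on the `std::locale` passed in and on the C++ runtime.

```cpp
#include <algorithm>
#include <locale>
#include <string>

// Hypothetical predicate: two characters are equal if they lower-case to the
// same value according to ONE given locale.
struct equal_in_locale
{
    explicit equal_in_locale(const std::locale& loc) : loc_(loc) {}

    bool operator()(wchar_t a, wchar_t b) const
    {
        // std::tolower(ch, loc) forwards to the locale's ctype<wchar_t> facet.
        return std::tolower(a, loc_) == std::tolower(b, loc_);
    }

    std::locale loc_;
};

// Hypothetical helper: true if needle occurs in haystack, ignoring case as
// defined by loc. The default argument copies the current global locale.
inline bool contains_in_locale(const std::wstring& haystack,
                               const std::wstring& needle,
                               const std::locale& loc = std::locale())
{
    if (needle.empty())
        return true; // an empty needle is treated as always present

    return std::search(haystack.begin(), haystack.end(),
                       needle.begin(), needle.end(),
                       equal_in_locale(loc)) != haystack.end();
}
```

With the classic "C" locale, only ASCII letters are guaranteed to be folded, which is the reason a single fixed locale may fall short for a mixed-language mail archive.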

Well, as I mentioned in my question, I don't know the language of the messages. And since it's theoretically impossible to do a 100% proper case conversion without knowing the language, I still don't understand what ICU adds over icontains. To me it sounds like terrible engineering to include a 20MB library to do a string search because I may need other parts of that 20MB library some time in the future.

Nitramk 2009-10-25 14:09:21

According to the Boost docs in another answer, icontains() requires the locale to be specified. If you don't have a locale then ICU allows for a nonspecific case-mapping that is better than nothing at all. The UNICODE spec covers case algorithms here http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf#G33992 and that is what ICU implements. You can use the simple case mapping defined here http://userguide.icu-project.org/transforms/casemappings and if you don't want to use full regular expressions, you can do a search http://userguide.icu-project.org/collation/icu-string-search-service

Michael Dillon 2009-10-25 14:54:50
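
The language-independent "simple case mapping" idea from the comment above can be illustrated without any library: each code point is folded through a fixed table, and no locale is consulted. The sketch below is an assumption-laden toy, not ICU's implementation. The helper names are hypothetical, and the table is a tiny hand-written subset (ASCII, Latin-1, basic Greek, basic Cyrillic); a real table would be generated from the Unicode data files or taken from a library.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: fold one code unit to lower case using a small,
// hand-written SUBSET of Unicode simple case mappings. Anything outside the
// listed ranges is returned unchanged.
inline wchar_t simple_fold(wchar_t c)
{
    const unsigned long u = static_cast<unsigned long>(c);

    if (u >= 0x0041 && u <= 0x005A)                // ASCII A-Z
        return static_cast<wchar_t>(u + 0x20);
    if (u >= 0x00C0 && u <= 0x00DE && u != 0x00D7) // Latin-1 capitals, skipping the multiplication sign
        return static_cast<wchar_t>(u + 0x20);
    if (u >= 0x0391 && u <= 0x03A9 && u != 0x03A2) // Greek capitals (U+03A2 is unassigned)
        return static_cast<wchar_t>(u + 0x20);
    if (u == 0x03C2)                               // Greek final sigma folds to sigma
        return static_cast<wchar_t>(0x03C3);
    if (u >= 0x0400 && u <= 0x040F)                // Cyrillic capitals U+0400..U+040F
        return static_cast<wchar_t>(u + 0x50);
    if (u >= 0x0410 && u <= 0x042F)                // Cyrillic capitals U+0410..U+042F
        return static_cast<wchar_t>(u + 0x20);
    return c;
}

// Predicate comparing two code units after folding.
struct folded_equal
{
    bool operator()(wchar_t a, wchar_t b) const
    {
        return simple_fold(a) == simple_fold(b);
    }
};

// Hypothetical helper: true if needle occurs anywhere in haystack,
// ignoring case for the ranges covered by simple_fold.
inline bool contains_folded(const std::wstring& haystack,
                            const std::wstring& needle)
{
    if (needle.empty())
        return true; // an empty needle is treated as always present

    return std::search(haystack.begin(), haystack.end(),
                       needle.begin(), needle.end(),
                       folded_equal()) != haystack.end();
}
```

A per-code-unit fold like this cannot handle mappings that change length (for example German ß versus SS) or language-specific rules such as the Turkish dotted and dotless i, which is the kind of gap the "99%" in the question accepts.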

My question was apparently too long this time. As I point out in my question, I'm well aware that I need to know the language to do it 100% properly. But since this is technically impossible, I'm asking for a solution which will work 99% of the time.

Nitramk 2009-10-25 14:04:31

What is the source of the strings that you are searching for? If they are provided by a user, then the user's locale is probably appropriate. Your question also doesn't explain why you think a case insensitive search is required.

Mark Thornton 2009-10-25 19:41:01
