I asked a similar question yesterday, but I realize that I need to rephrase it.

In short: In C++ on Windows, how do I do a case-insensitive search for a string (inside another string) when the strings are in Unicode format (wide char, wchar_t) and I don't know the language of the strings? I just want to know whether the needle exists in the haystack. The location of the needle isn't relevant to me.

Background: I have a repository containing a lot of email bodies. The messages are in different languages (Japanese, German, Russian, Finnish; you name it). All the data is in Unicode format, and I load it into wide strings (wchar_t) in my C++ application (the bodies have been MIME decoded, so in my debugger I can see the actual Japanese or German characters). I don't know the language of the messages, since email messages don't contain that detail. Also, a single email body may contain characters from several languages.

I'm looking for something like wcsstr, but with the ability to do the search in a case-insensitive manner. I know that it's not possible to do a 100% proper conversion from upper case to lower case without knowing the language of the text. I want a solution which works in the 99% of cases where it's possible.

I'm using Visual Studio 2008 with C++, STL and Boost.

+1  A: 

Boost String Algorithms has an icontains() function template which may do what you need.
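For readers who want to see the idea without pulling in Boost, here is a minimal standard-library sketch in the same spirit: search the haystack with a predicate that compares characters after upper-casing them through a locale's ctype facet. This is not Boost's code; `contains_ci` and `CaseInsensitiveEqual` are hypothetical names, and the default locale argument is an assumption. It compares one wchar_t at a time, so results for non-ASCII text depend on the locale and the runtime library.

```cpp
#include <algorithm>
#include <locale>
#include <string>

// Hypothetical predicate: two wide characters are "equal" if they
// upper-case to the same character in the given locale.
struct CaseInsensitiveEqual {
    explicit CaseInsensitiveEqual(const std::locale& loc)
        : ct_(&std::use_facet<std::ctype<wchar_t> >(loc)) {}

    bool operator()(wchar_t a, wchar_t b) const {
        return ct_->toupper(a) == ct_->toupper(b);
    }

    const std::ctype<wchar_t>* ct_;  // owned by the locale, not by us
};

// Hypothetical helper: true if needle occurs anywhere in haystack,
// ignoring case as defined by loc. An empty needle always matches.
bool contains_ci(const std::wstring& haystack,
                 const std::wstring& needle,
                 const std::locale& loc = std::locale())
{
    if (needle.empty())
        return true;
    return std::search(haystack.begin(), haystack.end(),
                       needle.begin(), needle.end(),
                       CaseInsensitiveEqual(loc)) != haystack.end();
}
```

A call such as `contains_ci(body, L"invoice")` returns only a yes/no answer, which matches the requirement that the needle's position does not matter.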

Ferruccio
So does that work with unicode strings?
Nitramk
It will work with both wchar_t* and std::wstring types or anything derived from std::basic_string<>.
Ferruccio
A: 

You should use the ICU library which provides support for Unicode regular expressions which follow the Unicode rules for case-insensitive matching. The library is available as C/C++ and Java libraries. Many other languages such as Python support a wrapper for the ICU libraries.

Michael Dillon
I don't want to bundle a new large library just to do this. I'm looking for a solution which is available in Boost or in the Windows APIs.
Nitramk
I downloaded http://download.icu-project.org/files/icu4c/4.2.1/icu4c-4_2_1-Win32-msvc9.zip to check, and the .lib files add up to about 200K and the DLLs add up to about 20M. That's not a lot in this day and age, and you may not actually need all of them for what you are doing. In any case, ICU is the right way to do Unicode.
Michael Dillon
Considering the scope of what I'm trying to do, what problem with Ferruccio's solution does ICU solve?
Nitramk
The icontains documentation says that it handles case-insensitive matches only within a single locale. Since you are dealing with messages in many languages, it might not work. Of course, if you have the language identity recorded along with the message, then you may be able to do it with icontains(). ICU is a full-blown solution to Unicode text manipulation, and using it pays off in the future when you can apply it to many other problems.
Michael Dillon
Well, as I mentioned in my question, I don't know the language of the messages. And since it's theoretically impossible to do a 100% proper case conversion without knowing the language, I still don't understand what ICU adds over icontains. To me it sounds like terrible engineering to include a 20MB library to do a string search because I may need other parts of that 20MB library some time in the future.
Nitramk
According to the Boost docs in another answer, icontains() requires the locale to be specified. If you don't have a locale, then ICU allows for a nonspecific case mapping that is better than nothing at all. The Unicode spec covers case algorithms here http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf#G33992 and that is what ICU implements. You can use the simple case mapping defined here http://userguide.icu-project.org/transforms/casemappings and, if you don't want to use full regular expressions, you can do a search with http://userguide.icu-project.org/collation/icu-string-search-service
Michael Dillon
A: 

You could convert both the needle and the haystack to lowercase (or uppercase) and then call wcsstr().
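A minimal sketch of that approach, assuming per-character lowering with towlower is acceptable; `lower_copy` and `contains_lowercased` are hypothetical names. Note the limits: towlower works on one wchar_t at a time, so it cannot handle mappings that change length (for example German ß to SS) or characters outside the BMP on Windows, and its behaviour for non-ASCII characters depends on the C locale set with setlocale and on the runtime library.

```cpp
#include <cwchar>
#include <cwctype>
#include <string>

// Hypothetical helper: returns a copy of s with every character
// passed through towlower (one wchar_t at a time).
std::wstring lower_copy(std::wstring s)
{
    for (std::wstring::size_type i = 0; i < s.size(); ++i)
        s[i] = static_cast<wchar_t>(std::towlower(s[i]));
    return s;
}

// Hypothetical helper: lower-case both strings, then let wcsstr do
// the plain substring search. wcsstr stops at the first embedded
// null character, so text after a L'\0' is not searched.
bool contains_lowercased(const std::wstring& haystack,
                         const std::wstring& needle)
{
    const std::wstring h = lower_copy(haystack);
    const std::wstring n = lower_copy(needle);
    return std::wcsstr(h.c_str(), n.c_str()) != 0;
}
```

If the same haystack is searched for many needles, lowering it once and reusing the copy avoids repeating the conversion on every search.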

Serge - appTranslator
+3  A: 

You have to specify the language to do case insensitive comparison. For example in Turkish, 'i' is NOT the lower case letter corresponding to 'I'. If the language appears not to be specified, then the comparison is being done with an implicitly selected language.

Mark Thornton
My question was apparently too long this time. As I point out in my question, I'm well aware that I need to know the language to do it 100% properly. But since this is technically impossible, I'm asking for a solution which will work 99% of the time.
Nitramk
What is the source of the strings that you are searching for? If they are provided by a user, then the user's locale is probably appropriate. Your question also doesn't explain why you think a case insensitive search is required.
Mark Thornton