views:

20204

answers:

20

What is the best way of doing case-insensitive string comparison in C++ without transforming a string to all upper or lower case?

Also, whatever methods you present, are they Unicode friendly? Are they portable?

+2  A: 

Assuming you are looking for a method and not a magic function that already exists, there is frankly no better way. We could all write code snippets with clever tricks for limited character sets, but at the end of the day, at some point you have to convert the characters.

The best approach for this conversion is to do so prior to the comparison. This allows you a good deal of flexibility when it comes to encoding schemes, which your actual comparison operator should be ignorant of.

You can of course 'hide' this conversion behind your own string function or class, but you still need to convert the strings prior to comparison.

Andrew Grant
+5  A: 

Visual C++ string functions supporting Unicode: http://msdn.microsoft.com/en-us/library/cc194799.aspx

The one you are probably looking for is _wcsnicmp.

Darren Kopp
A: 

This link outlines the solution nicely.

Since it uses strcmp() function, which works on Unicode data, the resulting function would be Unicode friendly too.

Pascal
The only specific Unicode case that strcmp properly handles is when a string encoded with a byte-based encoding (like utf-8) contains only code points below U+00FF - then the byte-per-byte comparison is enough.
Johann Gerell
A: 

Pascal,

Neither strcmp nor any of the other str* functions work with Unicode. Even if they did, strcmp is case-sensitive, and stricmp (the case-insensitive variant) assumes ANSI encoding when performing the checks.

Andrew Grant
+1  A: 
Wedge
True, although overREADing a buffer is significantly less dangerous than overWRITEing a buffer.
Adam Rosenfield
+1  A: 

I've had good experience using the International Components for Unicode libraries - they're extremely powerful, and provide methods for conversion, locale support, date and time rendering, case mapping (which you don't seem to want), and collation, which includes case- and accent-insensitive comparison (and more). I've only used the C++ version of the libraries, but they appear to have a Java version as well.

Methods exist to perform normalized compares as referred to by @Coincoin, and can even account for locale - for example (and this is a sorting example, not strictly equality), traditionally in Spanish (in Spain), the letter combination "ll" sorts between "l" and "m", so "lz" < "ll" < "ma".

Blair Conrad
+20  A: 

Are you talking about a dumb case insensitive compare or a full normalized Unicode compare?

A dumb compare will not find strings that might be the same but are not binary equal.

Example:

U+212B (ANGSTROM SIGN)
U+0041 (LATIN CAPITAL LETTER A) + U+030A (COMBINING RING ABOVE)
U+00C5 (LATIN CAPITAL LETTER A WITH RING ABOVE)

These are all equivalent, but they have different binary representations.
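To make the point concrete, here is a small sketch that writes the three sequences out as UTF-8 bytes and compares them the "dumb" way. The variable names, the helper binary_equal and the choice of UTF-8 are assumptions made for illustration only.

```cpp
#include <cassert>
#include <string>

// The three spellings of the same visible character, as UTF-8 bytes.
// (UTF-8 is an assumption here; other encodings differ in the same way.)
const std::string angstrom_sign = "\xE2\x84\xAB";  // U+212B
const std::string a_plus_ring   = "A\xCC\x8A";     // U+0041 followed by U+030A
const std::string a_with_ring   = "\xC3\x85";      // U+00C5

// Hypothetical helper: a plain binary comparison, byte for byte.
inline bool binary_equal(const std::string& x, const std::string& y) {
    return x == y;
}

// None of the three pairs compares equal, although all render the same.
inline void check_binary_mismatch() {
    assert(!binary_equal(angstrom_sign, a_plus_ring));
    assert(!binary_equal(a_plus_ring, a_with_ring));
    assert(!binary_equal(angstrom_sign, a_with_ring));
}
```

Making these compare equal needs a normalization step first, which the standard library does not provide.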

That said, Unicode Normalization should be a mandatory read, especially if you plan on supporting Hangul, Thai and other Asian languages.

Also, IBM pretty much patented most optimized Unicode algorithms and made them publicly available. They also maintain an implementation: IBM ICU.

Coincoin
+2  A: 

I'm trying to cobble together a good answer from all the posts, so help me edit this:

Here is a method of doing this, but it will NOT be Unicode friendly. Although it does transform the strings and is not Unicode friendly, it should be portable, which is a plus:

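#include <algorithm>
#include <cctype>
#include <string>

bool caseInsensitiveStringCompare( const std::string& str1, const std::string& str2 ) {
    std::string str1Cpy( str1 );
    std::string str2Cpy( str2 );
    // Go through unsigned char: passing a negative char to tolower is undefined behavior.
    auto lower = []( unsigned char c ) { return static_cast<char>( std::tolower( c ) ); };
    std::transform( str1Cpy.begin(), str1Cpy.end(), str1Cpy.begin(), lower );
    std::transform( str2Cpy.begin(), str2Cpy.end(), str2Cpy.begin(), lower );
    return ( str1Cpy == str2Cpy );
}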

From what I have read this is more portable than stricmp() because stricmp() is not in fact part of the std library, but only implemented by most compiler vendors.

To get a truly Unicode friendly implementation it appears you must go outside the std library. One good 3rd party library is the IBM ICU (International Components for Unicode)

Also boost::iequals provides a fairly good utility for doing this sort of comparison.

Adam
+5  A: 

If you are on a POSIX system, you can use strcasecmp. This function is not part of standard C, though, nor is it available on Windows. This will perform a case-insensitive comparison on 8-bit chars, so long as the locale is POSIX. If the locale is not POSIX, the results are undefined (so it might do a localized compare, or it might not). A wide-character equivalent is not available.

On Windows, you can use the _stricmp or _wcsicmp functions to perform case-insensitive comparison. These functions do use the current locale information, and the wide-character version is available. However, these are obviously Windows-specific.

C and C++ are both largely ignorant of internationalization issues, so there's no good solution to this problem, except to use a third-party library. Check out IBM ICU (International Components for Unicode) if you need a robust library for C/C++. ICU is for both Windows and Unix systems.
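For plain 8-bit text, the two platform-specific functions can be hidden behind one wrapper. This is only a sketch: the name compare_no_case is hypothetical, and the _WIN32 check is an assumed way of telling the two platforms apart.

```cpp
#include <cassert>
#include <string>
#if defined(_WIN32)
  #include <string.h>   // _stricmp (Windows CRT)
#else
  #include <strings.h>  // strcasecmp (POSIX)
#endif

// Hypothetical wrapper: returns <0, 0 or >0 like strcmp, ignoring case.
// Only meaningful for single-byte text; bytes outside ASCII are handled
// according to the current locale, and embedded NULs end the comparison.
inline int compare_no_case(const std::string& a, const std::string& b) {
#if defined(_WIN32)
    return _stricmp(a.c_str(), b.c_str());
#else
    return strcasecmp(a.c_str(), b.c_str());
#endif
}

inline void check_compare_no_case() {
    assert(compare_no_case("Hello", "hELLO") == 0);
    assert(compare_no_case("apple", "Banana") < 0);
}
```

This does nothing for Unicode; it only papers over the naming difference between the two platforms.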

Derek Park
A: 

@Adam

There is no good method of doing this that will be Unicode friendly with out transforming the strings, so I have written a function here that would in fact be Unicode safe and portable:

That's not going to work. If you want Unicode support, you have to move outside of the standard C/C++ libraries. They do not support Unicode in any real sense. They simply weren't designed for Unicode.

Derek Park
A: 

I think a better question would be why are you trying to do a case insensitive comparison with a character encoding that could have different semantics for capitalization?

MSN

Mat Noguchi
+1  A: 
Shadow2531
A: 

@Adam:

While this variant is good in terms of usability, it's bad in terms of performance because it creates unnecessary copies. I might be overlooking something, but I believe the best (non-Unicode) way is to use stricmp (a vendor extension, not part of namespace std). Otherwise, read what Herb has to say.
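A comparison that avoids the copies and still relies only on the standard library could look roughly like this. The function name ci_compare is made up for this sketch, and it is limited to single-byte characters in the current C locale.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: returns <0, 0 or >0 like stricmp, without copying.
inline int ci_compare(const std::string& a, const std::string& b) {
    const std::size_t n = a.size() < b.size() ? a.size() : b.size();
    for (std::size_t i = 0; i < n; ++i) {
        // Convert via unsigned char: tolower is undefined for negative values.
        const int ca = std::tolower(static_cast<unsigned char>(a[i]));
        const int cb = std::tolower(static_cast<unsigned char>(b[i]));
        if (ca != cb) return ca < cb ? -1 : 1;
    }
    // Equal up to the shorter length: the shorter string sorts first.
    if (a.size() == b.size()) return 0;
    return a.size() < b.size() ? -1 : 1;
}

inline void check_ci_compare() {
    assert(ci_compare("Hello", "hello") == 0);
    assert(ci_compare("abc", "ABD") < 0);
}
```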

Konrad Rudolph
+1  A: 

I wrote a case-insensitive version of char_traits for use with std::basic_string in order to generate a std::string that is not case-sensitive when doing comparisons, searches, etc using the built-in std::basic_string member functions.

So in other words, I wanted to do something like this.

std::string a = "Hello, World!";
std::string b = "hello, world!";

assert( a == b );

...which std::string can't handle. Here's the usage of my new char_traits:

istring a = "Hello, World!";
istring b = "hello, world!";

assert( a == b );

...and here's the implementation:

/*  ---

     Case-Insensitive char_traits for std::string's

     Use:

      To declare a std::string which preserves case but ignores case in comparisons & search,
      use the following syntax:

       std::basic_string<char, char_traits_nocase<char> > noCaseString;

      A typedef is declared below which simplifies this use for chars:

       typedef std::basic_string<char, char_traits_nocase<char> > istring;

    --- */

    template<class C>
    struct char_traits_nocase : public std::char_traits<C>
    {
     static bool eq( const C& c1, const C& c2 )
     { 
      return ::toupper(c1) == ::toupper(c2); 
     }

     static bool lt( const C& c1, const C& c2 )
     { 
      return ::toupper(c1) < ::toupper(c2);
     }

     static int compare( const C* s1, const C* s2, size_t N )
     {
      return _strnicmp(s1, s2, N);
     }

     static const C* find( const C* s, size_t N, const C& a )
     {
      for( size_t i=0 ; i<N ; ++i )
      {
       if( ::toupper(s[i]) == ::toupper(a) ) 
        return s+i ;
      }
      return 0 ;
     }

     static bool eq_int_type( const typename std::char_traits<C>::int_type& c1, const typename std::char_traits<C>::int_type& c2 )
     { 
      return ::toupper(c1) == ::toupper(c2) ; 
     }  
    };

    template<>
    struct char_traits_nocase<wchar_t> : public std::char_traits<wchar_t>
    {
     static bool eq( const wchar_t& c1, const wchar_t& c2 )
     { 
      return ::towupper(c1) == ::towupper(c2); 
     }

     static bool lt( const wchar_t& c1, const wchar_t& c2 )
     { 
      return ::towupper(c1) < ::towupper(c2);
     }

     static int compare( const wchar_t* s1, const wchar_t* s2, size_t N )
     {
      return _wcsnicmp(s1, s2, N);
     }

     static const wchar_t* find( const wchar_t* s, size_t N, const wchar_t& a )
     {
      for( size_t i=0 ; i<N ; ++i )
      {
       if( ::towupper(s[i]) == ::towupper(a) ) 
        return s+i ;
      }
      return 0 ;
     }

     static bool eq_int_type( const int_type& c1, const int_type& c2 )
     { 
      return ::towupper(c1) == ::towupper(c2) ; 
     }  
    };

    typedef std::basic_string<char, char_traits_nocase<char> > istring;
    typedef std::basic_string<wchar_t, char_traits_nocase<wchar_t> > iwstring;
John Dibling
This works for regular chars, but won't work for all of Unicode, as capitalization is not necessarily bidirectional (there's a good example in Greek involving sigma that I can't remember right now; something like it has two lower and one upper case, and you can't get a proper comparison either way).
coppro
That's really the wrong way to go about it. Case sensitivity should not be a property of the strings themselves. What happens when the same string object needs both case-sensitive and case insensitive comparisons?
Ferruccio
If case-sensitivity isn't appropriate to be "part of" the string, then neither is the find() function at all. Which, for you, might be true, and that's fine. IMO the greatest thing about C++ is that it doesn't force a particular paradigm on the programmer. It is what you want/need it to be.
John Dibling
Actually, I think most C++ gurus (like the ones on the standards committee) agree that it was a mistake to put find() in std::basic_string<> along with a whole lot of other things that could equally well be placed in free functions. Besides, there are some issues with putting it in the type.
Andreas Magnusson
As others have pointed out, there are two major things wrong with this solution (ironically, one is the interface and the other is the implementation ;-)).
Konrad Rudolph
… but since Herb Sutter has made the same mistake and I've apparently even linked his article (I don't remember this!), I can't very well complain.
Konrad Rudolph
+33  A: 

Boost includes a handy algorithm for this:

#include <boost/algorithm/string.hpp>

std::string str1 = "hello, world!";
std::string str2 = "HELLO, WORLD!";

if (boost::iequals(str1, str2))
{
    // Strings are equal, ignoring case
}
Rob
Is this UTF-8 friendly? I think not.
vladr
A: 

Just a note on whatever method you finally choose, if that method happens to include the use of strcmp that some answers suggest:

strcmp doesn't work with Unicode data in general. In general, it doesn't even work with byte-based Unicode encodings, such as utf-8, since strcmp only makes byte-per-byte comparisons and Unicode code points encoded in utf-8 can take more than 1 byte. The only specific Unicode case strcmp properly handles is when a string encoded with a byte-based encoding contains only code points below U+00FF - then the byte-per-byte comparison is enough.
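The same limitation applies to any byte-wise case folding. The sketch below uses a hand-written ASCII-only fold (ascii_lower and ascii_iequals are hypothetical names, and UTF-8 is an assumed encoding) to show that two strings differing only in the case of an accented letter do not match.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// ASCII-only lowercase: bytes outside 'A'..'Z' are returned unchanged.
inline char ascii_lower(char c) {
    return (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c;
}

// Hypothetical byte-wise case-insensitive equality.
inline bool ascii_iequals(const std::string& a, const std::string& b) {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (ascii_lower(a[i]) != ascii_lower(b[i])) return false;
    }
    return true;
}

// UTF-8 bytes for U+00C9 (capital E with acute) and U+00E9 (small e with acute).
const std::string upper_e_acute = "\xC3\x89";
const std::string lower_e_acute = "\xC3\xA9";

inline void check_ascii_iequals() {
    assert(ascii_iequals("ABC", "abc"));                   // ASCII folds fine
    assert(!ascii_iequals(upper_e_acute, lower_e_acute));  // non-ASCII does not
}
```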

Johann Gerell
A: 

You can use strcasecmp on Unix, or stricmp on Windows.

One thing that hasn't been mentioned so far is that if you are using stl strings with these methods, it's useful to first compare the length of the two strings, since this information is already available to you in the string class. This could prevent doing the costly string comparison if the two strings you are comparing aren't even the same length in the first place.
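As a rough illustration of the length shortcut, here is a sketch built on the standard library instead of strcasecmp or stricmp so that it stays portable. The name ci_equal is hypothetical, and the approach assumes single-byte characters.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper for equality only (not ordering).
inline bool ci_equal(const std::string& a, const std::string& b) {
    // Cheap rejection: byte strings of different length cannot be equal.
    if (a.size() != b.size()) return false;
    // Lengths match, so the three-iterator form of std::equal is safe here.
    return std::equal(a.begin(), a.end(), b.begin(),
                      [](unsigned char x, unsigned char y) {
                          return std::tolower(x) == std::tolower(y);
                      });
}

inline void check_ci_equal() {
    assert(ci_equal("Hello", "HELLO"));
    assert(!ci_equal("Hello", "Hello!"));  // rejected by the length check
}
```

The shortcut is only valid for an equality test on single-byte text. Under full Unicode case folding, strings of different lengths can still match (for example "ß" folds to "ss"), so it should not be carried over to a Unicode-aware comparison.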

bradtgmurray
+5  A: 

The Boost.String library has a lot of algorithms for doing case-insensitive comparisons and so on.

You could implement your own, but why bother when it's already been done?

Dean Harding
There isn't a way built-in with std::string?
WilliamKF
No, there isn't.
Dean Harding
+11  A: 

Take advantage of the standard char_traits. Recall that a std::string is in fact a typedef for std::basic_string<char>, or more explicitly, std::basic_string<char, std::char_traits<char> >. The char_traits type describes how characters compare, how they copy, how they cast etc. All you need to do is typedef a new string over basic_string, and provide it with your own custom char_traits that compare case insensitively.

#include <cctype>   // std::toupper
#include <cstddef>  // std::size_t
#include <string>

struct ci_char_traits : public std::char_traits<char> {
    static bool eq(char c1, char c2) { return std::toupper(c1) == std::toupper(c2); }
    static bool ne(char c1, char c2) { return std::toupper(c1) != std::toupper(c2); }
    static bool lt(char c1, char c2) { return std::toupper(c1) <  std::toupper(c2); }
    static int compare(const char* s1, const char* s2, std::size_t n) {
        while( n-- != 0 ) {
            if( std::toupper(*s1) < std::toupper(*s2) ) return -1;
            if( std::toupper(*s1) > std::toupper(*s2) ) return 1;
            ++s1; ++s2;
        }
        return 0;
    }
    static const char* find(const char* s, std::size_t n, char a) {
        // Return a pointer to the first match, or 0 if a does not occur in s[0..n).
        for( ; n != 0; --n, ++s ) {
            if( std::toupper(*s) == std::toupper(a) ) return s;
        }
        return 0;
    }
};

typedef std::basic_string<char, ci_char_traits> ci_string;

The details are on Guru of The Week number 29.
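For a sense of how such a string type behaves in practice, here is a self-contained sketch. It restates the traits class in a slightly condensed form (the up helper, the unsigned char casts and the to_std_string function are additions made for this example, not part of the Guru of the Week code) and then exercises comparison and search.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

struct ci_char_traits : public std::char_traits<char> {
    // Helper added for this sketch: toupper is undefined for negative values.
    static int up(char c) { return std::toupper(static_cast<unsigned char>(c)); }
    static bool eq(char c1, char c2) { return up(c1) == up(c2); }
    static bool lt(char c1, char c2) { return up(c1) <  up(c2); }
    static int compare(const char* s1, const char* s2, std::size_t n) {
        for (; n != 0; --n, ++s1, ++s2) {
            if (up(*s1) < up(*s2)) return -1;
            if (up(*s1) > up(*s2)) return 1;
        }
        return 0;
    }
    static const char* find(const char* s, std::size_t n, char a) {
        for (; n != 0; --n, ++s) {
            if (up(*s) == up(a)) return s;
        }
        return nullptr;  // not found
    }
};

typedef std::basic_string<char, ci_char_traits> ci_string;

// ci_string and std::string are unrelated types, so moving between them
// needs an explicit conversion (hypothetical helper).
inline std::string to_std_string(const ci_string& s) {
    return std::string(s.data(), s.size());
}

inline void check_ci_string() {
    ci_string a = "Hello, World!";
    assert(a == "hello, world!");   // comparison ignores case
    assert(a.find("WORLD") == 7);   // so does searching
    assert(to_std_string(a) == "Hello, World!");  // original case is preserved
}
```

One practical consequence of the design: because the traits are part of the type, a ci_string cannot be passed where a std::string is expected, or streamed to std::cout with operator<<, without a conversion like the one above.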

wilhelmtell