Note: I’m asking this more out of historical interest, as I realise that modern languages have built-in regular expressions and case-insensitive string comparison methods.

When comparing two strings of unknown case, I remember reading that Microsoft's conversion methods were optimized for uppercase rather than lowercase. So:

if (stringA.ToUpper() == stringB.ToUpper()) { ... }

would be quicker than:

if (stringA.ToLower() == stringB.ToLower()) { ... }

If this is true, would it be better to store string data in upper rather than lower case when you need to search it?

+9  A: 

In .NET we could do something like the following:

if (String.Compare(stringA, stringB, StringComparison.InvariantCultureIgnoreCase) == 0) {...}

and not need to worry about turning the strings into upper or lower case. More on this here.

jpoh
Definitely the way to do it
Henri
In general, when you want to be language independent you are better off using StringComparison.OrdinalIgnoreCase.
Richard
Except that it is recommended to use OrdinalIgnoreCase for culture-agnostic string comparison. :)
Frederik Gheysels
+3  A: 

There is no case that is safe to use in general.

Whatever choice you make, it will fail in some situations.

  • Some languages have no case (not really a problem).
  • Some languages have a third "title" case.
  • Some characters do not round trip, e.g. ToUpper("ß") is "SS" and ToLower("SS") is "ss". Some words are distinguished only by "ß" vs "ss", so matching by mapping to upper case gives false positives (and the mapping breaks the assumption that case mapping never changes string length).
  • Case mapping is language dependent. E.g. ToLower("I") is "i" unless you are working in Turkish or Azeri, where the result is "ı" (Latin Small Letter Dotless I) and ToUpper("i") is "İ" (Latin Capital Letter I With Dot Above).
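
The round-trip and length problems in the list above can be reproduced with a short sketch. It is written in Python purely for illustration (the thread itself is about .NET, whose ToUpper/ToLower may map some of these characters differently), because Python's built-in str.upper() and str.lower() apply the full, language-independent Unicode mappings. The variable names are made up for this example.

```python
# Illustrative only: Python's str.upper()/str.lower() use the full,
# language-independent Unicode case mappings (no per-culture tailoring).

sharp_s = "\u00df"                       # ß
upper = sharp_s.upper()                  # ß maps to two letters
round_trip = upper.lower()               # does not come back as ß
length_changed = len(upper) != len(sharp_s)

# Two different German words that collide once both are upper-cased.
false_positive = "Ma\u00dfe".upper() == "Masse".upper()

# Python applies no Turkish/Azeri tailoring, so the dotless ı (U+0131)
# upper-cases to a plain "I", which then lower-cases to a dotted "i".
dotless_i = "\u0131"
dotless_round_trip = dotless_i.upper().lower()

# İ (U+0130) lower-cases to "i" plus a combining dot above: two code points.
dotted_capital_lower = "\u0130".lower()

print(repr(upper), repr(round_trip), length_changed, false_positive)
print(repr(dotless_round_trip), len(dotted_capital_lower))
```

Neither direction survives a round trip here: "ß" comes back as "ss", and "ı" comes back as "i".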

In the past, approaches based on ToUpper and ToLower assumed English-only text and ignored the majority of the world's glyphs and characters. To be more enlightened, you need to use case mapping tables as the basis for case-insensitive comparisons.
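
As a rough sketch of a table-based comparison, Python exposes the Unicode case folding tables through str.casefold(). The helper name equals_ignore_case is hypothetical, and this is only one possible approach, not the .NET mechanism discussed elsewhere in this thread.

```python
def equals_ignore_case(a: str, b: str) -> bool:
    """Compare two strings using Unicode case folding.

    Case folding is a mapping defined specifically for caseless matching,
    as opposed to converting both sides to upper or lower case.
    """
    return a.casefold() == b.casefold()


print(equals_ignore_case("Hello", "hELLO"))
print(equals_ignore_case("Stra\u00dfe", "STRASSE"))
print(equals_ignore_case("apple", "apply"))
```

Note that default case folding is still language independent: it does not apply Turkish or Azeri rules, and it deliberately treats "ß" and "ss" as equal, so the caveats above about false positives still apply.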

Richard
A: 

In ANSI/ASCII codes, uppercase letters have lower values than lowercase letters: "A" is code 65 and "a" is code 97 (binary 01000001 and 01100001). The difference between a lowercase and an uppercase letter is thus a single bit.
But does this matter for speed? Either way, all 8 bits have to be compared, so a speed difference could only be explained if comparing two bits were faster when both bits are 0. That doesn't make much sense to me, but then again, it could have been true on some older processors.
But nowadays? I don't think you'll notice any difference.
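
The single-bit difference can be checked directly. This is a small Python sketch covering plain ASCII letters only; the helper names ascii_upper and ascii_lower are invented for the example.

```python
upper_a = ord("A")               # 65
lower_a = ord("a")               # 97
difference = upper_a ^ lower_a   # the one bit that differs (0x20)

print(format(upper_a, "08b"), format(lower_a, "08b"), hex(difference))


def ascii_upper(ch: str) -> str:
    """Clear bit 0x20 for ASCII a-z; leave every other character alone."""
    return chr(ord(ch) & ~0x20) if "a" <= ch <= "z" else ch


def ascii_lower(ch: str) -> str:
    """Set bit 0x20 for ASCII A-Z; leave every other character alone."""
    return chr(ord(ch) | 0x20) if "A" <= ch <= "Z" else ch


print(ascii_upper("q"), ascii_lower("Q"), ascii_upper("1"))
```

Both directions cost one range check and one bit operation, so within ASCII there is no obvious reason for either to be faster.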


However, there could be a speed difference in converting lowercase to uppercase or vice versa, especially when you have to support letters with accents or other non-ANSI letters. In these cases a special mapping must be used, which might have been optimized for one direction. It's not the comparison that would be slow; it would be the conversion slowing things down.

Workshop Alex