views:

358

answers:

3

Is there a case-insensitive variant of the Bob Jenkins hash function?

Generics.Defaults.BobJenkinsHash

provides a fast hash function. Unfortunately it cannot be used in combination with a case-insensitive compare function like so

TCustomStringComparer = class (TEqualityComparer <String>)
  function Equals(const Left, Right: String): Boolean; override;
  function GetHashCode(const Value: String): Integer; override;
end;
function TCustomStringComparer.Equals (const Left, Right : String) : Boolean;
begin
  Result := CompareText (Left, Right) = 0;
end;
function TCustomStringComparer.GetHashCode (const Value : String) : Integer;
begin
  Result := Generics.Defaults.BobJenkinsHash (Value [1], Length (Value) * SizeOf (Char), 0);
end;

This is because TDictionary first compares the hash codes and then uses the provided comparer when checking for equality.

Of course I could use UpperCase in my GetHashCode function, but I wondered if it would be faster if I could somehow modify the hash function itself.

+5  A: 

No, there is no case invariant version of the hash function. Lower or upper case all strings before passing them to the hash function.

Niels Castle
That's what I wrote in my last paragraph...
Smasher
Yes, and I am confirming it as an solution.
Niels Castle
+2  A: 

It would be slightly faster, but it hurts your maintainability a lot. There is rarely a good reason for this type of micro-optimization. Just convert your strings to lower or upper case before hashing like you suggested.

"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified" - Donald Knuth

Thorarin
+1 for Knuth's FULL quote. So often we only get the "premature optimization is the root of all evil" part.
Gerry
The Knuth quote is good, but you should have taken it to heart yourself, especially the "after that code has been identified" part - what makes you say it would be slightly faster? I don't think so. And both our opinions are essentially worthless without profiling the code.
mghie
+2  A: 

IMO the whole question is wrong. To quote the Wikipedia article on hash functions:

A hash function is any well-defined procedure or mathematical function which converts a large, possibly variable-sized amount of data into a small datum, usually a single integer that may serve as an index to an array.

Note the "amount of data" - there is no type specified, and indeed the Bob Jenkins hash function has an untyped parameter const Data pointing to the data to be hashed. Since the input data is not necessarily a sequence of characters there is no way to compute a "case-insensitive" hash value. And even if it were a sequence of characters - the upper- or lowercasing would depend on character set and encoding. So you would need to have different hash functions for ASCII strings, UTF-8 encoded strings, UTF-16 LE encoded strings, ... (you get the idea).

mghie
hmm...that seems right...okay then, i will go for the upperCase then...or the unicode version of that
Smasher