tags:

views:

696

answers:

12

I am getting a string hash like this:

string content = "a very long string";
int contentHash = content.GetHashCode();

I am then storing the hash into a dictionary as key mapping to another ID. This is useful so I don't have to compare big strings during default dictionary hash computation but I can just fish the ID from the dictionary by key.

Can I be sure that the hash for a given string ("a very long string") will be always the same?

Can I be sure that two different strings won't have the same hash?

Also, if possible, how likely is it to get the same hash for different strings?

+1  A: 

Strings are hashed based on their content, so yes, that hash should remain the same over time if you use the default GetHashCode.

John Källén
+4  A: 

Yes it will, that's the purpose of a hash code! It's not guaranteed to be the same between different versions of the runtime tho. More info on MSDN

MrWiggles
I was reading somewhere that "sometimes it could change" but not too sure where
JohnIdol
But be aware that multiple strings can map to the same hash code!
peSHIr
Good point peSHIr, in my rush to get the answer out I forgot about that.
MrWiggles
Yes it can change between implementations so don't rely on it when exchanging data between different systems. Inside a single system, it will stay constant.
Think Before Coding
so I cannot rely on this to understand if I am dealing with "the same" string
JohnIdol
as long as you are using the same runtime then yes. if you generate a hashcode in 1.1 and then move to 2 tho, there is no guarantee there
MrWiggles
I am using the same runtime but my concern is really that two different string might have the same hash so I am not sure it's the same looking at the hash -- how is this commonly implemented?
JohnIdol
+3  A: 

The documentation for Object.GetHashCode states

If two objects compare as equal, the GetHashCode method for each object must return the same value.

So you are guaranteed that the hash code will be the same for a given string. However, you aren't guaranteed that it will be unique (there may be other strings that have the same hash code).

Ben Lings
so I cannot rely on this to understand if I am dealing with "the same" string
JohnIdol
+10  A: 

Yes, it will be consistent since strings are immutable. However, I think you're misusing the dictionary. You should let the dictionary take the hash of the string for you by using the string as the key. Hashes are not guaranteed to be unique, so you may overwrite one key with another.

HTH, Kent

Kent Boogaart
I am checking if the key is contained before adding - I need to map the hash to another ID - the I retrieve by the hash and use the ID for something else, that's why I am using a dictionary
JohnIdol
The misuse is creating the hashcode manually and using that as a key, rather than using the string itself as a key. This results in unnecessarily complex code and you asking a question here when most likely the performance problem you want to avoid exists only in your imagination.
Michael Borgwardt
this is a part of a pretty heavy process so the performance will matter - anyway, are you suggesting to use the string as the key no matter how big?
JohnIdol
Yes. And "performance will matter" is not the same as "this specific piece of code will be performance-critical". I'll bet you 3 to 1 that your program will spend most of its CPU time somewhere else that you didn't even think of.
Michael Borgwardt
You are probably right. I will try with hash as key then with string as key just out of curiosity though.
JohnIdol
+1  A: 

As has already been mentioned you can be sure that a hash for a partiular string will be the same as they are hashed based on content. However you cannot be sure that a particular string will be hashed the same for later versions of the .NET framework as is mentioned here

So I would say that this method is fine if it is being used internally to an application. If you are persisting the value to a data store then it is probably best to roll your own function to ensure that it remains consistent across versions.

Jack Ryan
+2  A: 

As others pointed out, the hash will remain constant over time. But why are you hashing a string and then put it as key on a Dictionary? Hashes are not guaranteed to be unique. So you comparisons might be incorrect. Let the Dictionary do it's job. I think the most appropriate collection to this case is a HashSet.

bruno conde
I need to map the hash to another ID - the I retrieve by the hash and use the ID for something else, that's why I am using a dictionary - how would I do that with an hashtable?
JohnIdol
In that case you need to use a Dictionary.
bruno conde
only concern now is the fact that different strings could have the same hash
JohnIdol
+3  A: 

Just to add some detail as to where the idea of a changing hashcode may have come from.

As the other answers have rightly said the hashcode for a specific string will always be the same for a specific runtime version. There is no guarantee that a newer runtime might use a different algorithm perhaps for performance reasons.

The String class overrides the default GetHashCode implementation in object.

The default implementation for a reference type in .NET is to allocate a sequential ID (held internally by .NET) and assign it to the object (the objects heap storage has slot for storing this hashcode, it only assigned on the first call to GetHashCode for that object).

Hence creating an instance of a class, assigning it some values then retrieving the hashcode, followed by doing the exact same sequence with the same set of values will yeild different hashcodes. This may be the reason why some have been led to believe that hashcodes can change. In fact though its the instance of a class which is allocated a hashcode once allocated that hashcode does not change for that instance.

Edit: I've just noticed that none of the answers directly reference each of you questions (although I think the answer to them is clear) but just to tidy up:-

Can I be sure that the hash for a given string ("a very long string") will be always the same?

In your usage, yes.

Can I be sure that two different strings won't have the same hash?

No. Two different strings may have the same hash.

Also, if possible, how likely is it to get the same hash for different strings?

The probability is quite low, resulting hash is pretty random from a 4G domain.

AnthonyWJones
thanks - I do not understand how people can use hash to store passwords if then it's not possible to use the hash to check if that's the same password (could be another password with the same hash!)
JohnIdol
Yes, there could. But it's unlikely enough that it simply does not matter.
Michael Borgwardt
You wouldn't use the inbuilt hashing algorithm for hashing passwords, you would use an industry standard one like SHA-512
MrWiggles
The value stored for a password hash will use many bits than the the hashcode returned by GetHashCode. With a wider hashcode the probability of a collision is almost negligable.
AnthonyWJones
A: 

This is a great example for the evils of premature optimization.

Do you have the output of a profiler or benchmark that tells you String comparison between entries in the same hash bucket is actually causing a performance problem?

Didn't think so. Just use the string itself as a key in the Dictionary. That's how you're supposed to use it.

BTW, there are far, far more different strings than there are different int, so basic logic tells you that it's impossible to have a different hashcode for each different string.

Michael Borgwardt
wish I was so smart
JohnIdol
One of the rare examples where the higher math that CS at university forces on you actually is useful - concepts like injective functions become ingrained.
Michael Borgwardt
forgot everything about that - I was just looking at this link and the performance impact doesn't seem reasonable since in my case I am dealing with huge strings (200 characters as a bare minimum)
JohnIdol
link: http://dotnetperls.com/Content/Dictionary-String-Key.aspx
JohnIdol
That still doesn't mean it will have any kind of impact on your system. Again: first make it work correctly, THEN see if performance is not satisfying, THEN see where the actual performance bottlenecks are.
Michael Borgwardt
+1  A: 

Can I be sure that the hash for a given string ("a very long string") will be always the same?

Yes

Can I be sure that two different strings won't have the same hash?

No

Michał Piaskowski
+1  A: 

You don't have to guess about run-times or versions, just use this CaseInsensitiveStringComparer class that I made in my spare time (you can pass it to the constructor of the dictionary or if you are using .NET 3.5, a HashSet):

/// <summary>
/// StringComparer that is basically the same as StringComparer.OrdinalIgnoreCase, except that the hash code function is improved and guaranteed not to change.
/// </summary>
public class CaseInsensitiveStringComparer : StringComparer
{
    /// <summary>
    /// Compares two strings, ignoring case
    /// </summary>
    /// <param name="x">First string</param>
    /// <param name="y">Second string</param>
    /// <returns>Compare result</returns>
    public override int Compare(string x, string y)
    {
        return StringComparer.OrdinalIgnoreCase.Compare(x, y);
    }

    /// <summary>
    /// Checks if two strings are equal, ignoring case
    /// </summary>
    /// <param name="x">First string</param>
    /// <param name="y">Second string</param>
    /// <returns>True if strings are equal, false if not</returns>
    public override bool Equals(string x, string y)
    {
        return Compare(x, y) == 0;
    }

    /// <summary>
    /// Gets a hash code for a string, ignoring case
    /// </summary>
    /// <param name="obj">String to get hash code for</param>
    /// <returns>Hash code</returns>
    public override int GetHashCode(string obj)
    {
        if (obj == null)
        {
            return 0;
        }
        int hashCode = 5381;
        char c;
        for (int i = 0; i < obj.Length; i++)
        {
            c = obj[i];
            if (char.IsLower(c))
            {
                c = char.ToUpperInvariant(c);
            }
            hashCode = ((hashCode << 5) + hashCode) + c;
        }
        return hashCode;
    }
}
John JJ Curtis
+1  A: 

Given that there are an infinite number of different strings its just not possible to allocate a different int (32bits which can represent up to 4 billion) number for each.

With just 8 characters tehre are 2^60 different strings. This is infinitely larger than 2^32. Naturally the hashcode of some of these strings must clash.

Two objects with the same hashcode do not have to be equal. To know for sure use the equals method. This is basically the strategy used by a hashmap to determine if keys are equal.

Map.get(String key)

  • Calculate hashcode of key
  • Use modulo to figure out which bucket key belongs too.
  • Loop thru all the entries from that bucket attempting to find a matching key.
  • When a key match is found return that entries' value.

As a side note as maps gain more and more elements it will recreate more buckets and place all the old entries into the new buckets. This helps present the bucket entry list from growing into really really long lists. A map wants many buckets with short lists.

The javadoc for Object.hashcode makes for interesting reading - ive pasted a snippet below.

 The equals method implements an equivalence relation:

* It is reflexive: for any reference value x, x.equals(x) should return true.
* It is symmetric: for any reference values x and y, x.equals(y) should return true if and only if y.equals(x) returns true.
* It is transitive: for any reference values x, y, and z, if x.equals(y) returns true and y.equals(z) returns true, then x.equals(z) should return true.
* It is consistent: for any reference values x and y, multiple invocations of x.equals(y) consistently return true or consistently return false, provided no information used in equals comparisons on the object is modified.
* For any non-null reference value x, x.equals(null) should return false.

The equals method for class Object implements the most discriminating possible equivalence relation on objects; that is, for any reference values x and y, this method returns true if and only if x and y refer to the same object (x==y has the value true).

mP
+3  A: 

As many others have said, the implementation is dependent on the version of the framework but it also depends on the architecture. The implementation of string.GetHashCode() is dfferent in the x86 and x64 versions of the framework even if they have the same version number.

For example, if you are writing a client/server or .net remoting type of architecture and want to use a string HashCode to stop from downloading a large resource, you can only do this if both are the same version and bitness. Otherwise you should use a different hash -- MD5, SHA etc will work correctly.

Jack Bolding
+1 This just bit me last night when rolling out a new server... Thanks for the useful info.
Aaron