A glance at the source code for string.GetHashCode using Reflector reveals the following (for mscorlib.dll version 4.0):

public override unsafe int GetHashCode()
{
    fixed (char* str = ((char*) this))
    {
        char* chPtr = str;
        int num = 0x15051505;
        int num2 = num;
        int* numPtr = (int*) chPtr;
        for (int i = this.Length; i > 0; i -= 4)
        {
            num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[0];
            if (i <= 2)
            {
                break;
            }
            num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[1];
            numPtr += 2;
        }
        return (num + (num2 * 0x5d588b65));
    }
}

Now, I realize that the implementation of GetHashCode is not specified and is implementation-dependent, so the question "is GetHashCode implemented in the form of X or Y?" is not really answerable. I'm just curious about a few things:

  1. If Reflector has disassembled the DLL correctly and this is the implementation of GetHashCode (in my environment), am I correct in interpreting this code to indicate that a string object, based on this particular implementation, would not cache its hash code?
  2. Assuming the answer is yes, why would this be? It seems to me that the memory cost would be minimal (one more 32-bit integer, a drop in the pond compared to the size of the string itself) whereas the savings would be significant, especially in cases where, e.g., strings are used as keys in a hashtable-based collection like a Dictionary<string, [...]>. And since the string class is immutable, it isn't like the value returned by GetHashCode will ever even change.

What could I be missing?
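
Just to make concrete what I mean by "caching," here's roughly the kind of thing I have in mind. This is purely a hypothetical sketch of a string-like wrapper, not anything from the BCL; the CachedHashString name, the _cachedHashCode field, and the zero-as-sentinel trick are all my own invention:

// Hypothetical sketch only - NOT how System.String is implemented.
// The idea: spend one extra int per instance and compute the hash at most once.
public sealed class CachedHashString
{
    private readonly string _value;
    private int _cachedHashCode; // 0 means "not computed yet" (good enough for a sketch)

    public CachedHashString(string value)
    {
        _value = value;
    }

    public override int GetHashCode()
    {
        if (_cachedHashCode == 0) // cheap check on every call
        {
            int hash = _value.GetHashCode();          // whatever the underlying algorithm is
            _cachedHashCode = (hash == 0) ? 1 : hash; // don't confuse a real 0 with "missing"
        }
        return _cachedHashCode;
    }

    public override bool Equals(object obj)
    {
        CachedHashString other = obj as CachedHashString;
        return other != null && _value == other._value;
    }
}

A plain field like this would be safe even without locking, because the cached value is deterministic: if two threads race, they both write the same number.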


UPDATE: In response to Andras Zoltan's closing remark:

There's also the point made in Tim's answer(+1 there). If he's right, and I think he is, then there's no guarantee that a string is actually immutable after construction, therefore to cache the result would be wrong.

Whoa, whoa there! This is an interesting point to make (and yes it's very true), but I really doubt that this was taken into consideration in the implementation of GetHashCode. The statement "therefore to cache the result would be wrong" implies to me that the framework's attitude regarding strings is "Well, they're supposed to be immutable, but really if developers want to get sneaky they're mutable so we'll treat them as such." This is definitely not how the framework views strings. It fully relies on their immutability in so many ways (interning of string literals, assignment of all zero-length strings to string.Empty, etc.) that, basically, if you mutate a string, you're writing code whose behavior is entirely undefined and unpredictable.
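
For instance, here's one small illustration of how much the framework leans on immutability (this relies only on the interning of identical string literals):

string a = "Hello World";
string b = "Hello World";

// Identical literals are interned: both variables reference the SAME object.
Console.WriteLine(object.ReferenceEquals(a, b)); // True

// So mutating the buffer behind 'a' (as in Tim's snippet) would silently change
// what every other use of the literal "Hello World" sees - which is exactly why
// the framework simply assumes nobody does that.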

I guess my point is that for the author(s) of this implementation to worry, "What if this string instance is modified between calls, even though the class as it is publicly exposed is immutable?" would be like someone planning a casual outdoor BBQ thinking to him-/herself, "What if someone brings an atomic bomb to the party?" Look, if someone brings an atom bomb, the party's over.

+7  A: 

Obvious potential answer: because that will cost memory.

There's a cost/benefit analysis here:

Cost: 4 bytes for every string (and a quick test on each call to GetHashCode). It would also make the string object internally mutable, which would obviously mean you'd need to be careful about the implementation.

Benefit: Avoid recomputing the hash for string values hashed more than once

I would suggest that in many cases, there are many, many string objects and very few of them are hashed more than once - leading to a net cost. In some cases, obviously, that won't be true.

I don't think I'm in a good position to judge which comes up more often... I would hope that MS has instrumented various real apps. (I'd also hope that Sun did the same for Java, which does cache the hash...)

EDIT: I've just spoken to Eric Lippert about this (NDC is awesome :) and basically it is about the extra memory hit vs the limited benefits.

Jon Skeet
@Jon - I don't personally think that the memory is the issue here. I think it's partly to do with the fact that string.GetHashCode is rarely called - Dictionary, HashSet et al will use a different implementation (see my answer); partly that the concept of a de-facto hash code for a string is actually nonsensical unless you always see it as a byte array; plus the fact that via the code Tim posted you can modify the underlying memory buffer of the string, thus rendering a stored hash code useless.
Andras Zoltan
@Andras: I don't see it as a byte array - I see it as a char array, basically. I think if someone mutates the memory, they're on dangerous ground to start with... so my guess (and that's all it is) is still on memory. Dictionary etc *can* use a different implementation, but I don't believe they will by default, which is what most people are likely to use.
Jon Skeet
At this point it would be hard *not* to acknowledge this as the authoritative answer. I might make a few more remarks on this later, but for now I'm just going to go ahead and accept it. Thanks for the really helpful insight.
Dan Tao
+3  A: 

Firstly - there's no knowing whether caching this result would actually improve Dictionary<string, ...> et al, because they don't necessarily use String.GetHashCode; they use an IEqualityComparer<string> to get the hash code for a string.
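
For example (just to show the mechanism - the choice of comparer here is arbitrary):

// A dictionary built with an explicit comparer hashes its keys through that
// comparer's GetHashCode(string), not through String.GetHashCode:
var lookup = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
lookup["Hello"] = 1;

// Found: under OrdinalIgnoreCase, "HELLO" hashes and compares equal to "Hello".
Console.WriteLine(lookup.ContainsKey("HELLO")); // True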

And if you follow the likely call chain for the StringComparer class, it ends up going through to the System.Globalization.CompareInfo class, which finally terminates at this method:

[SecurityCritical, SuppressUnmanagedCodeSecurity, DllImport("QCall", CharSet = CharSet.Unicode)]
private static extern int InternalGetGlobalizedHashCode(IntPtr handle, string localeName, string source, int length, int dwFlags);

There's no knowing whether that method - which appears to be implemented in native code - doesn't use some form of internal caching based on the underlying .Net object data structure, something we can't get at from inside the .Net runtime.

However, the important thing to note here is that one string can have many different hash codes based on how you choose to interpret the characters. Granted, this implementation is culture-insensitive - which is why it's unsuitable for these comparers.
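
To illustrate (a quick sketch - the actual numbers vary by runtime and process; the point is only that they come from different algorithms):

string s = "Hello";

// Hash codes for the same string under different interpretations
// (they may or may not coincide):
int raw      = s.GetHashCode();                                 // the implementation quoted above
int ordinal  = StringComparer.Ordinal.GetHashCode(s);           // ordinal comparer
int caseless = StringComparer.OrdinalIgnoreCase.GetHashCode(s); // case-insensitive comparer

// The case-insensitive comparer must give equal hashes for strings it considers
// equal, so this much at least is guaranteed:
Console.WriteLine(
    StringComparer.OrdinalIgnoreCase.GetHashCode("HELLO")
    == StringComparer.OrdinalIgnoreCase.GetHashCode("hello")); // True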

So, whilst the additional memory storage could be a factor, I actually think it's because storing a hash code along with an instance of the string would mislead the caller, and indeed the .Net internal dev team(!), into thinking that the string only has one hash code, when in fact it entirely depends on how you're going to interpret it - as a series of bytes (which most of us do not), or as a series of printable characters.

From a performance point of view, then, if we also accept that the comparers used by Dictionary<,> etc. can't be using the internal implementation, not caching this result probably doesn't have much of an impact, because, frankly, how often will this method actually get called in the real world? Most of the time a string's hash code is most likely calculated via some other mechanism.

EDIT

There's also the point made in Tim's answer(+1 there). If he's right, and I think he is, then there's no guarantee that a string is actually immutable after construction, therefore to cache the result would be wrong.

AN ADDITIONAL EDIT(!)

Dan makes the point that strings are meant to be immutable within the .Net sphere and therefore that string should be free to cache its own hashcode based on this. The problem here is that the .Net framework also provides a legitimate way to change the supposedly immutable string that does not involve privileged reflection or anything else. It's a fundamental problem with strings: a string is a pointer to a buffer that you cannot control. Never mind in the C# world - what about in C++, where vectoring over and modifying memory buffers is commonplace? Just because you ideally shouldn't do it doesn't mean that the framework should expect you not to.

.Net happens to provide this functionality, and therefore if this was a design decision by the .Net team in response to the kind of binary thuggery suggested by Tim, then they were very wise to have taken it into account. Whether they did, or whether it is by fluke, is another matter entirely! :)

Andras Zoltan
I knew there was something I just wasn't thinking of, and I think you nailed it here: the hashtable-based classes in .NET don't necessarily use `TKey.GetHashCode`; they use the `IEqualityComparer<TKey>.GetHashCode` of their internal comparers. That makes sense!
Dan Tao
In response to your edit: Take a look at my update at the bottom of the question. (Not saying it isn't an interesting point, of course! I just don't think that could be part of the *reason* for the implementation of `GetHashCode`.)
Dan Tao
@Dan Tao: Have updated again - while I think you have a point about 'shouldn't modify it', the point is that .Net framework gives the developer the ability to do so within legal bounds - and therefore the .Net team would have been right to have taken that into account.
Andras Zoltan
@Andras: You make your point well, but I still have to disagree. What is the purpose of having a hash code in the first place? The obvious answer that immediately comes to mind is: to facilitate using an object as a key in a hashtable. But working as a key *also* requires a valid `Equals` method; and if we're mutating our keys, suddenly all of our hashtable operations aren't going to work anymore.
Dan Tao
@Dan Tao: Yes, but I don't see the problem here (apart from the fact that we both agree modifying the string is a bad move!). If a developer is dumb enough to modify a string in this way and then try and use it as a key to retrieve something originally added by that string (the hash table won't keep the string reference, only the resulting hash), he/she won't get a hit and then they'll be left scratching their head.
Andras Zoltan
@Andras: I feel like you're being a little inconsistent, though. First you say the designers were right to consider the potentiality for string mutation in their `GetHashCode` implementation; but then you say that a developer who would mutate a string used in a hashtable deserves no mercy. So for whose benefit was it decided not to cache the hash code, then? Not for a "smart" developer who wouldn't ever mutate a string, nor for a "dumb" developer who would. Also, I'm pretty sure a hashtable *does* need to keep a reference to the string; otherwise, how could it check if a key exists?
Dan Tao
@Dan Tao: Fair enough - thought I'd take a look at Hashtable... Guess what, it also uses IEqualityComparer - so that's null and void again. If this decision was indeed taken at all (rather than being an oversight, which I highly doubt) then it's not for the developers' benefit at all. It's for the benefit of the framework, so that it behaves consistently in the face of potentially inconsistent usage by a developer - usage which, I say again, the framework also allows him or her to engage in. The fact that we can't come up with a good reason why you'd want to mutate a string is not important.
Andras Zoltan
+5  A: 

I may be drawing the wrong conclusion here, but isn't it true that while the string is immutable in the context of a .NET String object, it's still possible to change the value?

For instance, if you were so inclined to do this...

String example = "Hello World";

unsafe
{
    fixed (char* strPointer = example)
    {
        strPointer[1] = 'a';
    }
}

...wouldn't example still represent the same String object, but now with a value that would compute a different value for GetHashCode()? I may be off-base here, but since you could easily (if not pointlessly) do this, that would cause some issues as well.
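
To illustrate, here's the same snippet extended to compare the hash before and after the edit (just a sketch - the actual hash values will vary, but the comparison is what matters):

String example = "Hello World";
int before = example.GetHashCode();

unsafe
{
    fixed (char* strPointer = example)
    {
        strPointer[1] = 'a'; // "Hello World" -> "Hallo World"
    }
}

int after = example.GetHashCode();

// Same object, different contents - and since the hash is recomputed from the
// current characters each time, almost certainly a different hash:
Console.WriteLine(before == after); // False (barring an unlucky collision)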

Tim Stone
+1 - a very good point. It might be hideously bad practise - but you could certainly do it!
Andras Zoltan
@Tim: Yes, you could do this, and in fact [I've written an entire blog post about how this is possible](http://philosopherdeveloper.wordpress.com/2010/05/28/are-strings-really-immutable-in-net/); but the fact is that the framework *relies* on you **not** doing this, for many, many reasons. As just one example, consider string interning of literals. If you type `Console.WriteLine("Hello World")` after the code you've posted it will output `Hallo World`. Since the framework relies so heavily on immutable strings, I doubt the designers would've bothered themselves with this possibility.
Dan Tao
@Dan: That's definitely a fair argument, given that what I wrote is certainly not something I'd ever expect anyone to do in actual .NET code. Like the other answers mention though, I feel like it's ultimately a case of the arguments against just stacking up, whether they're just theoretical issues or otherwise. As a side note, that was an enjoyable blog post.
Tim Stone
@Tim: Good point. I certainly agree with you about the arguments stacking up against. Ultimately, as I imagine generally happens whenever one questions decisions made by the .NET designers on low-level, widely-used code like this (at least when *I* do, anyway), it looks like -- not surprisingly -- there's a *lot* more to think about than I originally considered.
Dan Tao
+1  A: 

One more potential reason for this is that interned strings (specifically those that are added as shared readonly data by the compiler) can have exactly the same format as any other string. The fact that these strings are loaded into readonly memory means that those data pages can be shared easily across processes, but it would not be possible to have them cache a hashcode.

But as others have mentioned, the primary reason for not caching the value is that the additional memory usage is likely to far outweigh the potential savings of hash-code caching. The execution time of GetHashCode is O(N) in the length of the string, so the worst-case cost of repeated hashing is well bounded.

StarryNight