I have the following class:

public class DateRange
{
    private DateTime startDate;
    private DateTime endDate;
    public override bool Equals(object obj)
    {
        DateRange other = (DateRange)obj;
        if (startDate != other.startDate)
            return false;
        if (endDate != other.endDate)
            return false;
        return true;
    }
    ...
}

I need to store some values in a dictionary keyed with a DateRange like:

Dictionary<DateRange, double> tddList;

How should I override the GetHashCode() method of the DateRange class?

A: 
return startDate.GetHashCode() ^ endDate.GetHashCode();

might be a good start. You have to check that you get a good distribution when ranges have the same distance between startDate and endDate but different dates.

Albin Sunnanbo
Bad idea - then *every* date range which starts and ends on the same date will have the same hashcode of 0. (But downvote removed as it felt a little harsh.)
Jon Skeet
+6  A: 

I use this approach from Effective Java for combining hashes:

unchecked
{
    int hash = 17;
    hash = hash * 31 + field1.GetHashCode();
    hash = hash * 31 + field2.GetHashCode();
    ...
    return hash;
}

There's no reason that shouldn't work fine in this situation.

Jon Skeet
@Jon: that's interesting. so this hash is the best at uniform distribution I take it?
Rob
I have seen the same principle of multiplying a hash by a prime number and adding another hash used in image manipulation code, so there must be sound reasons for doing it, but something about this has always bugged me. (cont...)
Dour High Arch
Since hash bits are somewhat uniformly distributed, multiplying one will shift some bits off the high end of the hash container (int in this case). I realize this is unavoidable when adding bits to a fixed size container, but the earlier hashes (field1 in this case) will have more bits shifted out, and all bits of the final hash (field2 in this case) will be retained. Does this not make the technique somewhat unreliable? What if enough hashes are combined to shift all bits of the earlier values out of the container entirely?
Dour High Arch
@Dour High Arch: I'll confess to not knowing the details of *why* this works... I just have it on pretty reliable authority that it's a pretty good general purpose hash algorithm. I'd be interested to see more analysis though.
Jon Skeet
@Jon If compiled checked, this will throw a System.OverflowException on many likely values. Maybe add an explicit unchecked block?
Jon Hanna
@Jon: Yup, added.
Jon Skeet
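
Applied to the DateRange in the question, that combining pattern might look something like the following sketch (not Jon Skeet's exact code; it just uses the two DateTime fields shown in the question):

public override int GetHashCode()
{
    // Combine the field hashes with the 17/31 prime pattern from the answer above.
    unchecked // let the arithmetic wrap rather than throw on overflow
    {
        int hash = 17;
        hash = hash * 31 + startDate.GetHashCode();
        hash = hash * 31 + endDate.GetHashCode();
        return hash;
    }
}

Because the hash is built from exactly the fields that Equals compares, two ranges that compare equal will always produce the same hash code.
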
+1  A: 

You have to shift one end of the range, otherwise a range whose start and end are equal will hash to zero, which I imagine is a pretty common scenario:

return startDate.GetHashCode() ^ (endDate.GetHashCode() << 4);
Rob
Shifting by 4 means throwing away 4 bits of information. While that's probably not problematic with dates (depending on how DateTime.GetHashCode() is implemented), a much better approach is to use multiplication with primes and addition (see Jon Skeet's answer).
Daniel
@Daniel: thanks. Jon's answer is now part of my utility library.
Rob
@Daniel, all hashing throws away some information, unless you can fit all the bits of a representation into the hash size (as Int16.GetHashCode() can, for example). There are pros and cons to the shift, prime-multiplication and prime-modulo methods commonly used. Even shift-rotate (which "brings back" the bits thrown away) still loses information in the + or ^ that will follow.
Jon Hanna
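
As a rough illustration of the shift-rotate idea mentioned in the comment above (a sketch only; the RotateLeft helper is hypothetical and not part of any answer here):

// Rotate one hash so no bits are discarded before combining, then XOR.
static int RotateLeft(int value, int count)
{
    uint bits = unchecked((uint)value);
    return unchecked((int)((bits << count) | (bits >> (32 - count))));
}

public override int GetHashCode()
{
    return startDate.GetHashCode() ^ RotateLeft(endDate.GetHashCode(), 4);
}

As the comment notes, the rotation keeps all the bits of both hashes, but the XOR that follows still loses information.
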
+4  A: 

Well, consider what characteristics a good hash function should have. It must:

  • be in agreement with Equals - that is, if Equals is true for two objects then the two hash codes have to also be the same.
  • never crash

And it should:

  • be very fast
  • give different results for similar inputs

What I would do is come up with a very simple algorithm; say, taking 16 bits from the hash code of the first and 16 bits from the hash code of the second, and combining them together. Make yourself a test case of representative samples: date ranges that are likely to actually be used, and see if this algorithm gives you a good distribution.

A common choice is to xor the two hashes together. This is not necessarily a good idea for this type because it seems likely that someone will want to represent the zero-length range that goes from X to X. If you xor the hashes of two equal DateTimes you always get zero, which seems like a recipe for a lot of hash collisions.

Eric Lippert
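
A sketch of the kind of simple algorithm described above, taking 16 bits from each date's hash code and packing them together (illustrative only, not Eric Lippert's code):

public override int GetHashCode()
{
    // Keep the low 16 bits of each DateTime hash and pack them into one int.
    int startBits = startDate.GetHashCode() & 0xFFFF;
    int endBits = endDate.GetHashCode() & 0xFFFF;
    return (startBits << 16) | endBits;
}

Whether this distributes well depends on how much of DateTime's hash lives in its low 16 bits for the ranges you actually use, which is exactly what the suggested test case would reveal.
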
+5  A: 

It depends on the values I expect to see it used with.

If it was most often going to have different day values, rather than different times on the same day, and they were within a century of now, I would use:

unchecked
{
    // 4007 is prime and roughly twice a recent year, so the seed stays small
    // for dates within a century of now.
    int hash = startDate.Year + endDate.Year - 4007;
    // 367 is prime and just above the 366 possible DayOfYear values.
    hash *= 367 + startDate.DayOfYear;
    return hash * 367 + endDate.DayOfYear;
}

This distributes the bits well with the expected values, while reducing the number of bits lost in the shifting. Note that while there are cases where dependency on primes can be surprisingly bad at collisions (esp. when the hash is fed into something that uses a modulo of the same prime in trying to avoid collisions when producing a yet-smaller hash to distribute among its buckets), I've opted to go for primes above the more obvious choices, as they're only just above and so still pretty "tight" for bit-distribution. I don't worry much about using the same prime twice, as they're so "tight" in this way, but it does hurt if you've a hash-based collection with 367 buckets. This deals well (though not as well) with dates far into the past or future, but is dreadful if the assumption that there will be few or no ranges within the same day (differing only in time) is wrong, as that information is entirely lost.

If I was expecting that assumption not to hold (or was writing for general use by other parties, and so unable to make that assumption), I'd go for:

int startHash = startDate.GetHashCode();
// Reverse the byte order of the start date's hash, then XOR with the end date's hash.
return (((startHash >> 24) & 0x000000FF)
        | ((startHash >> 8) & 0x0000FF00)
        | ((startHash << 8) & 0x00FF0000)
        | unchecked((int)((startHash << 24) & 0xFF000000)))
       ^ endDate.GetHashCode();

Whereas the first method works on the assumption that the general-purpose GetHashCode in DateTime isn't as good as we want, this one depends on it being good, but mixes around the bits of one value.

It's good at dealing with the more obvious tricky cases, such as the two values being the same, or a common distance from each other (e.g. lots of 1-day or 1-hour ranges). It's not as good at the cases where the first example works best, but the first one totally sucks if there are lots of ranges using the same day but different times.


Edit: To give a more detailed response to Dour's concern:

Dour points out, correctly, that some of the answers on this page lose data. The fact is, all of them lose data.

The class defined in the question has 8.96077483×10³⁷ different valid states (or 9.95641648×10³⁶ if we don't care about the DateTimeKind of each date), and the output of GetHashCode has 4294967296 possible states (one of which, zero, is also going to be used as the hash code of a null value, which may be commonly compared with in real code). Whatever we do, we reduce the information by a factor of 2.31815886×10²⁷. That's a lot of information to lose!
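
For reference, here's a rough sketch of where those counts come from (it assumes DateTime's full tick range and three possible DateTimeKind values; the variable names are just illustrative):

// Back-of-envelope arithmetic for the state counts quoted above.
double ticksPerDateTime = (double)DateTime.MaxValue.Ticks + 1;   // ≈ 3.16×10¹⁸
double statesIgnoringKind = ticksPerDateTime * ticksPerDateTime; // ≈ 9.96×10³⁶
double statesWithKind = statesIgnoringKind * 3 * 3;              // ≈ 8.96×10³⁷
double hashStates = 4294967296.0;                                // 2³² possible hash codes
double reduction = statesIgnoringKind / hashStates;              // ≈ 2.32×10²⁷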

It's true that we can lose more with some approaches than with others. Certainly, it's easy to prove that some solutions can lose more than others, by writing a valid, but really poor, answer.

(The worst possible valid solution is return 0;, which is valid as it never errors or mismatches on equal objects, but as poor as possible since it collides for all values. The performance of a hash-based collection becomes O(n), and slow as O(n) operations go, as the constants involved are higher than for such O(n) operations as searching an unordered list.)

It's difficult to measure just how much is lost. How much more does shifting some bits before XORing lose than swapping bits, considering that the XOR halves the amount of information left? Even the naïve x ^ y doesn't lose more than swap-and-xor; it just collides more on common values, while swap-and-xor will collide on values where plain xor does not.

Once we've got a choice between solutions that don't lose much more information than they have to, and that return 4294967296 (or close to 4294967296) possible values with a good distribution between those values, then the question is no longer how much information is lost (whatever we do, only 4.31376821×10⁻²⁸ of the original information remains) but which information is lost.

This is why my first suggestion above ignores the time components. There are 864000000000 "ticks" (the 100-nanosecond units DateTime has a resolution of) in a day, and I throw away two chunks of those ticks (7.46496×10²³ possible values between the two) on purpose, because I'm thinking of a scenario where that information is not used anyway. In this case I've deliberately structured the mechanism to pick which information gets lost; that improves the hash for the given situation, but makes it absolutely worthless if we have different values all with start and end dates falling on the same days but at different times.

Likewise x ^ y doesn't lose any more information than any of the others, but the information that it does lose is more likely to be significant than with other choices.

In the absence of any way to predict which information is likely to be of importance (esp. if your class will be public and its hash code used by external code), then we are more restricted in the assumptions we can safely make.

As a whole, prime-multiplication or prime-modulo methods are better than shift-based methods in terms of which information they lose, except when the same prime is used in a further hashing step that may take place inside a hash-based collection, ironically with the same goal in mind (no number is relatively prime to itself, not even primes), in which case they are much worse. On the other hand, shift-based methods really fall down if fed into a further shift-based hash. There is no perfect hash for arbitrary data and arbitrary use (except when a class has few enough valid values that we can match them all, in which case what we produce is more strictly an encoding than a hash).

In short, you're going to lose information whatever you do; it's which information you lose that's important.

Jon Hanna
Your second algorithm addresses my concern of more data being lost from some components than from others. Good idea.
Dour High Arch
@Dour, it has its pros and cons, and I think you're overstating the issue of losing data, as I detail more fully in my edited answer.
Jon Hanna
@Jon, I understand that discarding data is unavoidable. My concern is that some methods lose different amounts from each piece of data used in the same hash. Consider an iterated version of Rob's hash, shifting 4 bits for each iteration. This will discard all bits of all dates except for the last eight dates considered; it will include only the 8 low bits of the eighth date, and so on, until it includes all bits of the final date considered. I agree it's "which" data you lose, but it should lose roughly equal amounts from each datum used in the hash; the shift and multiply methods do not do that.
Dour High Arch
No, shift and multiply **do** lose roughly equal amounts, unless you can see something wrong with the math above. An iterated version of Rob's hash would indeed be a bad idea for the reasons you give, which is a case both of what is lost and of effort being wasted (if you're going to drop most of your input, don't waste time looking at it). Rob's hash though is not iterated. My byte-swapping approach above doesn't work well iterated over many items either, though I would bet on it beating Rob's at collision avoidance. If we needed to input many items we'd use a modulo approach. Both Rob and I hit close...
Jon Hanna
... to the total possible amount of information retained, but differ on which it is. Alternatively if you have 3 pieces of data and 2 rarely change and one does wildly, just go ahead and drop the rarely changing 2! It really isn't the amount that matters, esp. once you've hit 2³² possible results already.
Jon Hanna