tags:

views:

515

answers:

9

We are storing a String key in a HashMap that is a concatenation of three String fields and a boolean field. Problem is duplicate keys can be created if the delimiter appears in the field value.

So to get around this, based on advice in another post, I'm planning on creating a key class which will be used as the HashMap key:

class TheKey {
  public final String k1;
  public final String k2;
  public final String k3;
  public final boolean k4;

  public TheKey(String k1, String k2, String k3, boolean k4) {
    this.k1 = k1; this.k2 = k2; this.k3 = k3; this.k4 = k4;
  }

  public boolean equals(Object o) {
      TheKey other = (TheKey) o;
      //return true if all four fields are equal
  }

  public int hashCode() {
    return ???;  
  }
}

My questions are:

  1. What value should be returned from hashCode(). The map will hold a total of about 30 values. Of those 30, there are about 10 distinct values of k1 (some entries share the same k1 value).
  2. To store this key class as the HashMap key, does one only need to override the equals() and hashCode() methods? Is anything else required?
+8  A: 

Just hashCode and equals should be fine. The hashCode could look something like this:

public int hashCode() {
  int hash = 17;
  hash = hash * 31 + k1.hashCode();
  hash = hash * 31 + k2.hashCode();
  hash = hash * 31 + k3.hashCode();
  hash = hash * 31 + k4 ? 0 : 1;
  return hash;
}

That's assuming none of the keys can be null, of course. Typically you could use 0 as the "logical" hash code for a null reference in the above equation. Two useful methods for compound equality/hash code which needs to deal with nulls:

public static boolean equals(Object o1, Object o2) {
  if (o1 == o2) {
    return true;
  }
  if (o1 == null || o2 == null) {
    return false;
  }
  return o1.equals(o2);
}

public static boolean hashCode(Object o) {
  return o == null ? 0 : o.hashCode();
}

Using the latter method in the hash algorithm at the start of this answer, you'd end up with something like:

public int hashCode() {
  int hash = 17;
  hash = hash * 31 + ObjectUtil.hashCode(k1);
  hash = hash * 31 + ObjectUtil.hashCode(k2);
  hash = hash * 31 + ObjectUtil.hashCode(k3);
  hash = hash * 31 + k4 ? 0 : 1;
  return hash;
}
Jon Skeet
Yeah, you'd want to test the keys for null in the constructor before assigning them to fields.
Tom Hawtin - tackline
@Tom: Maybe - maybe not. It might be entirely valid to have some null elements.
Jon Skeet
+1  A: 

Have you taken a look at the specifications of hashCode()? Perhaps this will give you a better idea of what the function should return.

Paul Lammertsma
Right, I do understand the contract. Just not that familiar on best practices for actually creating the hash code.
Marcus
+2  A: 

The implementation of your hashCode() doesn't matter much unless you make it super stupid. You could very well just return the sum of all the strings hash codes (truncated to an int) but you should make sure you fix this:

  1. If your hash code implementation is slow, consider caching it in the instance. Depending on how long your key objects stick around and how they are used with the hash table when you get things out of it you may not want to spend longer than necessary calculating the same value over and over again. If you stick with Jon's implementation of hashCode() there is probably no need for it as String already cache its hashCode() for you.
    This is however more of a general advice, since the mid 90's I've seen quite a few developers get stung on slow (and even worse, changing) hashCode() implementations.

  2. Don't be sloppy when you create the equals() implementation. Your equals() above will be both ineffective and flawed. First of all you don't need to compare the values if the objects have different hash codes. You should also return false (and not a null pointer exception) if you get a null as the argument.

The rules are simple, this page will walk you through them.

Edit: I have to ask one more thing... You say "Problem is duplicate keys can be created if the delimiter appears in the field value". Why is that? If the format is key+delimiter+key+delimiter+key it really doesn't matter if there are one or more delimiters in the keys unless you get really unlucky with a combination of two keys and in that case you probably should have selected another delimiter (there are quite a few to choose from in unicode).

Anyway, Jon is right in his comment below... Don't do caching "until you've proven it's a good thing". It is a good practice always.

Fredrik
Right. Our fields don't use the delimiter but ideally we don't allow for future bugs cropping up.
Marcus
Here's an example of hypothetically generating dup keys:field1="a-", field2="b-", field3="c" -> key="a--b--c"field1="a", field2="-b", field3="-c" -> key="a--b--c"
Marcus
I wouldn't recommend caching the hash until you've proven it's a good thing. The hash of any one key is usually only calculated once or twice in most situations. It adds memory use and more importantly complexity for probably little or no benefit. Find out whether there's actually any issue before complicating things unnecessarily.
Jon Skeet
@Jon: I almost agree with you... It will add complexity and four bytes of memory use per element but if the sole purpose of the object is to use it in a HashMap, it will most likely get called quite a lot more than once or twice, it is not a typical situation for an object. But of course, one shouldn't do premature optimization and if it is worth it depends both on the implementation he ends up with and the requirements he has.
Fredrik
@Fredrik: Why would it be called more than once? The map itself will calculate the hash of a key when it's added to the map, and then store that hash as part of the map. When a key is used for a query, it will then calculate the hash of *that* once. You'd only benefit from a cache if you were querying using the same key object multiple times - which I find to be a relatively rare use case.
Jon Skeet
@Marcus: Yes, I realize the risk but just as you will see an increased risk if you change the dash delimiter to a c delimiter, there are quite a few non-printable unicode characters (and even ascii ones) that you could probably use instead of over engineering a rather simple problem. Or do you risk having keys with bells and formfeeds or the combination of the two?
Fredrik
@Jon: It depends on his implementation of course but it is quite common that people call hashCode() as a first cheap check in their equals (easily avoided if it is cached btw). The fact that the HashMap happens to calculate its own hash based on the key on insert/get/put doesn't mean it is never used elsewhere. In HashMap, hashCode() is called on a key when it enters the collection (for get/put/remove/whatever) and maybe at rehash (long time since I read the source), but you know that.
Fredrik
@Fredrik: Yes, it *may* be used more than once... but I think it's a stretch to go from that to "it will most likely get called quite a lot more than once or twice." It's *possible*, but it's rare in my experience. I think it's much more common for `hashCode` to be called relatively rarely - rarely enough for the added complexity (and memory cost) of a cache to be more than the benefit.
Jon Skeet
(In particular, your "millions of times" might suggest to the OP that the hash code is computed lots of times within the hash table, either for get or for put. Even if you didn't mean that, the implication is there IMO.)
Jon Skeet
@Jon: I said I almost agree with you and I edited my post to give you cred for it. Removing the comment you seem to have got stuck on would make the rest of the comments rather hard to understand and I think we agree. Another thing that matters is of course if you create new objects for all operations or if you reuse the keys because you get them from somewhere. However, you are totally right in that one should avoid premature optimization.
Fredrik
+6  A: 

In Eclipse you can generate hashCode and equals by Alt-Shift-S h.

starblue
A: 

I do not know if this is an option for you but apache commons library provides an implementation for MultiKeyMap

Wolfgang
+1  A: 

Ask Eclipse 3.5 to create the hashcode and equals methods for you :)

Thorbjørn Ravn Andersen
+2  A: 

this is how a well-formed equals class with equals ans hashCode should look like: (generated with intellij idea, with null checks enabled)

class TheKey {
public final String k1;
public final String k2;
public final String k3;
public final boolean k4;

public TheKey(String k1, String k2, String k3, boolean k4) {
    this.k1 = k1;
    this.k2 = k2;
    this.k3 = k3;
    this.k4 = k4;
}

@Override
public boolean equals(Object o) {
    if (this == o) return true;
    if (o == null || getClass() != o.getClass()) return false;

    TheKey theKey = (TheKey) o;

    if (k4 != theKey.k4) return false;
    if (k1 != null ? !k1.equals(theKey.k1) : theKey.k1 != null) return false;
    if (k2 != null ? !k2.equals(theKey.k2) : theKey.k2 != null) return false;
    if (k3 != null ? !k3.equals(theKey.k3) : theKey.k3 != null) return false;

    return true;
}

@Override
public int hashCode() {
    int result = k1 != null ? k1.hashCode() : 0;
    result = 31 * result + (k2 != null ? k2.hashCode() : 0);
    result = 31 * result + (k3 != null ? k3.hashCode() : 0);
    result = 31 * result + (k4 ? 1 : 0);
    return result;
}
}
Andreas Petersson
+1  A: 

For the hashCode, you could instead use something like

k1.hashCode() ^ k2.hashCode() ^ k3.hashCode() ^ k4.hashCode()

XOR is entropy-preserving, and this incorporates k4's hashCode in a much better way than the previous suggestions. Just having one bit of information from k4 means that if all your composite keys have identical k1, k2, k3 and only differing k4s, your hash codes will all be identical and you'll get a degenerate HashMap.

Steven Schlansker
+1  A: 

I thought your main concern was speed (based on your original post)? Why don't you just make sure you use a separator which does not occur in your (handfull of) field values? Then you can just create String key using concatenation and do away with all this 'key-class' hocus pocus. Smells like serious over-engineering to me.

Adriaan Koster
Point taken. May do that. I just sleep at better knowing that code is bug free and extensible. I'd hate to find out six months from now that there will now be a field value including one of my delimiters..
Marcus