views:

1667

answers:

2

Note: The following SO questions are related, but neither they nor the linked resources seem to fully answer my questions, particularly in relation to implementing equality tests for collections of objects.


Background

NSObject provides default implementations of -hash (which returns the address of the instance, like (NSUInteger)self) and -isEqual: (which returns NO unless the addresses of the receiver and the parameter are identical). These methods are designed to be overridden as necessary, but the documentation makes it clear that you should provide both or neither. Further, if -isEqual: returns YES for two objects, then the result of -hash for those objects must be the same. If not, problems can ensue when objects that should be the same — such as two string instances for which -compare: returns NSOrderedSame — are added to a Cocoa collection or compared directly.

Context

I develop CHDataStructures.framework, an open-source library of Objective-C data structures. I have implemented a number of collections, and am currently refining and enhancing their functionality. One of the features I want to add is the ability to compare collections for equality with another.

Rather than comparing only memory addresses, these comparisons should consider the objects present in the two collections (including ordering, if applicable). This approach has quite a precedent in Cocoa, and generally uses a separate method, including the following:

I want to make my custom collections robust to tests of equality, so they may safely (and predictably) be added to other collections, and allow others (like an NSSet) to determine whether two collections are equal/equivalent/duplicates.

Problems

An -isEqualTo...: method works great on its own, but classes which define these methods usually also override -isEqual: to invoke [self isEqualTo...:] if the parameter is of the same class (or perhaps subclass) as the receiver, or [super isEqual:] otherwise. This means the class must also define -hash such that it will return the same value for disparate instances that have the same contents.

In addition, Apple's documentation for -hash stipulates the following: (emphasis mine)

"If a mutable object is added to a collection that uses hash values to determine the object's position in the collection, the value returned by the hash method of the object must not change while the object is in the collection. Therefore, either the hash method must not rely on any of the object's internal state information or you must make sure the object's internal state information does not change while the object is in the collection. Thus, for example, a mutable dictionary can be put in a hash table but you must not change it while it is in there. (Note that it can be difficult to know whether or not a given object is in a collection.)"

Edit: I definitely understand why this is necessary and totally agree with the reasoning — I mentioned it here to provide additional context, and skirted the topic of why it's the case for the sake of brevity.

All of my collections are mutable, and the hash will have to consider at least some of the contents, so the only option here is to consider it a programming error to mutate a collection stored in another collection. (My collections all adopt NSCopying, so collections like NSDictionary can successfully make a copy to use as a key, etc.)

It makes sense for me to implement -isEqual: and -hash, since (for example) an indirect user of one of my classes may not know the specific -isEqualTo...: method to call, or even care whether two objects are instances of the same class. They should be able to call -isEqual: or -hash on any variable of type id and get the expected result.

Unlike -isEqual: (which has access to two instances being compared), -hash must return a result "blindly", with access only to the data within a particular instance. Since it can't know what the hash is being used for, the result must be consistent for all possible instances that should be considered equal/identical, and must always agree with -isEqual:. (Edit: This has been debunked by the answers below, and it certainly makes life easier.) Further, writing good hash functions is non-trivial — guaranteeing uniqueness is a challenge, especially when you only have an NSUInteger (32/64 bits) in which to represent it.

Questions

  1. Are there best practices when implementing equality comparisons -hash for collections?
  2. Are there any peculiarities to plan for in Objective-C and Cocoa-esque collections?
  3. Are there any good approaches for unit testing -hash with a reasonable degree of confidence?
  4. Any suggestions on implementing -hash to agree with -isEqual: for collections containing elements of arbitrary types? What pitfalls should I know about? (Edit: Not as problematic as I first thought — as @kperryua points out, "equal -hash values do not imply -isEqual:".)


Edit: I should have clarified that I'm not confused about how to implement -isEqual: or -isEqualTo...: for collections, that's straightforward. I think my confusion stemmed mainly from (mistakenly) thinking that -hash MUST return a different value if -isEqual: returns NO. Having done cryptography in the past, I was thinking that hashes for different values MUST be different. However, the answers below made me realize that a "good" hash function is really about minimizing bucket collisions and chaining for collections that use -hash. While unique hashes are preferable, they are not a strict requirement.

+2  A: 

Two collections should be considered equal if they contain the same elements, and further if the collections are ordered, that the elements are in the same order.

On the subject of hashes for collections, it should be enough to combine the hashes of the elements in some way (XOR them or modulo add them). Note that while the rules state that two objects that are equal according to IsEqual need to return the same hash, the opposite does not hold : Although uniqueness of hashes is desireable, it is not necessary for correctness of the solution. Thus an ordered collection need not take account of the order of the elements.

The excerpt from the Apple documentation is a necessary restriction by the way. An object could not maintain the same hash value under mutation while also ensuring that objects with the same value have the same hash. That applies for the simplest of objects as well as collections. Of course it only usually matters that an object's hash changes when it is inside a container that uses the hash to organise it's elements. The upshot of all this is that mutable collections shouldn't mutate when placed inside another container, but then neither should any object that has a true hash function.

U62
I've edited my question to clarify my prior understanding of your first and last paragraphs. Thanks for pointing out that -hash is not *required* to return different values for disparate values. It didn't make sense at first, but a quick refresher on collections that use hashes reminded me about buckets and chaining. Your suggestions for creating a hash are okay, but @kperryua raises a great point about scalability of hashes, since I'm talking about collections.
Quinn Taylor
+6  A: 

I think trying to come up with some generally useful hash function that will generate unique hash values for collections is an exercise in futility. U62's suggestion of combining the hashes of all the contents will not scale well, as it makes the hash function O(n). Hash functions should really be O(1) to ensure good performance, otherwise the purpose of the hash is defeated. (Consider the common Cocoa construct of plists, which are dictionaries containing arrays and other dictionaries, potentially ad nauseum. Attempting to take the hash of the top-level dictionary of a large plist would be excruciatingly slow if the collections' hash functions were O(n).)

My suggestion would be not to worry a great deal about a collection's hash. As you stated, -isEqual: implies equal -hash values. On the other hand, equal -hash values do not imply -isEqual:. That fact gives you a lot of leeway to create a simple hash.

If you're really worried about collisions though (and you have proof in concrete measurements of real-world situations that confirm it is something to be worried about), you could still follow U62's advice to some degree. For example, you could take the hash of, say, the first and/or last element in the collection, and combine that with, say, the -count of the collection. That be enough to provide a decent hash.

I hope that answers at least one of your questions.

As for No. 1: Implementing -isEqual: is pretty cut and dry. You enumerate the contents, and check isEqual: on each of the elements.

There is one thing to be careful of that may affect what you decide to do for your collections' -hash functions. Clients of your collections must also understand the rules governing -isEqual: and -hash. If you use the contents' -hash in your collection's -hash, your collection will break if the contents' isEqual: and -hash don't agree. It's the client's fault, of course, but that's another argument against basing your -hash off of the collection's contents.

No. 2 is kind of vague. Not sure what you have in mind there.

kperryua
Good answer. The fact that "equal -hash values do *not* imply -isEqual:" vastly simplified the scope of the problem and changed the way I was thinking about it. I am wondering if it makes sense to cache the hash value inside the object, and update it whenever an object is added or removed. This amortizes the cost of calculating the hash, and only updates it when the collection is modified, not each time -hash is called. For the hash itself, of course it's ideal to achieve a consistent distribution, but it's a balance with speed. I'll try the XOR of count, first, and last (if > 1) for now. :-)
Quinn Taylor
With respect to my 2nd question, I guess my concern is about matching the way Foundation collections implement isEqual: and hash. Since the Objective-C counterparts to the Core Foundation collections are closed source, I have no way of knowing if I'm doing it "the Cocoa way". I want to emulate the behavior of existing collections to obey the "Principle of Least Surprise". Also, the equality contracts in Cocoa are slightly different from languages like Java. I've been reading this post for ideas: http://www.karlkraft.com/index.php/2008/01/07/equality-vs-identity/
Quinn Taylor