Implementing -hash / -isEqual: / -isEqualTo...: for Objective-C collections

objective-c cocoa data-structures equality chdatastructures

Quinn Taylor · Jul 11, 2009 · Viewed 16.6k times · Source

Note: The following SO questions are related, but neither they nor the linked resources seem to fully answer my questions, particularly in relation to implementing equality tests for collections of objects.

Background

NSObject provides default implementations of -hash (which returns the address of the instance, like (NSUInteger)self) and -isEqual: (which returns NO unless the addresses of the receiver and the parameter are identical). These methods are designed to be overridden as necessary, but the documentation makes it clear that you should provide both or neither. Further, if -isEqual: returns YES for two objects, then the result of -hash for those objects must be the same. If not, problems can ensue when objects that should be the same — such as two string instances for which -compare: returns NSOrderedSame — are added to a Cocoa collection or compared directly.

Context

I develop CHDataStructures.framework, an open-source library of Objective-C data structures. I have implemented a number of collections, and am currently refining and enhancing their functionality. One of the features I want to add is the ability to compare collections for equality with another.

Rather than comparing only memory addresses, these comparisons should consider the objects present in the two collections (including ordering, if applicable). This approach has quite a precedent in Cocoa, and generally uses a separate method, including the following:

I want to make my custom collections robust to tests of equality, so they may safely (and predictably) be added to other collections, and allow others (like an NSSet) to determine whether two collections are equal/equivalent/duplicates.

Problems

An -isEqualTo...: method works great on its own, but classes which define these methods usually also override -isEqual: to invoke [self isEqualTo...:] if the parameter is of the same class (or perhaps subclass) as the receiver, or [super isEqual:] otherwise. This means the class must also define -hash such that it will return the same value for disparate instances that have the same contents.

In addition, Apple's documentation for -hash stipulates the following: (emphasis mine)

"If a mutable object is added to a collection that uses hash values to determine the object's position in the collection, the value returned by the hash method of the object must not change while the object is in the collection. Therefore, either the hash method must not rely on any of the object's internal state information or you must make sure the object's internal state information does not change while the object is in the collection. Thus, for example, a mutable dictionary can be put in a hash table but you must not change it while it is in there. (Note that it can be difficult to know whether or not a given object is in a collection.)"

Edit: I definitely understand why this is necessary and totally agree with the reasoning — I mentioned it here to provide additional context, and skirted the topic of why it's the case for the sake of brevity.

All of my collections are mutable, and the hash will have to consider at least some of the contents, so the only option here is to consider it a programming error to mutate a collection stored in another collection. (My collections all adopt NSCopying, so collections like NSDictionary can successfully make a copy to use as a key, etc.)

It makes sense for me to implement -isEqual: and -hash, since (for example) an indirect user of one of my classes may not know the specific -isEqualTo...: method to call, or even care whether two objects are instances of the same class. They should be able to call -isEqual: or -hash on any variable of type id and get the expected result.

Unlike -isEqual: (which has access to two instances being compared), -hash must return a result "blindly", with access only to the data within a particular instance. ~~Since it can't know what the hash is being used for, the result must be consistent for all possible instances that should be considered equal/identical, and must always agree with -isEqual:~~. (Edit: This has been debunked by the answers below, and it certainly makes life easier.) Further, writing good hash functions is non-trivial — guaranteeing uniqueness is a challenge, especially when you only have an NSUInteger (32/64 bits) in which to represent it.

Questions

Are there best practices when implementing ~~equality comparisons~~ -hash for collections?
Are there any peculiarities to plan for in Objective-C and Cocoa-esque collections?
Are there any good approaches for unit testing -hash with a reasonable degree of confidence?
Any suggestions on implementing -hash to agree with -isEqual: for collections containing elements of arbitrary types? What pitfalls should I know about? (Edit: Not as problematic as I first thought — as @kperryua points out, "equal -hash values do not imply -isEqual:".)

Edit: I should have clarified that I'm not confused about how to implement -isEqual: or -isEqualTo...: for collections, that's straightforward. I think my confusion stemmed mainly from (mistakenly) thinking that -hash MUST return a different value if -isEqual: returns NO. Having done cryptography in the past, I was thinking that hashes for different values MUST be different. However, the answers below made me realize that a "good" hash function is really about minimizing bucket collisions and chaining for collections that use -hash. While unique hashes are preferable, they are not a strict requirement.

Answer

I think trying to come up with some generally useful hash function that will generate unique hash values for collections is an exercise in futility. U62's suggestion of combining the hashes of all the contents will not scale well, as it makes the hash function O(n). Hash functions should really be O(1) to ensure good performance, otherwise the purpose of the hash is defeated. (Consider the common Cocoa construct of plists, which are dictionaries containing arrays and other dictionaries, potentially ad nauseum. Attempting to take the hash of the top-level dictionary of a large plist would be excruciatingly slow if the collections' hash functions were O(n).)

My suggestion would be not to worry a great deal about a collection's hash. As you stated, -isEqual: implies equal -hash values. On the other hand, equal -hash values do not imply -isEqual:. That fact gives you a lot of leeway to create a simple hash.

If you're really worried about collisions though (and you have proof in concrete measurements of real-world situations that confirm it is something to be worried about), you could still follow U62's advice to some degree. For example, you could take the hash of, say, the first and/or last element in the collection, and combine that with, say, the -count of the collection. That be enough to provide a decent hash.

I hope that answers at least one of your questions.

As for No. 1: Implementing -isEqual: is pretty cut and dry. You enumerate the contents, and check isEqual: on each of the elements.

There is one thing to be careful of that may affect what you decide to do for your collections' -hash functions. Clients of your collections must also understand the rules governing -isEqual: and -hash. If you use the contents' -hash in your collection's -hash, your collection will break if the contents' isEqual: and -hash don't agree. It's the client's fault, of course, but that's another argument against basing your -hash off of the collection's contents.

No. 2 is kind of vague. Not sure what you have in mind there.