tags:

views:

112

answers:

3

I have a class called GraphEdge which I would like to be uniquely defined within a set (the built-in set type) by its tail and head members, which are set via __init__.

If I do not define __hash__, I see the following behaviour:

>>> E = GraphEdge('A', 'B')
>>> H = GraphEdge('A', 'B')
>>> hash(E)
139731804758160
>>> hash(H)
139731804760784
>>> S = set()
>>> S.add(E)
>>> S.add(H)
>>> S
set([('A', 'B'), ('A', 'B')])

The set has no way to know that E and H are the same by my definition, since they have differing hashes (which is what the set type uses to determine uniqueness, to my knowledge), so it adds both as distinct elements. So I define a rather naive hash function for GraphEdge like so:

def __hash__( self ):
    return hash( self.tail ) ^ hash( self.head )

Now the above works as expected:

>>> E = GraphEdge('A', 'B')
>>> H = GraphEdge('A', 'B')
>>> hash(E)
409150083
>>> hash(H)
409150083
>>> S = set()
>>> S.add(E)
>>> S.add(H)
>>> S
set([('A', 'B')])

But clearly, ('A', 'B') and ('B', 'A') in this case will return the same hash, so I would expect that I could not add ('B', 'A') to a set already containing ('A', 'B'). But this is not what happens:

>>> E = GraphEdge('A', 'B')
>>> H = GraphEdge('B', 'A')
>>> hash(E)
409150083
>>> hash(H)
409150083
>>> S = set()
>>> S.add(E)
>>> S.add(H)
>>> S
set([('A', 'B'), ('B', 'A')])

So is the set type using the hash or not? If so, how is the last scenario possible? If not, why doesn't the first scenario (no __hash__ defined) work? Am I missing something?

Edit: For reference for future readers, I already had __eq__ defined (also based on tail and head).

+8  A: 

You have a hash collision. On hash collision, the set uses the == operator to check on whether or not they are truly equal to each other.

Noctis Skytower
Ah, that makes sense. Would it be better practice to simply always return 0 as the hash to force set to use the == operator? I can't see it introducing much more overhead than the hash function itself would otherwise.
ezod
Ensuring that every single object will have a hash collision is guaranteed to kill your performance. Hashes are used to file objects inside the set.
jleedev
@jleedev: Noted, thanks. Using a somewhat better hash function now.
ezod
+3  A: 

It's important to understand how hash and == work together, because both are used by sets. For two values x and y, the important rule is that:

x == y ==> hash(x) == hash(y)

(x equals y implies that x and y's hashes are equal). But, the inverse is not true: two unequal values can have the same hash.

Sets (and dicts) will use the hash to get an approximation to equality, but will use the real equality operation to figure out if two values are the same or not.

Ned Batchelder
+2  A: 

You should always define both __eq__() and __hash__() if you need at least one of them. If the hashes of two objects are equal, an extra __eq__() check is done to verify uniqueness.

Max Shawabkeh