tags:

views:

112

answers:

3

I'm trying to create a custom object that behaves properly in set operations.

I've generally got it working, but I want to make sure I fully understand the implications. In particular, I'm interested in the behavior when there is additional data in the object that is not included in the equal / hash methods. It seems that in the 'intersection' operation, it returns the set of objects that are being compared to, where the 'union' operations returns the set of objects that are being compared.

To illustrate:

class MyObject:
    def __init__(self,value,meta):
        self.value = value
        self.meta = meta
    def __eq__(self,other):
        return self.value == other.value
    def __hash__(self):
        return hash(self.value)

a = MyObject('1','left')
b = MyObject('1','right')
c = MyObject('2','left')
d = MyObject('2','right')
e = MyObject('3','left')
print a == b # True
print a == c # False

for i in set([a,c,e]).intersection(set([b,d])):
    print "%s %s" % (i.value,i.meta)
#returns:
#1 right
#2 right

 for i in set([a,c,e]).union(set([b,d])):
    print "%s %s" % (i.value,i.meta)
#returns:
#1 left
#3 left
#2 left

Is this behavior documented somewhere and deterministic? If so, what is the governing principle?

+4  A: 

Nope, it's not deterministic. The problem is that you've broken equals' and hash's invariant, that two objects are equivalent when they are equal. Fix your object, don't try to be clever and abuse how set's implementation works. If the meta value is part of MyObject's identity, it should be included in eq and hash.

You can't rely on set's intersection to follow any order, so there is no way to easily do what you want. What you would end up doing is taking the intersection by value only, then look through all your objects for an older one to replace it with, for each one. No nice way to do it algorithmically.

Unions are not so bad:

##fix the eq and hash to work correctly
class MyObject:
    def __init__(self,value,meta):
        self.value = value
        self.meta = meta
    def __eq__(self,other):
        return self.value, self.meta == other.value, other.meta
    def __hash__(self):
        return hash((self.value, self.meta))
    def __repr__(self):
        return "%s %s" % (self.value,self.meta)

a = MyObject('1','left')
b = MyObject('1','right')
c = MyObject('2','left')
d = MyObject('2','right')
e = MyObject('3','left')

union =  set([a,c,e]).union(set([b,d]))
print union
#set([2 left, 2 right, 1 left, 3 left, 1 right])

##sort the objects, so that older objs come before the newer equivalents
sl = sorted(union, key= lambda x: (x.value, x.meta) )
print sl
#[1 left, 1 right, 2 left, 2 right, 3 left]
import itertools
##group the objects by value, groupby needs the objs to be in order to do this
filtered = itertools.groupby(sl, lambda x: x.value)
##make a list of the oldest (first in group)
oldest = [ next(group) for key, group in filtered]
print oldest
#[1 left, 2 left, 3 left]
hlfrk414
Looking at the docs for the __hash__ method, it doesn't appear to indicate that there can be no data in the object that is not hashed. I can think of many examples where 2 objects that are equivalent would have some form of metadata ( a timestamp or a filename, perhaps ) that are different.From the docs for __hash__: The only required property is that objects which compare equal have the same hash value; it is advised to somehow mix together (e.g. using exclusive or) the hash values for the components of the object that also play a part in comparison of objects.
Josh Arenberg
I'm confused by your comment, you seem to be agreeing with me. If an object has metadata (such as a timestamp or filename) that is ignored by eq and hash, then they are not important enough to be kept, either object being compared will do. If they were important enough to distinguish the two objects, they would be included in the hash and eq. What are you asking at this point?
hlfrk414
I'm not not agreeing with you ;). Just trying to understand how these features can be used. In this case, I have objects being created by a monitoring agent. Trying to correlate recurring alert conditions, that have different timestamps. Would prefer to keep the older objects, but of course I can implement it in many other ways, because I suspect you are correct.
Josh Arenberg
+1  A: 

Order doesn't appear to matter:

>>> [ (u.value, u.meta) for u in set([b,d]).intersection(set([a,c,e])) ]
[('1', 'right'), ('2', 'right')]

>>> [ (u.value, u.meta) for u in set([a,c,e]).intersection(set([b,d])) ]
[('1', 'right'), ('2', 'right')]

However, if you do this:

>>> f = MyObject('3', 'right')

And add f to the "right" set:

>>> [ (u.value, u.meta) for u in set([a,c,e]).intersection(set([b,d,f])) ]
[('1', 'right'), ('3', 'right'), ('2', 'right')]

>>> [ (u.value, u.meta) for u in set([b,d,f]).intersection(set([a,c,e])) ]
[('1', 'left'), ('3', 'left'), ('2', 'left')]

So you can see that the behavior depends on the size of the sets (the same effect happens if you union). It may be dependent on other factors as well. I think you're hunting through the python source if you want to know why.

Seth
A: 

Let's say your objects have two different types of attributes: key attributes and data attributes. In your example, MyObject.value is a key attribute.

Store all your objects as values in a dictionary, keyed by the key attributes, making sure that only your preferred (e.g. with the oldest timestamp) are entered in the dictionary. Perform your set operations with the same key as used in the dictionary, and retrieve the actual objects from the dictionary:

result= [dict1[k] for k in set_operation_result]
ΤΖΩΤΖΙΟΥ