I would like to calculate a hash of a Python class containing a dataset for Machine Learning. The hash is meant to be used for caching, so I was thinking of md5 or sha1. The problem is that most of the data is stored in NumPy arrays; these do not provide a __hash__() member. Currently I do a pickle.dumps() for each member and calculate a hash based on these strings. However, I have come across reports indicating that the same object can lead to different serialization strings, so pickling does not seem to be a reliable basis for a hash.
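For reference, a minimal sketch of what I currently do (the class and member names here are just placeholders):

import hashlib
import pickle

class Dataset(object):
    """Placeholder for the real dataset class."""
    def __init__(self, inputs, targets):
        self.inputs = inputs    # NumPy array
        self.targets = targets  # NumPy array

    def cache_key(self):
        m = hashlib.md5()
        # Sort member names so the hash doesn't depend on dict ordering.
        for name in sorted(vars(self)):
            m.update(pickle.dumps(getattr(self, name)))
        return m.hexdigest()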

What would be the best method to calculate a hash for a Python class containing Numpy arrays?

+2  A: 

What's the format of the data in the arrays? Couldn't you just iterate through the arrays, convert them into a string (via some reproducible means) and then feed that into your hash via update?

e.g.

import hashlib
m = hashlib.md5() # or sha1 etc
for value in array: # array contains the data
    m.update(str(value).encode()) # encode: hashlib wants bytes, not str

Don't forget, though, that NumPy arrays won't provide __hash__() because they are mutable. So be careful not to modify the arrays after you've calculated your hash (otherwise the digest will no longer match).
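A quick sketch with made-up values shows why that matters; the digest changes as soon as the array does:

import hashlib
import numpy

def digest(arr):
    m = hashlib.md5()
    for value in arr:
        m.update(str(value).encode())
    return m.hexdigest()

a = numpy.zeros(10)
before = digest(a)
a[0] = 1.0                  # mutate the array in place
assert digest(a) != before  # the previously cached digest is now stale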

John Montgomery
Thanks, your post helped me solve this problem. See below...
+3  A: 

Thanks to John Montgomery I think I have found a solution, and one with less overhead than converting every number in a possibly huge array to a string:

I can create a byte view of the arrays and use it to update the hash. This gives the same digest as updating the hash with the array directly, since both expose the same underlying buffer of bytes:

>>> import hashlib
>>> import numpy
>>> a = numpy.random.rand(10, 100)
>>> b = a.view(numpy.uint8)
>>> print(a.dtype, b.dtype) # a and b have different data types
float64 uint8
>>> hashlib.sha1(a).hexdigest() # hash of the original array
'794de7b1316b38d989a9040e6e26b9256ca3b5eb'
>>> hashlib.sha1(b).hexdigest() # hash of the byte view
'794de7b1316b38d989a9040e6e26b9256ca3b5eb'
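Building on that, a sketch of how the whole class could be hashed in one go (assuming every member is a NumPy array; numpy.ascontiguousarray guards against non-contiguous views, which cannot be reinterpreted as uint8):

import hashlib
import numpy

def class_digest(obj):
    # Sketch: SHA-1 over the raw bytes of every NumPy array member.
    m = hashlib.sha1()
    for name in sorted(vars(obj)):  # fixed order for a stable digest
        arr = numpy.ascontiguousarray(getattr(obj, name))
        m.update(arr.view(numpy.uint8))
    return m.hexdigest()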
Will you be able to re-create the object from the cache with this technique? It seems like you will only be able to get an array of type uint8 back (sacrificing the accuracy in your array).
tgray
Using John Montgomery's solution, it looks like you would get back a float64 array.
tgray
@tgray: Sometimes it doesn't matter all that much what the accuracy is. Experimental datasets, especially large ones, tend to have large uncertainties anyway. Obviously this is subject to context, but the general rule is that double precision is important for the calculation, not for storing the data or the final answer.
Tim Lin
A: 

array.data is always hashable, because it's a buffer object. Easy :) (Unless you care about the difference between differently-shaped arrays that contain the exact same data; i.e., this is suitable unless shape, byte order, and other array 'parameters' must also figure into the hash.)
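If those parameters do need to figure in, one possible sketch is to feed them into the digest alongside the raw bytes:

import hashlib
import numpy

def full_digest(arr):
    # Sketch: hash shape and dtype (which encodes byte order) plus raw bytes.
    m = hashlib.sha1()
    m.update(str(arr.shape).encode())
    m.update(arr.dtype.str.encode())  # e.g. '<f8': byte order, kind, size
    m.update(numpy.ascontiguousarray(arr).view(numpy.uint8))
    return m.hexdigest()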