I would like to calculate a hash of a Python class containing a dataset for Machine Learning. The hash is meant to be used for caching, so I was thinking of md5 or sha1. The problem is that most of the data is stored in NumPy arrays; these do not provide a __hash__() member. Currently I do a pickle.dumps() for each member and calculate a hash based on these strings. However, I have come across reports indicating that the same object can lead to different serialization strings, so pickling does not seem to be a reliable basis for a hash.
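For reference, a minimal sketch of what I currently do (the class and member names here are just placeholders):

import hashlib
import pickle

class Dataset(object):
    """Placeholder for the real dataset class."""
    def __init__(self, inputs, targets):
        self.inputs = inputs    # NumPy array
        self.targets = targets  # NumPy array

    def cache_key(self):
        m = hashlib.md5()
        # Sort member names so the hash doesn't depend on dict ordering.
        for name in sorted(vars(self)):
            m.update(pickle.dumps(getattr(self, name)))
        return m.hexdigest()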

What would be the best method to calculate a hash for a Python class containing Numpy arrays?

+2  A: 

What's the format of the data in the arrays? Couldn't you just iterate through the arrays, convert them into a string (via some reproducible means) and then feed that into your hash via update?

e.g.

import hashlib
m = hashlib.md5() # or sha1 etc
for value in array: # array contains the data
    m.update(str(value).encode()) # encode: hashlib wants bytes, not str

Don't forget, though, that NumPy arrays won't provide __hash__() because they are mutable. So be careful not to modify the arrays after you've calculated your hash (otherwise the digest will no longer match).
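A quick sketch with made-up values shows why that matters; the digest changes as soon as the array does:

import hashlib
import numpy

def digest(arr):
    m = hashlib.md5()
    for value in arr:
        m.update(str(value).encode())
    return m.hexdigest()

a = numpy.zeros(10)
before = digest(a)
a[0] = 1.0                  # mutate the array in place
assert digest(a) != before  # the previously cached digest is now stale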

John Montgomery
Thanks, your post helped me solve this problem. See below...
+3  A: 

Thanks to John Montgomery I think I have found a solution, and one with less overhead than converting every number in a possibly huge array to a string:

I can create a byte view of the arrays and use it to update the hash. This gives the same digest as updating the hash with the array directly, since both expose the same underlying buffer of bytes:

>>> import hashlib
>>> import numpy
>>> a = numpy.random.rand(10, 100)
>>> b = a.view(numpy.uint8)
>>> print(a.dtype, b.dtype) # a and b have different data types
float64 uint8
>>> hashlib.sha1(a).hexdigest() # hash of the original array
'794de7b1316b38d989a9040e6e26b9256ca3b5eb'
>>> hashlib.sha1(b).hexdigest() # hash of the byte view
'794de7b1316b38d989a9040e6e26b9256ca3b5eb'
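Building on that, a sketch of how the whole class could be hashed in one go (assuming every member is a NumPy array; numpy.ascontiguousarray guards against non-contiguous views, which cannot be reinterpreted as uint8):

import hashlib
import numpy

def class_digest(obj):
    # Sketch: SHA-1 over the raw bytes of every NumPy array member.
    m = hashlib.sha1()
    for name in sorted(vars(obj)):  # fixed order for a stable digest
        arr = numpy.ascontiguousarray(getattr(obj, name))
        m.update(arr.view(numpy.uint8))
    return m.hexdigest()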
Will you be able to re-create the object from the cache with this technique? It seems like you will only be able to get an array of type uint8 back (sacrificing the accuracy in your array).
tgray
Using John Montgomery's solution, it looks like you would get back a float64 array.
tgray
@tgray: Sometimes it doesn't matter all that much what the accuracy is. Experimental datasets, especially large ones, tend to have large uncertainties anyway. Obviously this is subject to context, but the general rule is that double precision is important for the calculation, not for storing the data or the final answer.
Tim Lin
A: 

array.data is always hashable, because it's a buffer object. Easy :) (Unless you care about the difference between differently-shaped arrays that contain the exact same data; i.e., this is suitable unless shape, byte order, and other array 'parameters' must also figure into the hash.)
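If those parameters do need to figure in, one possible sketch is to feed them into the digest alongside the raw bytes:

import hashlib
import numpy

def full_digest(arr):
    # Sketch: hash shape and dtype (which encodes byte order) plus raw bytes.
    m = hashlib.sha1()
    m.update(str(arr.shape).encode())
    m.update(arr.dtype.str.encode())  # e.g. '<f8': byte order, kind, size
    m.update(numpy.ascontiguousarray(arr).view(numpy.uint8))
    return m.hexdigest()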