views:

117

answers:

2

I'd like to create a hashlib instance, update() it, then persist its state in some way. Later, I'd like to recreate the object using this state data, and continue to update() it. Finally, I'd like to get the hexdigest() o the total cumulative run of data. State persistence has to survive across multiple runs.

Example:

import hashlib
m = hashlib.sha1()
m.update('one')
m.update('two')
# somehow, persist the state of m here

#later, possibly in another process
# recreate m from the persisted state
m.update('three')
m.update('four')
print m.hexdigest()
# at this point, m.hexdigest() should be equal to hashlib.sha1().update('onetwothreefour').hextdigets()
+1  A: 

hashlib.sha1 is a wrapper around a C library so you won't be able to pickle it.

It would need to implement the __getstate__ and __setstate__ methods for Python to access its internal state

You could use a pure Python implementation of sha1 if it is fast enough for your requirements

gnibbler
A: 

You can easily build a wrapper object around the hash object which can transparently persist the data.

The obvious drawback is that it needs to retain the hashed data in full in order to restore the state - so depending on the data size you are dealing with, this may not suit your needs. But it should work fine up to some tens of MB.

Unfortunattely the hashlib does not expose the hash algorithms as proper classes, it rathers gives factory functions that construct the hash objects - so we can't properly subclass those without loading reserved symbols - a situation I'd rather avoid. That only means you have to built your wrapper class from the start, which is not such that an overhead from Python anyway.

here is a sample code that might even fill your needs:

import hashlib
from cStringIO import StringIO

class PersistentSha1(object):
    def __init__(self, salt=""):
        self.__setstate__(salt)

    def update(self, data):
        self.__data.write(data)
        self.hash.update(data)

    def __getattr__(self, attr):
        return getattr(self.hash, attr)

    def __setstate__(self, salt=""):
        self.__data = StringIO()
        self.__data.write(salt)
        self.hash = hashlib.sha1(salt)

    def __getstate__(self):
        return self.data

    def _get_data(self):
        self.__data.seek(0)
        return self.__data.read()

    data = property(_get_data, __setstate__)

You can access the "data" member itself to get and set the state straight, or you can use python pickling functions:

>>> a = PersistentSha1()
>>> a
<__main__.PersistentSha1 object at 0xb7d10f0c>
>>> a.update("lixo")
>>> a.data
'lixo'
>>> a.hexdigest()
'6d6332a54574aeb35dcde5cf6a8774f938a65bec'
>>> import pickle
>>> b = pickle.dumps(a)
>>>
>>> c = pickle.loads(b)
>>> c.hexdigest()
'6d6332a54574aeb35dcde5cf6a8774f938a65bec'

>>> c.data
'lixo'
jsbueno
This is a good example of how to biuld a picklable class but storing the data being hashed is a no go, it could be huge. The hash context itself is tiny, but it seems that Python may not expose it.
anthony