The pickle module seems to use string escape characters when pickling; this becomes inefficient on, e.g., numpy arrays. Consider the following:

import numpy, cPickle
z = numpy.zeros(1000, numpy.uint8)
len(z.dumps())
len(cPickle.dumps(z.dumps()))

The lengths are 1133 characters and 4249 characters respectively.

z.dumps() reveals something like "\x00\x00" (actual zero bytes in the string), but pickle seems to be using the string's repr() function, yielding "'\x00\x00'" (each zero byte written out as the four ASCII characters \x00).

i.e. "0" not in z.dumps(), but "0" in cPickle.dumps(z.dumps()).

+8  A: 

Try using a later version of the pickle protocol via the protocol parameter to pickle.dumps(). The default is 0, which is an ASCII text format. Protocols 1 and 2 (and 3, but that's for py3k) are binary and should be more space-efficient; I suggest you use pickle.HIGHEST_PROTOCOL.
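
To see the effect of the protocol parameter, here is a minimal sketch. It assumes Python 3, where cPickle no longer exists and the plain pickle module is used instead. The name `payload` is a stand-in for the array's raw bytes, not something from the question, and the exact sizes will vary with the Python version.

```python
import pickle

# 1000 zero bytes, standing in for the raw buffer of the question's array.
payload = b"\x00" * 1000

# Measure the pickled size under every protocol this interpreter supports.
sizes = {
    protocol: len(pickle.dumps(payload, protocol=protocol))
    for protocol in range(pickle.HIGHEST_PROTOCOL + 1)
}

for protocol, size in sorted(sizes.items()):
    print("protocol %d: %d bytes" % (protocol, size))
```

Protocol 0 should come out several times larger than the binary protocols, because every zero byte is written as an escape sequence.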

Benjamin Peterson
+5  A: 

Solution:

import zlib, cPickle

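def zdumps(obj):
    # Pickle with the highest (binary) protocol, then compress at the maximum level.
    return zlib.compress(cPickle.dumps(obj, cPickle.HIGHEST_PROTOCOL), 9)

def zloads(zstr):
    # Reverse the steps: decompress first, then unpickle.
    return cPickle.loads(zlib.decompress(zstr))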

>>> len(zdumps(z))
128
vartec
A: 

An improvement on vartec's answer that seems a bit more memory-efficient (since it doesn't force everything into a string):

def pickle(fname, obj):
    import cPickle, gzip
    # "with" closes the file, so the gzip stream is fully flushed to disk.
    with gzip.open(fname, "wb", compresslevel=3) as f:
        cPickle.dump(obj, f, protocol=2)

def unpickle(fname):
    import cPickle, gzip
    with gzip.open(fname, "rb") as f:
        return cPickle.load(f)
gatoatigrado
-1 (1) Don't hard-code protocol numbers, use `-1` or `HIGHEST_PROTOCOL`. (2) Subsequent compression is an ADD-ON and is irrelevant to his question. (3) Specifying `compresslevel` when decompressing is pointless; any such information that may be necessary to decompress the file would be stored in the header of the compressed file -- otherwise how would you be able to decompress a file if you didn't know what compression level was used?
John Machin
(1) Then py2 code won't read py3 objects. (2) The header says "an improvement to vartec's answer", which was using compression. I think it used less memory, but that could have been a false impression... (3) Fixed.
gatoatigrado
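
Points (1) and (3) of the comments above can be sketched as follows. This assumes Python 3 (plain pickle rather than cPickle); `dump_gz` and `load_gz` are hypothetical names chosen for illustration. The same object is written at two compression levels and read back without the reader ever being told the level.

```python
import gzip
import os
import pickle
import tempfile


def dump_gz(fname, obj, compresslevel=3):
    # Stream the pickle into the gzip file; "with" closes and flushes it.
    with gzip.open(fname, "wb", compresslevel=compresslevel) as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_gz(fname):
    # No compression level is passed here: decompression does not need it.
    with gzip.open(fname, "rb") as f:
        return pickle.load(f)


data = {"zeros": b"\x00" * 1000, "label": "example"}
results = []
with tempfile.TemporaryDirectory() as tmpdir:
    for level in (1, 9):
        path = os.path.join(tmpdir, "level%d.pkl.gz" % level)
        dump_gz(path, data, compresslevel=level)
        results.append(load_gz(path))
```

As the reply notes, HIGHEST_PROTOCOL trades away readability by older interpreters; pinning a protocol number is the alternative if that matters.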
+2  A: 

z.dumps() is already a pickled string, i.e., it can be unpickled using pickle.loads():

>>> import numpy, pickle
>>> z = numpy.zeros(1000, numpy.uint8)
>>> s = z.dumps()
>>> a = pickle.loads(s)
>>> all(a == z)
True
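
Put differently, the cPickle.dumps(z.dumps()) call in the question pickles a pickle. A minimal sketch of that nesting, assuming Python 3 and using a plain list in place of the numpy array (`inner` and `outer` are illustrative names):

```python
import pickle

obj = list(range(10))

inner = pickle.dumps(obj, protocol=2)    # already a complete pickle
outer = pickle.dumps(inner, protocol=0)  # a pickle of that byte string

# One loads() recovers the object from the inner pickle ...
assert pickle.loads(inner) == obj
# ... but the outer pickle needs two, and is larger than the inner one.
assert pickle.loads(pickle.loads(outer)) == obj
print(len(inner), len(outer))
```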
J.F. Sebastian