views:

79

answers:

4

I'm writing a mapping class which persists to the disk. I am currently allowing only str keys but it would be nice if I could use a couple more types: hopefully up to anything that is hashable (ie. same requirements as the builtin dict), but more reasonable I would accept string, unicode, int, and tuples of these types.

To that end I would like to derive a deterministic serialization scheme.

Option 1 - Pickling the key

The first thought I had was to use the pickle (or cPickle) module to serialize the key, but I noticed that the output from pickle and cPickle do not match each other:

>>> import pickle
>>> import cPickle
>>> def dumps(x):
...     print repr(pickle.dumps(x))
...     print repr(cPickle.dumps(x))
... 
>>> dumps(1)
'I1\n.'
'I1\n.'
>>> dumps('hello')
"S'hello'\np0\n."
"S'hello'\np1\n."
>>> dumps((1, 2, 'hello'))
"(I1\nI2\nS'hello'\np0\ntp1\n."
"(I1\nI2\nS'hello'\np1\ntp2\n."

Is there any implementation/protocol combination of pickle which is deterministic for some set of types (e.g. can only use cPickle with protocol 0)?

Option 2 - Repr and ast.literal_eval

Another option is to use repr to dump and ast.literal_eval to load. I have written a function to determine if a given key would survive this process (it is rather conservative on the types it allows):

def is_reprable_key(key):
    return type(key) in (int, str, unicode) or (type(key) == tuple and all(
        is_reprable_key(x) for x in key))

The question for this method is if repr itself is deterministic for the types that I have allowed here. I believe this would not survive the 2/3 version barrier due to the change in str/unicode literals. This also would not work for integers where 2**32 - 1 < x < 2**64 jumping between 32 and 64 bit platforms. Are there any other conditions (ie. do strings serialize differently under different conditions in the same interpreter)? Edit: I'm just trying to understand the conditions that this breaks down, not necessarily overcome them.

Option 3: Custom repr

Another option which is likely overkill is to write my own repr which flattens out the things of repr which I know (or suspect may be) a problem. I just wrote an example here: http://gist.github.com/423945

(If this all fails miserably then I can store the hash of the key along with the pickle of both the key and value, then iterate across rows that have a matching hash looking for one that unpickles to the expected key, but that really does complicate a few other things and I would rather not do it. Edit: it turns out that the builtin hash is not deterministic across platforms either. Scratch that.)

Any insights?

A: 

You mention a few requirements in the paragraph, and I think you might want to be a little more clear on these. So far I gather:

  • You're building an SQLite backend to basically a dictionary.
  • You want to allow the keys to be more than basestring type (which types).
  • You want it to survive the Python 2 -> Python 3 barrier.
  • You want to support large integers above 2**32 as the key.
  • Ability to store infinite values (because you don't want hash collisions).

So, are you trying to build a general 'this can do it all' solution, or just trying to solve an immediate problem to continue on within a current project? You should spend a little more time to come up with a clear set of requirements.

Using a hash seemed like the best solution to me, but then you complain that you're going to have multiple rows with the same hash implying you're going to be storing enough values to even worry about the hash.

PKKid
Ideally I would to be able to use any value which is an acceptable key for a builtin dict, but I would be happy with string, unicode, int and tuple.
Mike Boers
+1  A: 

"Any value which is an acceptable key for a builtin dict" is not feasible: such values include arbitrary instances of classes that don't define __hash__ or comparisons, implicitly using their id for hashing and comparison purposes, and the ids won't be the same even across runs of the very same program (unless those runs are strictly identical in all respects, which is very tricky to arrange -- identical inputs, identical starting times, absolutely identical environment, etc, etc).

For strings, unicodes, ints, and tuples whose items are all of these kinds (including nested tuples), the marshal module could help (within a single version of Python: marshaling code can and does change across versions). E.g.:

>>> marshal.dumps(23)
'i\x17\x00\x00\x00'
>>> marshal.dumps('23')
't\x02\x00\x00\x0023'
>>> marshal.dumps(u'23')
'u\x02\x00\x00\x0023'
>>> marshal.dumps((23,))
'(\x01\x00\x00\x00i\x17\x00\x00\x00'

This is Python 2; Python 3 would be similar (except that all the representation of these byte strings would have a leading b, but that's a cosmetic issue, and of course u'23' becomes invalid syntax and '23' becomes a Unicode string). You can see the general idea: a leading byte represents the type, such as u for Unicode strings, i for integers, ( for tuples; then for containers comes (as a little-endian integer) the number of items followed by the items themselves, and integers are serialized into a little-endian form. marshal is designed to be portable across platforms (for a given version; not across versions) because it's used as the underlying serializations in compiled bytecode files (.pyc or .pyo).

Alex Martelli
This could work. Just for completeness, assuming we want to treat ints > 2^32 on a 64bit machine the same as longs then we just need to cast those to longs before marshaling to assure it comes out deterministic across those machines.
Mike Boers
A: 

After reading through much of the source (of CPython 2.6.5) for the implementation of repr for the basic types I have concluded (with reasonable confidence) that repr of these types is, in fact, deterministic. But, frankly, this was expected.

I believe that the repr method is susceptible to nearly all of the same cases under which the marshal method would break down (longs > 2**32 can never be an int on a 32bit machine, not guaranteed to not change between versions or interpreters, etc.).

My solution for the time being has been to use the repr method and write a comprehensive test suite to make sure that repr returns the same values on the various platforms I am using.

In the long run the custom repr function would flatten out all platform/implementation differences, but is certainly overkill for the project at hand. I may do this in the future, however.

Mike Boers
A: 

Important note: repr() is not deterministic if a dictionary or set type is embedded in the object you are trying to serialize. The keys could be printed in any order.

For example print repr({'a':1, 'b':2}) might print out as {'a':1, 'b':2} or {'b':2, 'a':1}, depending on how Python decides to manage the keys in the dictionary.

Lee Bush
This is very true. However, since mutable objects are not usable as keys we don't need to worry about dicts. Frozen sets, on the other hand...
Mike Boers