ansaurus

Question

Answer 1

A:

You mention a few requirements in the paragraph, and I think you might want to be a little more clear on these. So far I gather:

You're building an SQLite backend to basically a dictionary.
You want to allow the keys to be more than basestring type (which types).
You want it to survive the Python 2 -> Python 3 barrier.
You want to support large integers above 2**32 as the key.
Ability to store infinite values (because you don't want hash collisions).

So, are you trying to build a general 'this can do it all' solution, or just trying to solve an immediate problem to continue on within a current project? You should spend a little more time to come up with a clear set of requirements.

Using a hash seemed like the best solution to me, but then you complain that you're going to have multiple rows with the same hash implying you're going to be storing enough values to even worry about the hash.

PKKid 2010-06-03 15:22:09

Ideally I would to be able to use any value which is an acceptable key for a builtin dict, but I would be happy with string, unicode, int and tuple.

Mike Boers 2010-06-03 19:07:21

Answer 2

+1 A:

"Any value which is an acceptable key for a builtin dict" is not feasible: such values include arbitrary instances of classes that don't define __hash__ or comparisons, implicitly using their id for hashing and comparison purposes, and the ids won't be the same even across runs of the very same program (unless those runs are strictly identical in all respects, which is very tricky to arrange -- identical inputs, identical starting times, absolutely identical environment, etc, etc).

For strings, unicodes, ints, and tuples whose items are all of these kinds (including nested tuples), the marshal module could help (within a single version of Python: marshaling code can and does change across versions). E.g.:

>>> marshal.dumps(23)
'i\x17\x00\x00\x00'
>>> marshal.dumps('23')
't\x02\x00\x00\x0023'
>>> marshal.dumps(u'23')
'u\x02\x00\x00\x0023'
>>> marshal.dumps((23,))
'(\x01\x00\x00\x00i\x17\x00\x00\x00'

This is Python 2; Python 3 would be similar (except that all the representation of these byte strings would have a leading b, but that's a cosmetic issue, and of course u'23' becomes invalid syntax and '23' becomes a Unicode string). You can see the general idea: a leading byte represents the type, such as u for Unicode strings, i for integers, ( for tuples; then for containers comes (as a little-endian integer) the number of items followed by the items themselves, and integers are serialized into a little-endian form. marshal is designed to be portable across platforms (for a given version; not across versions) because it's used as the underlying serializations in compiled bytecode files (.pyc or .pyo).

Alex Martelli 2010-06-03 20:37:05

This could work. Just for completeness, assuming we want to treat ints > 2^32 on a 64bit machine the same as longs then we just need to cast those to longs before marshaling to assure it comes out deterministic across those machines.

Mike Boers 2010-06-03 21:52:16

Answer 3

A:

After reading through much of the source (of CPython 2.6.5) for the implementation of repr for the basic types I have concluded (with reasonable confidence) that repr of these types is, in fact, deterministic. But, frankly, this was expected.

I believe that the repr method is susceptible to nearly all of the same cases under which the marshal method would break down (longs > 2**32 can never be an int on a 32bit machine, not guaranteed to not change between versions or interpreters, etc.).

My solution for the time being has been to use the repr method and write a comprehensive test suite to make sure that repr returns the same values on the various platforms I am using.

In the long run the custom repr function would flatten out all platform/implementation differences, but is certainly overkill for the project at hand. I may do this in the future, however.

Mike Boers 2010-06-08 23:25:33

Answer 4

A:

Important note: repr() is not deterministic if a dictionary or set type is embedded in the object you are trying to serialize. The keys could be printed in any order.

For example print repr({'a':1, 'b':2}) might print out as {'a':1, 'b':2} or {'b':2, 'a':1}, depending on how Python decides to manage the keys in the dictionary.

Lee Bush 2010-09-13 16:02:28

This is very true. However, since mutable objects are not usable as keys we don't need to worry about dicts. Frozen sets, on the other hand...

Mike Boers 2010-09-14 11:39:58

ansaurus

tags:

views:

answers:

Deterministic key serialization

Option 1 - Pickling the key

Option 2 - Repr and ast.literal_eval

Option 3: Custom repr

related questions