views:

146

answers:

4

in python (3), what is the smallest value that hash(x) can return?

i want to use hashes to give a quick 'fingerprint' to database values (basically making it easy to see whether two longish, similar texts are actually equal or not), and want to get rid of negative numbers (for simplicity), so i thought i'd just add the smallest possible value to obtain values of zero and up. the manual is very helpfully stating "Hash values are integers." which is about as much as i knew before.

i was a bit surprised today when i found that my hand-compiled python on a 64bit ubuntu apparently uses 64 bits or so for its hashing function; i have always thought that should be 32bit. does machine architecture have an impact on the hash() function?

also, when i compiled python, i did not set any option to compile for a 64bit architecture (hoping it would "just work"). does python adjust that by itself or do i now have a 32bit python on a 64bit machine? not a silly question i believe as many times you are offered separate packages depending on the processer.

edit: i strongly suspect the answer will be closely related to sys.maxint which has been sadly removed from python 3. my suspicion is that i should def xhash( x ): return hash( x ) - ( -maxint - 1 ) if maxint was available. i know this value 'lost its value' due to the unification of ints and longs, but here might be one area where it could still prove useful. anybody have an idea how to implement an analogue?

+3  A: 

hash functions usually use the full range of the return value. The reason is that they usually are constructed with bit operations (shifting, xoring, etc) -- the bits in the return value are all used during the algorithm.

Why are positive values easier or harder than negative ones?

Lou Franco
this is only a typographic issue; i want nothing more than get rid of the minus sign. and, yes, negative numbers are harder than positive ones, which is why they got found/invented so much later in history. but my concern is more typographic.
flow
How about formatting them unsigned or in Hex?
Lou Franco
i do present them in hex, but of course they retain the minus sign then
flow
@Lou: just FYI, Python by convention never returns `-1` as a hash value (it signifies an error condition in the C code).
ΤΖΩΤΖΙΟΥ
+2  A: 

hash() can return any integer, and as you have seen, the size of the integer can vary with the architecture. This is one of the reasons dictionary ordering is arbitrary: the same set of operations on two different platforms can give different results because the hashes used along the way can differ.

If all you are doing is showing a hash for a quick fingerprint, then simply keep a subset of the bits. It's still valid as a hash. The only requirement of a hash function is that equal values must have equal hashes. After that, differences among hashes simply affect the efficiency of the algorithms using the hash, because the chances of collision go up or down.

So for example, you could decide you want an 8-digit hash, and get it by using:

hash(x) % 100000000

Or you could get an eight-character alphanumeric hash to display with:

md5(hash(x)).hexdigest()[:8]
Ned Batchelder
If `hash` can return negative values, then `hash(x) % 100000000` can be negative. I quickly looked for an operation `%%` that would satisfy the equation x = (x // y) * y + x %% y but I didn't find it. Does it exist in Python?
Pascal Cuoq
I think you'll find that the Python % operator does what you want. The 2.6 docs claim it satisfies x == (x/y)*y + (x%y). If you find that negative numbers are creeping in, though, you can use abs(hash(x)) % 10000000
Ned Batchelder
A: 

The answer to your question should be:

assert(hash(100) == 100 and hash(-100) == -100)
smallest_hash_value= -2**min(range(256), key=lambda i: hash(-2**i))

This depends on the fact that Python uses the integer itself as a hash (with the exception of -1) iff the integer is a valid hash() result. The algorithm normally should remain the same whatever the architecture.

ΤΖΩΤΖΙΟΥ
i do not seem to grasp what you intend to say with `-2**min(range(256), key=lambda i: hash(-2**i))`; care to explain?
flow
@flow: this evaluates to the smallest `hash()` result that could be produced by your installation of Python.
ΤΖΩΤΖΙΟΥ
oh wah, took me a little but now i get it. the `256` could really be any number greater than the greatest expected word size, right?
flow
@flow: exactly; I was just being over-cautious and wanted my answer's validity to last for at least five years :)
ΤΖΩΤΖΙΟΥ
+1  A: 

so today i was luckier at the google casino, and this is what i found:

(1) system architecture whether a given python is running on a 64 or a 32bit machine can be found by

from platform import architecture
print( architecture() )

from the documentation: "Queries the given executable (defaults to the Python interpreter binary) for various architecture information. Returns a tuple (bits, linkage) which contain information about the bit architecture and the linkage format used for the executable. Both values are returned as strings." on my machine, that's ('64bit', 'ELF'). bingo.

(2) smallest integer there is no sys.maxint in python 3 no more, but there is sys.maxsize. the docs say "An integer giving the maximum value a variable of type Py_ssize_t can take. It’s usually 2**31 - 1 on a 32-bit platform and 2**63 - 1 on a 64-bit platform." therefore,

from sys import maxsize
assert maxsize == 2**63 - 1

works on my machine.

(3) to directly answer the original question: "the smallest value of the hash() function should be minus whatever sys.maxsize reports. for this reason, it can be expected that

def xhash( x ): return hash( x ) + sys.maxsize + 1

will only ever report values ≥ 0."

flow