Windows XP, Python 2.5:

hash('http://stackoverflow.com') Result: 1934711907

Google App Engine (http://shell.appspot.com/):

hash('http://stackoverflow.com') Result: -5768830964305142685

Why is that? How can I get a hash function that gives me the same results across different platforms (Windows, Linux, Mac)?

+6  A: 

Use the hashlib module.

SilentGhost
A: 

It probably just calls a function provided by the operating system rather than using its own algorithm.

As the other comments say, use hashlib or write your own hash function.

ewanm89
+20  A: 

As stated in the documentation, the built-in hash() function is not designed for producing hashes that are stored externally. It provides an object's hash value for use in dictionaries and similar structures. It is also implementation-specific (GAE uses a modified version of Python). Check out:

>>> class Foo:
...     pass
... 
>>> a = Foo()
>>> b = Foo()
>>> hash(a), hash(b)
(-1210747828, -1210747892)

As you can see, they are different, because hash() uses the object's __hash__ method rather than a 'normal' hashing algorithm such as SHA.

Given the above, the rational choice is to use the hashlib module.
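A minimal sketch of the hashlib approach, assuming the input is text that can be encoded as UTF-8. The helper name `stable_hash` and the choice of SHA-256 truncated to 64 bits are illustrative assumptions, not something the answer prescribes:

```python
import hashlib


def stable_hash(text, bits=64):
    """Return an unsigned integer derived from SHA-256 of `text`.

    Hypothetical helper: `bits` must be a multiple of 4 and at most 256.
    The result depends only on the bytes of the input, not on the
    platform or the Python build.
    """
    hex_digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    # Each hex digit carries 4 bits, so keep bits // 4 leading digits.
    return int(hex_digest[: bits // 4], 16)


print(stable_hash("http://stackoverflow.com"))
```

Truncating the digest shortens the value at the cost of a higher collision probability, so keep the full digest if that matters.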

Mike Hordecki
Thank you! I came here wondering why I always got different hash values for identical objects, which caused unexpected behaviour with dicts (which index by hash+type rather than checking for equality). A quick way to generate your own int hash from hashlib.md5 is `int(hashlib.md5(repr(self)).hexdigest(), 16)` (assuming that `self.__repr__` has been defined to be identical iff the objects are identical). If 32 hex digits are too long, you can of course cut the size down by slicing the hex string prior to conversion.
Alan
On second thought, if `__repr__` is unique enough, you could just use `str.__hash__` (i.e. `hash(repr(self))`) as dicts don't mix up non-equal objects with the same hash. This only works if the object is trivial enough that the repr can represent identity, obviously.
Alan
+2  A: 

At a guess, AppEngine is using a 64-bit implementation of Python (-5768830964305142685 won't fit in 32 bits) and your implementation of Python is 32 bits. You can't rely on object hashes being meaningfully comparable between different implementations.

George V. Reilly
A: 

The result is no surprise. In fact:

In [1]: -5768830964305142685L & 0xffffffff
Out[1]: 1934711907L

So if you want consistent results for ASCII strings, take the lower 32 bits as an unsigned integer. The hash function for strings is 32-bit-safe and almost portable.
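The masking step can be sketched as two small helpers. The names `low_32_unsigned` and `low_32_signed` are made up for this illustration, and the signed variant is an assumption about how a 32-bit build would report a value whose top bit is set:

```python
def low_32_unsigned(value):
    """Keep only the lower 32 bits, as an unsigned integer."""
    return value & 0xFFFFFFFF


def low_32_signed(value):
    """Reinterpret the lower 32 bits as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    if value >= 0x80000000:
        value -= 0x100000000
    return value


gae_hash = -5768830964305142685  # 64-bit result quoted in the question
print(low_32_unsigned(gae_hash))
print(low_32_signed(gae_hash))
```

Going by the session shown above, the first line should reproduce the Windows XP value from the question.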

On the other hand, you can't rely at all on the hash() of an object being invariant unless you have explicitly defined its __hash__ method to make it so.

For ASCII strings it works because the hash is computed from the individual characters of the string, as in the following:

class string:
    def __hash__(self):
        if not self:
            return 0 # empty
        value = ord(self[0]) << 7
        for char in self:
            value = c_mul(1000003, value) ^ ord(char)
        value = value ^ len(self)
        if value == -1:
            value = -2
        return value

where c_mul is "cyclic" multiplication, which wraps around on overflow as in C.
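A runnable sketch of the pseudocode above, under the assumption that it matches the unrandomized string hash of old Python 2 builds. The function name `py2_string_hash` and the `bits` parameter are hypothetical, added here to emulate 32-bit and 64-bit builds side by side:

```python
def c_mul(a, b, bits):
    """Multiply and wrap around at `bits` bits, like unsigned C arithmetic."""
    return (a * b) & ((1 << bits) - 1)


def py2_string_hash(text, bits=32):
    """Emulate the string hash sketched above for a given word size."""
    if not text:
        return 0  # empty string
    mask = (1 << bits) - 1
    value = (ord(text[0]) << 7) & mask
    for char in text:
        value = c_mul(1000003, value, bits) ^ ord(char)
    value ^= len(text)
    # Reinterpret the unsigned result as a signed integer of `bits` width.
    if value >= 1 << (bits - 1):
        value -= 1 << bits
    if value == -1:
        value = -2  # -1 is reserved as an error code in C
    return value


url = "http://stackoverflow.com"
print(py2_string_hash(url, 32))
print(py2_string_hash(url, 64))
```

Because every step is a wrapping multiplication or an XOR, the low 32 bits of the 64-bit result match the 32-bit result, which is why masking works for ASCII strings.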

saverio