I want to use unique hashes for each model rather than ids.

I implemented the following function to use it across the board easily.

import hashlib
import random
from base64 import urlsafe_b64encode

def set_unique_random_value(model_object, field_name='hash_uuid', length=5, use_sha=True, urlencode=False):
    model_class = model_object.__class__
    while True:
        # Decimal digits of a pseudo-random float, without the leading '0.'
        uuid_number = str(random.random())[2:]
        # sha256 expects bytes, so encode the digit string first
        uuid = hashlib.sha256(uuid_number.encode('ascii')).hexdigest() if use_sha else uuid_number
        uuid = uuid[:length]
        if urlencode:
            uuid = urlsafe_b64encode(uuid.encode('ascii')).decode('ascii')[:-1]
        # Retry until no existing row already uses this value
        try:
            model_class.objects.get(**{field_name: uuid})
        except model_class.DoesNotExist:
            setattr(model_object, field_name, uuid)
            return

I'm seeking feedback: how else could I do it? How can I improve it? What is good, bad, and ugly about it?

A: 

Use your database engine's UUID support instead of making up your own hash. Almost everything beyond SQLite supports them, so there's little reason to not use them.

Ignacio Vazquez-Abrams
+5  A: 

I do not like this bit:

uuid = uuid[:5]

Even in the best case (the values are uniformly distributed), the probability of a collision exceeds 0.5 after only about 1,200 elements!

This follows from the birthday problem. In brief, the probability of a collision exceeds 0.5 once the number of elements is larger than roughly the square root of the number of possible labels.

Five hex digits give 16^5 = 0x100000, or about 10^6, distinct labels, so you will start seeing collisions after about 1,000 generated values.
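
The estimate can be checked numerically. The following is a minimal sketch, assuming uniformly distributed labels; the function names are hypothetical and not part of the question's code.

```python
def collision_probability(n_items, n_labels):
    """Exact chance that n_items uniform draws from n_labels contain a repeat."""
    p_all_distinct = 1.0
    for k in range(n_items):
        # The (k+1)-th draw must avoid the k labels already taken.
        p_all_distinct *= (n_labels - k) / n_labels
    return 1.0 - p_all_distinct


def items_for_probability(target, n_labels):
    """Smallest number of draws whose collision probability reaches target."""
    n_items = 0
    p_all_distinct = 1.0
    while 1.0 - p_all_distinct < target:
        p_all_distinct *= (n_labels - n_items) / n_labels
        n_items += 1
    return n_items


N_LABELS = 16 ** 5  # five hex digits, as in uuid[:5]
print(collision_probability(1000, N_LABELS))
print(items_for_probability(0.5, N_LABELS))
```

Changing N_LABELS to the label count for a longer prefix shows how quickly the safe range grows with each extra hex digit.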

Even if you raise length to -1 (keeping almost the whole string), you still have a problem here:

str(random.random())[2:]

You will start seeing collisions after about 3 * 10^6 values (the same calculation applies).

I think your best bet is to use a uuid, which is far more likely to be unique. Here is an example:

>>> import uuid
>>> uuid.uuid1().hex
'7e0e52d0386411df81ce001b631bdd31'
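
If the identifier should not be derived from the host address and a timestamp, the same module also offers random (version 4) UUIDs. A minimal sketch of that variant follows; the helper name new_hash_id is hypothetical.

```python
import uuid


def new_hash_id():
    """Return a 32-character lowercase hex string from a random (version 4) UUID."""
    return uuid.uuid4().hex


print(new_hash_id())
```

A callable like this could be passed as the default of a model field, so each new object gets its own value.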

Update: If you do not trust the math, just run the following sample to see the collision:

 >>> len(set(hashlib.sha256(str(i)).hexdigest()[:5] for i in range(0,2000)))
 1999 # it should obviously print 2000 if there wasn't any collision
Piotr Czapla
The birthday problem does apply to random number generation. However, Python's uuid package is not specifically about random number generation. In fact, uuid1() from your example is nowhere near random in the cryptographically secure sense. I am pointing this out in case someone gets the idea of equating Python's uuid package with random number generation.
prometheus
+2  A: 

The ugly:

import random

From the documentation:

This module implements pseudo-random number generators for various distributions.

If anything, please use os.urandom

Return a string of n random bytes suitable for cryptographic use.

This is how I use it in my models:

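import os
from binascii import hexlify

from django.db import models

def _createId():
    """Return 32 hex characters built from 16 random bytes."""
    # hexlify returns bytes on Python 3, so decode to text for the CharField
    return hexlify(os.urandom(16)).decode('ascii')

class Book(models.Model):
    """A book whose primary key is a random hex string."""
    id_book = models.CharField(max_length=32, primary_key=True, default=_createId)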
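
On Python 3.6 and later, the standard library's secrets module offers a shorter way to get the same kind of value. A minimal sketch, assuming a recent Python; the helper name create_id is hypothetical and is not the function used in the model above.

```python
import secrets


def create_id():
    """Return 32 hex characters built from 16 random bytes."""
    # token_hex(n) yields 2 * n hex digits as a text string.
    return secrets.token_hex(16)


print(create_id())
```

The result is already a str, so no hexlify or decode step is needed before storing it in a CharField with max_length=32.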
prometheus
One thing to note is that urandom is much slower than the pseudo-random generator, so if you don't need it for cryptographic reasons it may not be worth using. On my Mac OS X machine it is 21 times slower. Consider:
>>> timeit.Timer('import random; random.random()').timeit(100000)
0.1538231372833252
>>> timeit.Timer('import os; os.urandom(2)').timeit(100000)
3.1858959197998047
Piotr Czapla
I've just checked that uuid is even slower :)
Piotr Czapla
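
The timing comparison in the comments above can be reproduced with a sketch like the one below. It assumes that uuid.uuid1() is the uuid call that was measured, which the comment does not state, and the helper name time_call is hypothetical. Absolute numbers depend heavily on the machine and Python version.

```python
import timeit


def time_call(stmt, setup, number=20000):
    """Return the seconds taken to run stmt `number` times after setup."""
    return timeit.timeit(stmt, setup=setup, number=number)


timings = {
    'random.random()': time_call('random.random()', 'import random'),
    'os.urandom(2)': time_call('os.urandom(2)', 'import os'),
    'uuid.uuid1()': time_call('uuid.uuid1()', 'import uuid'),
}

# Print the fastest call first.
for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print('%-16s %.4f s' % (name, seconds))
```

Putting the import in the setup string keeps its cost out of the measured loop, unlike the one-liners quoted in the comment.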