I see quite a few implementations of unique string generation for things like uploaded image names, session IDs, etc., and many of them use hashes like SHA1 or others.

I'm not questioning the legitimacy of using custom methods like this, but rather just the reason. If I want a unique string, I just say this:

>>> import uuid
>>> uuid.uuid4()
UUID('07033084-5cfd-4812-90a4-e4d24ffb6e3d')

And I'm done with it. I wasn't very trusting before I read up on uuid, so I did this:

>>> import uuid
>>> s = set()
>>> for i in range(5000000):  # That's 5 million!
...     s.add(uuid.uuid4())
...
>>> len(s)
5000000

Not a single repeat (I wouldn't expect one, since a uuid4 carries 122 random bits, which is roughly 5.3e+36 possible values, but it's comforting to see it in action). You could shrink the odds even further by building your string from two uuid4()s.
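For a rough sense of the odds in that experiment, here is a minimal sketch using the standard birthday approximation. It assumes the only randomness in a version 4 UUID is its 122 random bits; the function name `collision_probability` is made up for illustration.

```python
import math

def collision_probability(n_ids: int, random_bits: int = 122) -> float:
    """Approximate chance that at least two of n_ids random IDs collide."""
    space = 2 ** random_bits  # number of distinct values available
    # Birthday approximation: p ~= 1 - exp(-n * (n - 1) / (2 * N)).
    # math.expm1 keeps precision when the exponent is extremely small.
    return -math.expm1(-n_ids * (n_ids - 1) / (2 * space))

p = collision_probability(5_000_000)
print(f"Chance of any repeat among 5 million uuid4 values: {p:.3e}")
```

With these assumptions the result comes out on the order of 1e-24, so an empty-handed search for repeats among 5 million values is exactly what you should expect.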

So, with that said, why do people spend time on random() and other approaches for unique strings? Is there an important security issue, or some other problem, with uuid?

+1  A: 

One possible reason is that you want the unique string to be human-readable. UUIDs just aren't easy to read.
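As a rough illustration of what a more readable identifier might look like, here is a sketch that joins random words and a short number. The word lists and the name `readable_id` are hypothetical, and with lists this small the result is not unique on its own: you would still need to check it against existing IDs.

```python
import secrets

# Hypothetical word lists; a real one would be much larger to keep repeats rare.
ADJECTIVES = ["brave", "calm", "eager", "quiet", "swift"]
NOUNS = ["falcon", "maple", "otter", "river", "stone"]

def readable_id() -> str:
    """Build an ID such as 'calm-otter-0421' from random words and digits."""
    adjective = secrets.choice(ADJECTIVES)
    noun = secrets.choice(NOUNS)
    number = secrets.randbelow(10_000)  # 0..9999, zero-padded below
    return f"{adjective}-{noun}-{number:04d}"

print(readable_id())
```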

Jason Baker
+2  A: 

Well, sometimes you want collisions. If someone uploads the exact same image twice, maybe you'd prefer to tell them it's a duplicate rather than just make another copy with a new name.
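A minimal sketch of that idea: name the file after a hash of its content, so identical uploads produce identical names. It uses SHA-256 from `hashlib` rather than SHA1, and the names `content_name`, `save_upload`, and the in-memory `stored` dict are stand-ins for whatever storage you actually use.

```python
import hashlib

stored = {}  # stand-in for a database or file store, keyed by generated name

def content_name(data: bytes, extension: str = ".jpg") -> str:
    """Derive a file name from the bytes themselves, not from the upload name."""
    return hashlib.sha256(data).hexdigest() + extension

def save_upload(data: bytes):
    """Store the upload and report whether the same content was seen before."""
    name = content_name(data)
    is_duplicate = name in stored
    if not is_duplicate:
        stored[name] = data
    return name, is_duplicate

first = save_upload(b"fake image bytes")
second = save_upload(b"fake image bytes")
print(first[1], second[1])
```

Here the collision is the feature: the second upload maps to the same name, so you can tell the user it's a duplicate instead of storing another copy.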

Ben Voigt
@Ben, wouldn't you just save the image name as another field in the row, and use programming logic to overwrite the existing image, or say "oops" when they upload the same image again?
orokusaki
His point is still valid: sometimes you want collisions, and GUIDs don't offer them. Having said that, anyone who is using SHA-1 to find a *unique* string is probably doing something wrong, since its output is almost certainly less unique than its input.
ladenedge
@ladenedge I think the SHA1 is part of the equation just to make a more normalized value (in case there are spaces, etc).
orokusaki
@orokusaki: the image name is _generated_, according to the first line of the question. So how is that going to help you identify duplicates, unless it's a hash on the content?
Ben Voigt
@Ben Here's my DB row: `[image_name, image_filename, some_other_field, so_on_and_so_on]`. If I get a request to add a new image with an existing `image_name`, I just find the matching `image_filename` and replace that. Who would use the actual image file name for their system of record? I'm developing a multi-tenant architecture, so 5000 clients might have uploaded `logo.jpg`. I wouldn't rely just on having separate folders for each client, because if I later move my file system to some cool new S3-like system, I don't want to have to create new buckets for each client. That's a nightmare.
orokusaki
+1  A: 

UUIDs are long and meaningless (for instance, if you order by UUID, you get a meaningless result).

And because they're so long, I wouldn't want to put one in a URL or expose it to the user in any shape or form.
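If length is the main objection, one possible workaround is to re-encode the same 16 bytes more compactly. This is only a sketch: `short_id` and `from_short_id` are made-up helper names, and the result is still meaningless and unordered, just shorter (22 characters instead of 36).

```python
import base64
import uuid

def short_id(u: uuid.UUID) -> str:
    """Encode a UUID's 16 bytes as URL-safe base64 without the '==' padding."""
    return base64.urlsafe_b64encode(u.bytes).rstrip(b"=").decode("ascii")

def from_short_id(s: str) -> uuid.UUID:
    """Reverse short_id by restoring the padding that was stripped."""
    return uuid.UUID(bytes=base64.urlsafe_b64decode(s + "=="))

u = uuid.uuid4()
s = short_id(u)
print(len(str(u)), len(s))
```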

hasen j