tags:

views:

21

answers:

1

For this application I've mine I feel like I can get away with a 40 bit hash key, which seems awfully low, but see if you can confirm my reasoning (I want a small key because I want a small filename and the key will be converted to a filename):

(Note: only accidental collisions a concern - no security issues.)

A key point here is that the population in question is divided into groups, and a collision is only relevant if it occurs within the same group. A "group" is a directory on a user's system (the contents of files are hashed and a collision is only relevant if it occurs for files within the same directory). So with speculating roughly 100,000 potential users, say 2^17, that corresponds to 2^18 "groups" assuming 2 directories per user on average. So with a 40 bit key I can expect 2^(20+9) files created (among all users) before a collision occurs for some user somewhere. (Or IOW 2^((40+18)/2), due to the "birthday effect".) That's an average 4096 unique files created per user, for 2^17 users, before a single collision occurs for some user somewhere. And then that long again before another collision occurs somewhere (right?)

A: 

Your math looks reasonable, but I'm left wondering why you'd bother with this at all. If you want to create unique file names, why not just assign a number to each user, and keep a serial number for that user. When you need a file name, basically just concatentate the user number with the serial number (both padded to the correct number of digits). If you feel that you need to obfuscate those numbers, run that result though a 40-bit encryption (which will guarantee that a unique input produces a unique output).

If, for example, you assign 20 bits to each, you can have 220 users create 220 documents apiece before there's any chance of a collision at all.

If you don't mind serialized access to it, you could just use a single 40-bit counter instead. The advantage of this is that a single user wouldn't immediately use up 220 serial numbers, even though the average user is unlikely to ever create nearly that many documents.

Again, if you think you need to obfuscate this number for some reason, you can use a 40-bit encryption algorithm in counter mode (i.e. use a serial number, but encrypt it) which (again) guarantees that each input maps to a unique output. This guarantees no collision until/unless your users create 240 documents (i.e., the maximum possible with only 40 bits). Alternatively, you could create a 40-bit full-range linear feedback shift register to create your pseudo-random 40-bit numbers. This might be marginally less secure, but has the advantage of being faster and simpler to implement.

Jerry Coffin
The main concern is that if a new file is created (by the software) and its identical to an existing file, it overwrites the existing one rather than creating a new identical file with a different name. Using a hash to create the name gurantees that identical files are given the same name. Probably should have explained that. But if my math looks OK regarding collision probability, that's great. It looked maybe a little facile to me so I just wanted confirmation.
Mark