views: 349
answers: 5
There are many systems that depend on the uniqueness of some particular value. Anything that uses GUIDs comes to mind (e.g. the Windows registry or other databases), but so does anything that creates a hash from an object to identify it and thus needs this hash to be unique.

A hash table usually doesn't mind if two objects have the same hash, because the hash is only used to break the objects down into categories; on lookup, not all objects in the table, but only those in the same category (bucket), have to be compared for identity with the searched-for object.
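To make the bucket idea concrete, here's a minimal sketch of such a table in Python (a toy illustration, not any particular library's implementation): the hash only selects a bucket, and full equality comparison inside the bucket resolves any collisions.

    class TinyHashTable:
        def __init__(self, num_buckets=16):
            self.buckets = [[] for _ in range(num_buckets)]

        def put(self, key, value):
            bucket = self.buckets[hash(key) % len(self.buckets)]
            for i, (k, _) in enumerate(bucket):
                if k == key:              # full comparison, not just the hash
                    bucket[i] = (key, value)
                    return
            bucket.append((key, value))

        def get(self, key):
            bucket = self.buckets[hash(key) % len(self.buckets)]
            for k, v in bucket:
                if k == key:              # keys with colliding hashes differ here
                    return v
            raise KeyError(key)

    table = TinyHashTable()
    table.put("spam", 1)
    table.put("eggs", 2)
    print(table.get("eggs"))  # -> 2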

Other implementations, however, (seem to) depend on the uniqueness. My example (the one that led me to ask this) is Mercurial's revision IDs. An entry on the Mercurial mailing list correctly states:

The odds of the changeset hash colliding by accident in your first billion commits is basically zero. But we will notice if it happens. And you'll get to be famous as the guy who broke SHA1 by accident.

But even the tiniest probability doesn't mean impossible. Now, I don't want an explanation of why it's totally okay to rely on the uniqueness (this has been discussed here for example). This is very clear to me.

Rather, I'd like to know (maybe by means of examples from your own work):

  • Are there any best practices as to covering these improbable cases anyway?

  • Should they be ignored, because it's more likely that particularly strong solar winds lead to faulty hard disk reads?

  • Should they at least be tested for, if only to fail with a "I give up, you have done the impossible" message to the user?

  • Or should even these cases get handled gracefully?

For me, especially the following are interesting, although they are somewhat touchy-feely:

  • If you don't handle these cases, what do you do against gut feelings that don't listen to probabilities?

  • If you do handle them, how do you justify this work (to yourself and others), considering there are more probable cases you don't handle, like a supernova?

+1  A: 

The recent OpenSSL vulnerability would probably have been detected much earlier if the developers had included some testing code. Obviously it shouldn't attempt to run through all possible sources, but you would get a pretty good idea of the odds if it ran a million iterations without a warning. Knowledge is better than faith.
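A minimal sketch of such a test, assuming Python and its standard uuid module (the generator and iteration count are placeholders): it can't prove uniqueness, but any duplicate within a million draws would almost certainly mean the generator is broken.

    import uuid

    def check_for_collisions(iterations=1_000_000):
        # Generate IDs and fail loudly on any duplicate. A hit here points
        # at a broken generator or broken seeding, not bad luck.
        seen = set()
        for i in range(iterations):
            guid = uuid.uuid4()
            if guid in seen:
                raise AssertionError(f"collision after {i} iterations: {guid}")
            seen.add(guid)

    check_for_collisions()
    print("no collisions in a million iterations")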

l0b0
+6  A: 
  • If you do handle them, how do you justify this work (to yourself and others), considering there are more probable cases you don't handle, like a supernova?

The answer to that is that you aren't testing to spot a GUID collision occurring by chance. You're testing to spot a GUID collision occurring because of a bug in the GUID code, or because a precondition the GUID code relies on has been violated (or you've been tricked into violating it by some attacker), such as the V1 assumptions that MAC addresses are unique and that time moves forward. Either is considerably more likely than supernova-based bugs.
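As a sketch of what that kind of check might look like (hypothetical code, assuming Python's uuid module): treat any duplicate as evidence of a bug or a violated precondition and fail fast, rather than as a statistical fluke to absorb.

    import uuid

    class IdRegistry:
        def __init__(self):
            self._ids = set()

        def register(self):
            guid = uuid.uuid4()
            if guid in self._ids:
                # Far more likely a bug (bad RNG seeding, cloned VM state,
                # duplicated MAC in a V1 scheme) than a chance collision.
                raise RuntimeError(f"duplicate GUID {guid}: generator is broken")
            self._ids.add(guid)
            return guid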

However, not every client of the GUID code should be testing its correctness, especially in production code. That's what unit tests are supposed to do, so trade off the cost of missing a bug that your actual use would catch but the unit tests didn't, against the cost of second-guessing your libraries all the time.

Note also that GUIDs only work if everyone who is generating them co-operates. If your app generates the IDs on machines you control, then you might not need GUIDs anyway - a locally unique ID like an incrementing counter might do you fine. Obviously Mercurial can't use that, hence it uses hashes, but eventually SHA-1 will fall to an attack that generates collisions (or, even worse, pre-images), and they'll have to change.
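For the machines-you-control case, a locally unique counter can be as simple as this sketch (hypothetical; a real system would persist the counter, e.g. as a database sequence, so it survives restarts):

    import itertools
    import threading

    class LocalIdSource:
        """Locally unique IDs: a plain incrementing counter behind a lock."""

        def __init__(self):
            self._counter = itertools.count(1)
            self._lock = threading.Lock()

        def next_id(self):
            with self._lock:
                return next(self._counter)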

If your app generates non-hash "GUIDs" on machines you don't control, like clients, then forget about accidental collisions; worry about deliberate collisions by malicious clients trying to DoS your server. Protecting yourself against that will probably protect you against accidents anyway.

  • Or should even these cases get handled gracefully?

The answer to this is probably "no". If you could handle colliding GUIDs gracefully, like a hashtable does, then why bother with GUIDs at all? The whole point of an "identifier" is that if two things have the same ID, then they're the same. If you don't want to treat them as the same, but only to direct them into buckets initially, the way a hashtable does, then use a different scheme (like a hash).

Steve Jessop
+1 Interesting, I hadn't even considered a bug as a reason for a collision.
balpha
MAC addresses aren't always unique either; there have been cases where a batch of cheap knockoffs all shipped with the same MAC address.
Brad Gilbert
+1 In the vast majority of cases a collision on a 128-bit hash is far more likely to be a bug or an attack than an accident.
Aaron Maenpaa
@Brad: I've heard of hardware that lets you change the MAC address. In any case, you can configure virtual interfaces with any MAC. If you are running a bunch of virtual machines, all with the same interface setup for whatever sadistic reason, you are going to seed GUIDs with the same MAC address on different machines.
Stefano Borini
+4  A: 

[Graph: probability of a collision versus number of hashes generated, extended out to yotta-scale counts.]
Aaron Maenpaa
+1 for the nice graph and for helpfully extending it out to Yotta.
Steve B.
You left out "hella-" though. :)
Doug McClean
+1  A: 

Here, for example, is a case where MSFT checking for collisions in the GUID space caused a bug in SQL Server that had to be hotfixed in Windows 2000.

corprew
A: 

Good analysis, but I disagree somewhat with your conclusion. Four bits of a GUID are used for the version field and two bits for the variant field, so you really have only 122 random bits. With 2^122 possible values, the probability of GUID collisions still seems low until you consider a very large population of users (say 500,000,000 Facebook users, or 1 billion Google users) performing actions that create new GUIDs at a rate of 1-2 per second each.

Over a period of just 2 years, such a system would have only an 86% chance of no collisions; after 4 years you're already down to a 50% chance of no collisions. This example is deliberately exaggerated and might seem contrived, but it shows that the scale of large websites today (especially social portals) may be entering a regime where GUIDs are not as useful as everyone thinks. Additionally, if the web moves in the direction of large-scale real-time personal data collection and/or continuous streaming of biometric sensor data, this increased rate of GUID generation times some huge number of users shifts your graph even further to the left.

I like the elegance that GUIDs provide, but I predict that in the next decade, for new emerging applications, they will have outlived their utility.
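For anyone wanting to check estimates like the ones above, the standard birthday approximation P(no collision) ~ exp(-n^2 / 2N) is easy to evaluate; the population and rate below are assumptions for you to swap for your own numbers.

    import math

    def p_no_collision(n_generated, bits=122):
        # Birthday approximation: P(no collision) ~ exp(-n^2 / (2 * 2^bits))
        return math.exp(-n_generated ** 2 / (2.0 * 2 ** bits))

    users = 1_000_000_000            # assumed population
    guids_per_user_per_sec = 2       # assumed generation rate
    seconds = 2 * 365 * 24 * 3600    # two years
    n = users * guids_per_user_per_sec * seconds
    print(p_no_collision(n))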

brad2008
