views:

394

answers:

2

Hi Guys, I have a database where 2 (or maybe 3 or 4) different applications are inserting information. The new information has IDs of the type GUID/UUID, but each application is using a different algorithm to generate the IDs. For example, one is using the NHibernate's "guid.comb", other is using the SQLServer's NEWID(), other might want to use .NET's Guid.NewGuid() implementation.

Is there an above normal risk of ID collision or duplicates?

Thanks!

+1  A: 

Yes, the risk is above normal, because all of these use different definitions of "GUID." Guid.NewGuid() is an RFC-compliant mostly-random GUID, but NEWSEQUENTIALID is a reordered (and therefore non-RFC-compliant) GUID based on MAC address and timestamp, and NHibernate's comb GUID is completely different (based on randomness and timestamp).

You may want to consider just standardizing on one GUID implementation. I use my own type of combed GUID for all my apps. My blog has brief descriptions of all these types of GUIDs along with design decisions for my own.

Stephen Cleary
+5  A: 

The risk of collisions is elevated slightly but still vanishingly small. Consider that:

  • Both Comb and NEWID/NEWSEQUENTIALID include a timestamp with precision down to a few ms. Thus, unless you are generating a large number of IDs at the exact same moment time from all of these different sources, it is literally impossible for IDs to collide.

  • The part of the GUID that isn't based on the timestamp can be thought of as random; most GUID algorithms base these digits on a PRNG. Thus, the likelihood of a collision between these other 10 bytes or so is on the same order as if you used two separate random number generators and watched for collisions.

    Think about this for a moment - PRNGs can and do repeat numbers, so the likelihood of a collision between two of them isn't significantly higher than a collision using just one of them, even if they use slightly different algorithms. It's sort of like playing the same lottery numbers every week vs. picking a random set every week - the odds of winning are exactly the same either way.

Now, keep in mind that when you use an algorithm like Guid.Comb, you only have 10 bytes of uniqueifier, which equates to 1024 separate values. So if you're generating a huge number of GUIDs within the same few milliseconds, you will get collisions. But if you generate GUIDs at a fairly low frequency, it doesn't really matter how many different algorithms you use at the same time, the likelihood of a collision is still practically nonexistent.

The best way for you to be absolutely certain is to run a test; have all 2 or 3 (or however many you use) generating GUIDs, at the same time, at regular intervals, and write them out to a log file, and see if you get collisions (and if so, how many). That should give you a good idea of how safe this is in practice.

P.S. If you're using NHibernate's comb generator to generate GUIDs for a clustered primary key, consider using NEWSEQUENTIALID() instead of NEWID() - the whole point of Comb is to avoid page splits, and you're not accomplishing that if you have other processes using non-sequential algorithms. You should also change any code using Guid.NewGuid to use the same Comb generator - the actual Comb algorithm used in NHibernate is not complicated and easy to duplicate in your own domain logic.

† Note that there seems to be some dispute about NEWID, and whether or not it contains a timestamp. In any case, since it is based on the MAC address, the range of possible values is considerably smaller than a V4 GUID or a Comb. Further reason for me to recommend sticking to Comb GUIDs outside the database and NEWSEQUENTIALID inside the database.

Aaronaught
While I (mostly) agree with your conclusion, I must point out several errors. NEWID does not include a timestamp; and the timestamps from NEWSEQUENTIALID and Comb are stored in different bytes, so you *can* get collisions from "GUIDs" generated at different times. Also, GUID's that use timestamps (such as NEWSEQUENTIALID) do *not* fill in the rest with PRNG numbers; they use the MAC address. That's why I suggested standardizing on a single Guid generation algorithm.
Stephen Cleary
@Stephen: I can't prove or disprove that `NEWID` is timestamp-based, because documentation is scarce, but AFAIK it's based on V1 of the GUID algorithm which does use a timestamp. And the timestamp bytes for Comb and `NEWSEQUENTIALID` must be the same bytes, otherwise they wouldn't actually be sequential. (They use different sizes for the time stamp, yes, but the smaller size is 10 bytes and so the result will still be collision-free for insertion frequencies below 3.33 ms).
Aaronaught
Anyway, I've added a disclaimer; regardless of how `NEWID()` actually generates its ID, it's better to use `NEWSEQUENTIALID` on the server if you plan to use Combs on the client.
Aaronaught
`NEWID` is an RFC4122 V4 GUID (completely random except for 6 bits which make it RFC4122-compliant). `NEWSEQUENTIALID` is a V1 GUID but swaps many of its bytes to account for SQL Server's crazy ordering of GUIDs. While `NEWSEQUENTIALID` GUIDs are in fact sequential, Comb GUIDs are often *not*. They do not use the same bytes for their timestamps; `NEWSEQUENTIALID` GUIDs have timestamps in their first group, yet Comb GUIDs place it in their last group. See the links in [my blog post](http://nitoprograms.blogspot.com/2009/08/alternative-guids-for-mobile-devices.html) for the gory details.
Stephen Cleary