views:

487

answers:

3

From my experience I have learn that using an surrogate INT data type column as primary key esp. an IDENTITY key column offers better performance than using GUID or char/varchar data type column as primary key. I try to use IDENTITY key as primary key wherever possible. But recently I came across a schema where the tables were horizontally partitioned and were managed via a Partitioned view. So the tables could not have an IDENTITY column since that would make the Partitioned View non updatable. One work around for this was to create a dummy 'keygenerator' table with an identity column to generate IDs for primary key. But this would mean having a 'keygenerator' table for each of the Partitioned View. My next thought was to use float as a primary key. The reason is the following key algorithm that I devised

DECLARE @KEY FLOAT

SET @KEY = CONVERT(FLOAT,GETDATE())/100000.0 

SET @KEY = @EMP_ID + @KEY

Heres how it works.

CONVERT(FLOAT,GETDATE()) 

gives float representation of current datetime since internally all datetime are represented by SQL as a float value.

CONVERT(FLOAT,GETDATE())/100000.0 

converts the float representation into complete decimal value i.e. all digits are pushed to right side of ".".

@KEY = @EMP_ID + @KEY

adds the Employee ID which is an integer to this decimal value.

The logic is that the Employee ID is guaranteed to be unique across sessions since an employee cannot connect to an application more than once at the same time. And for the same employee each time a key will be generated the current datetime will be unique.

In all an unique key across all employee sessions and across time.

So for Emp Ids 11 and 12, I have key values like 12.40046693321566357, 11.40046693542361111

But my concern whether float data type as primary key offer benefits compared to choosing GUID or char/varchar as primary keys. Also important thing is because of partitioning the float column is going to be part of a composite key.

+1  A: 

I would not consider such an unorthodox key generation schema - this has the taste of a bad hack. Why don't you just use integers? There are many ways and algorithms to coordinate distributed key generation. From locking the whole table(s) and searching the next free id over preallocating id ranges per client to deriving it from client specific information (similar to your employee+time suggestion).

Daniel Brückner
Locking the whole table and searching the next free id - this method i am trying to avoid since locking a table with millions of rows and scanning it for the next available id can become a performance bottle neck Preallocating id ranges - I am trying to find out whether a method that avoids table scans can be derived this way
devanalyst
It would actually only have to lock the index. And searching a million+ index is not so bad. That is, after all, what they're for.
P Daddy
I just mentioned locking the table for completeness but really don't recommend it (in highly concurrent environments).
Daniel Brückner
A: 

Since you didn't mention an rdbms I'll asume SQL server. When creating a Primary Key, an clustered index on that key is also created. A table is sorted in the order of this key. When using Guids as a primary key (with a clustered index) each insert means a reordering of the table. This also holds true for your float representation. Other issues aside, if you wish to use this scheme, do not create a clustered index on this primary key.

edosoft
+1  A: 

Also important thing is because of partitioning the float column is going to be part of a composite key.

What? Why? You've gone through great pains in attempt to make this employee/time based value unique, what else would you need in the primary key? And on the other side of that question, are the other components of your key unique already? If so, why not just use them?

Your scheme leaves a bad taste in my mouth. I'm not quite sure why, though, because, the more I think about it, the more solid it seems.

  • At first I worried about performance. But a float is just 8 bytes (assuming your DBMS uses IEEE 754 double), which just isn't all that big. That's no worse than having a 64-bit integer as a key, or two 32-bit ints. Your key generation process is the only thing that might be slowed down, but even that not by much.
  • I then worried about uniqueness. This scheme doesn't guarantee that you won't generate the same key twice. But given your assertion that the combination of user and datetime will be unique, then this might actually work:
    • An IEEE 754 double has 53 bits of precision.
    • The datetime will use 42 bits. Assumptions:
      • Resolution of datetime is 1/300 second (3.33... ms). This is true for MS SQL Server, at least.
      • ceiling(log2(86400 * 300 * 100000)) = 42
    • This leaves 9 bits for your employee ID. If the employee ID is greater than 511, then you will lose part of the datetime, but it will be on the order of milliseconds. Your employee ID can reach 131071 before you will lose accuracy of more than a second.
  • I then worried about the difficulty in looking up a key value later. Given the 0.2 != 0.1 + 0.1 problem, concerns of floating-point equality always come to mind. But there's no reason you would be performing any calculations on this key value, and presumably it would be in IEEE 754 double format at any given time (be it in the table, in stored proc variables, or in variables in your executable), then it should never change and can be treated as a unique 64-bit value.

After considering all this, your scheme does appear relatively safe. Edoode's suggestion about not clustering the index is a good one, and with that in mind, as well as my caveats above about the size of your employee ID, you can use this scheme to generate primary keys just about as well as any other method.

I still question whether it's the best method, though, or if it's even necessary.

  • Can the other components of the composite key not be used by themselves (i.e., as a natural key)?

  • You could, as you suggest, keep a sequential key seed in another table. And you would need only one table, not one table per partition as you assume. You would simply need two columns in this table: one for the partition number, and one for the current identity value of that partition.

  • Using a GUID or varchar primary key isn't out of the question. Many people do this on many different tables. It won't kill your performance. And it might be more straight-forward, or at least more easily understood, than this scheme.

  • If your composite key already includes the employee ID, you could just add a datetime column to the key and call it a day. Or if not, you could add both columns. There's no reason you have to mash the two together.

HTH

P Daddy
I finally made up my mind to use composite primary key as suggested by PDaddy. Initially I thought using composite keys might degrade performance but in my case using a combination of Int and datetime fields does not make much difference. The composite key is non clustered and I have created a clustered index on the Int field
devanalyst
Forgot to mention that the clustered index is created on the Int field because its going to be used in joins
devanalyst