If I have a table of a hundred users, normally I would just set up an auto-increment userID column as the primary key. But if we suddenly have a million or 5 million users, that becomes really difficult, because I would want to start becoming more distributed, in which case an auto-increment primary key would be useless: each node would be creating the same primary keys.

Is the solution to this to use natural primary keys? I am having a really hard time thinking of a natural primary key for this bunch of users. The problem is that they are all young people, so they do not have national insurance numbers or any other unique identifier I can think of. I could create a multi-column primary key, but there is still a chance, however minuscule, of duplicates occurring.

Does anyone know of a solution?

Thanks

+8  A: 

The standard solution here is to use a GUID. GUIDs won't perform as well in terms of indexing, though.

RedFilter
As you probably know, you can sacrifice some of the GUID's uniqueness by replacing half or a quarter of the GUID with a DateTime. I believe this is called a COMB GUID. The index performance gets pretty close to that of an int. That said, the GUID will consume more space in the pages and cause more page splits.
Thomas
When you hit 5 million users, won't you need every bit of performance you can get? You'll waste cache memory indexing long GUIDs on this table and on the many FKs to it.
KM
+11  A: 

I would say that, for the time being, you should keep an auto-increment for the user ID.

When you do have that sudden rush of millions of users, then you can think about changing it.

In other words, solve the problem when you have it: "Premature optimization is the root of all evil."

To answer the question: some auto-increment implementations allow you to set the seed, so you can get different, non-overlapping auto-increments on the different nodes. This avoids the problem while still allowing the use of an auto-increment.
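
A minimal sketch of that, assuming SQL Server (other databases expose the same idea through their own settings; table and column names here are just for illustration):

    -- Sketch: the same table on two nodes, each seeded into its own range
    -- so the generated values never collide.
    CREATE TABLE Users (UserID int IDENTITY(1, 1)        PRIMARY KEY, Name nvarchar(100));  -- node 1
    CREATE TABLE Users (UserID int IDENTITY(10000001, 1) PRIMARY KEY, Name nvarchar(100));  -- node 2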

Oded
While I'm as opposed to premature/unnecessary optimization as anyone, I'm **much more** opposed to changing primary keys on a table that's in use.
Adam Robinson
@Adam Robinson - I absolutely agree. However, one needs to also be realistic about certain problems coming up.
Oded
I agree with Adam. I might have voted Oded down if I thought Christopher was ever going to experience a problem with an identity field.
uncle brad
+1  A: 

Never use natural primary keys unless you want bad performance and the potential for bad data. There are very few natural keys that are not subject to change over time, especially names. If a natural key changes, then all related child records must also change. This is clearly bad.

You could use GUIDs. But 5 million is nothing in terms of data and likely would not require a change. We have over 10,000,000 different people in our system, and we only have a medium-sized database with no partitioning or need for GUIDs.

HLGEM
A: 

A GUID is an easy way out but...

How distributed does it need to be? If it is a limited number of databases, you can give each database a range of numbers to use. For example, the first database auto-generates numbers in the range 0 to 999,999 and the next uses 1,000,000 to 1,999,999. That way they can each generate a user ID without bumping into each other. If each database includes a unique number identifying it, then the ranges can be generated automatically from that number.

I don't think you can use an auto-increment column to do this, but a stored procedure could generate numbers in this manner.
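
A rough sketch of such a procedure, assuming SQL Server (the table and procedure names are made up for illustration):

    -- Sketch: each database owns a row describing its allotted range,
    -- and a procedure hands out the next number from it.
    CREATE TABLE IdRange
    (
        RangeStart int NOT NULL,
        RangeEnd   int NOT NULL,
        LastId     int NOT NULL   -- last number issued; initialise to RangeStart - 1
    );
    -- e.g. on the first database: INSERT INTO IdRange VALUES (0, 999999, -1);
    GO
    CREATE PROCEDURE GetNextUserId
        @NewId int OUTPUT
    AS
    BEGIN
        -- A single UPDATE claims the next number atomically.
        UPDATE IdRange
        SET    @NewId = LastId = LastId + 1;

        IF @NewId > (SELECT RangeEnd FROM IdRange)
            RAISERROR('Id range exhausted for this database', 16, 1);
    END

Each database gets its own IdRange row at setup time, so the same procedure works everywhere.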

Kevin Gale
+2  A: 

GUIDs are good, but are subject to collision (albeit rare).

This might be a nonstandard solution, but I'm gonna throw it out there:

You can use auto-incrementing numbers, but segregate the number space according to how you expect to distribute in the future.

So let's say you have 3 servers. Record the IDs as follows:

Server 1: 0 - 9,999,999
Server 2: 10,000,000 - 19,999,999
Server 3: 20,000,000 - 29,999,999

Even within the constraints of a 32-bit int, that should leave plenty of expansion space (could even use gaps of 100,000,000 if you're worried), and it essentially guarantees uniqueness across the system.
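
One way to make those boundaries explicit (a sketch, assuming SQL Server) is to seed each server's identity at the start of its range and add a CHECK constraint so a misconfigured server can never wander into a neighbour's block:

    -- Sketch for Server 2: identity seeded at the start of its range,
    -- with a CHECK constraint guarding the boundaries.
    CREATE TABLE Users
    (
        UserID int IDENTITY(10000000, 1) NOT NULL
               CONSTRAINT PK_Users PRIMARY KEY,
        Name   nvarchar(100) NOT NULL,
        CONSTRAINT CK_Users_IdRange CHECK (UserID BETWEEN 10000000 AND 19999999)
    );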

Jon Seigel
A: 

GUIDs are rubbish as keys when clustered. If non-clustered, you'll still need a clustered index on another column.

Use an integer key and, for each new node/site, do one of the following:

  • Increment in steps of 10 (see the sketch after this list). As you add nodes, just start at 2, 3, etc.
  • Use ranges, e.g. 1 -> 999,999, 1,000,000 -> 1,999,999, etc.
  • And don't forget negative values too. For example, you can use IDENTITY(-1, -1) for a 2nd node.

If you do have nodes/sites, then a second column with a SiteID will work too.
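
Sketches of the first and third options, assuming SQL Server (each statement runs on a different node):

    -- Option 1: interleaved identities, step of 10 per node.
    -- Node 1 generates 1, 11, 21, ...; node 2 generates 2, 12, 22, ...
    CREATE TABLE Users (UserID int IDENTITY(1, 10) PRIMARY KEY, Name nvarchar(100));   -- node 1
    CREATE TABLE Users (UserID int IDENTITY(2, 10) PRIMARY KEY, Name nvarchar(100));   -- node 2

    -- Option 3: split the int space in half; the 2nd node counts down from -1.
    CREATE TABLE Users (UserID int IDENTITY(-1, -1) PRIMARY KEY, Name nvarchar(100));  -- node 2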

gbn
Of course, the downvoter knows all about GUIDs being superior...?
gbn
+2  A: 

If you need millions of IDs and have many nodes, make the primary key a composite of:

NodeID  smallint or int  -- unique for each node (2 or 4 bytes)
UserID  int or bigint    -- auto-increment (4 or 8 bytes), repeats for each node

which is way better than a GUID (smaller, uses less memory, and will be faster)
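
As a sketch, assuming SQL Server (names are illustrative):

    -- Sketch: composite primary key of (NodeID, UserID).
    -- NodeID is fixed per node; UserID is a plain identity that repeats across nodes.
    CREATE TABLE Users
    (
        NodeID  smallint NOT NULL
                CONSTRAINT DF_Users_NodeID DEFAULT 1,  -- node 1 uses 1, node 2 uses 2, ...
        UserID  int IDENTITY(1, 1) NOT NULL,
        Name    nvarchar(100) NOT NULL,
        CONSTRAINT PK_Users PRIMARY KEY (NodeID, UserID)
    );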

KM
A: 

If you're using MSSQL, you can create the PK of your table as UNIQUEIDENTIFIER and set the Default Value or Binding to NEWID().
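
In DDL that looks roughly like this (a sketch only):

    CREATE TABLE Users
    (
        UserID uniqueidentifier NOT NULL
               CONSTRAINT DF_Users_UserID DEFAULT NEWID()
               CONSTRAINT PK_Users PRIMARY KEY,
        Name   nvarchar(100) NOT NULL
    );
    -- NEWSEQUENTIALID() is an alternative default that generates roughly
    -- ascending values, which fragment a clustered index less than NEWID().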

Todd Sprang
A: 

Dear Christopher McCann

I suggest you never use GUIDs. One reason is that I am currently having trouble with them: if you have millions of users, you may need a high degree of concurrency, and GUIDs will ruin your life on insert and delete. You will have an index on them, and by default it will be a clustered index, which means every insert and delete moves records physically. Moreover, GUIDs are not sequential, so there is essentially zero chance that each new insert lands at the top or bottom of a page. The overall insert and delete operations therefore become very costly, and if you remove the index, your selects become costly instead.

Especially if you have multiple tables with relations between them, do not use GUIDs as the primary key.

There are two solutions I would recommend:

  1. If you can make a composite key, that is perfect. For example, in banking software (branchId, transactionId) might become the primary key, where branchId identifies the node inserting the record and transactionId is an auto-number at that branch, so you get uniqueness all the way.

  2. If the above is not what you want to do, you can keep the GUID as a unique field but add an auto-increment number as the primary key. This helps reduce the overall cost: when a client (node) sends data via a (web service) RPC call, you insert the record into the server database, an auto-number is generated, and that auto-number can be used for future selects, deletes, and updates; the client never has to know about it (see the sketch below).
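
A rough sketch of the second option, assuming SQL Server (names are illustrative only):

    -- Sketch: narrow identity as the clustered primary key,
    -- GUID kept only as an alternate (unique) key that clients use.
    CREATE TABLE Users
    (
        UserID   int IDENTITY(1, 1) NOT NULL
                 CONSTRAINT PK_Users PRIMARY KEY CLUSTERED,
        UserGuid uniqueidentifier NOT NULL
                 CONSTRAINT DF_Users_UserGuid DEFAULT NEWID()
                 CONSTRAINT UQ_Users_UserGuid UNIQUE NONCLUSTERED,
        Name     nvarchar(100) NOT NULL
    );

    -- The client only ever sees UserGuid; the server resolves it to the
    -- internal UserID for later updates and deletes, e.g.
    -- SELECT UserID FROM Users WHERE UserGuid = @UserGuid;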

I understand that the second solution is a bit confusing and complex, but it is still better than using GUIDs as the PK. If solution 1 is applicable, go for it.

When I say cost, it is not only processing time but lock (wait) time as well, which is a complete waste: your quad-core server may end up doing half the work it could, and more locks mean more chances of deadlocks. So, my friend, never use GUIDs.

Regards Mubashar

Mubashar Ahmad