I have a table that will potentially see a high number of inserts per second, and I'm trying to choose the type of primary key to use. For illustrative purposes, let's say it's a users table. I'm trying to choose between GUID and BIGINT as the primary key and ultimately as the UserID across the app. If I use GUID, I save a trip to the database to generate a new ID, but a GUID is not "user-friendly" and it's not possible to partition the table by this ID (which I'm planning to do). Using BIGINT is much more convenient, but generating it is a problem - I can't use IDENTITY (there is a reason for that), so my only choice is a helper table that contains the last used ID, which I read and increment with this stored proc:

create proc GetNewID @ID BIGINT OUTPUT
as
begin
    -- single atomic statement: @ID captures the current value while id is incremented
    update HelperIDTable set @ID = id, id = id + 1
end

to get the new ID - a caller would use it as in the snippet below. But then this helper table is an obvious bottleneck, and I'm concerned with how many updates per second it can sustain.
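A minimal sketch of the call (assuming HelperIDTable exists and is seeded with a single row):

declare @NewID bigint
exec GetNewID @ID = @NewID output
-- @NewID now holds the pre-increment value and can be used for the insert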

I really like the idea of using BIGINT as the PK, but the bottleneck problem concerns me - is there a way to roughly estimate how many IDs it could produce per second? I realize it depends heavily on hardware, but are there any physical limitations, and what order of magnitude are we looking at? Hundreds per second? Thousands?

Any ideas on how to approach the problem are highly appreciated! This problem hasn't let me sleep for many nights now!

Thanks! Andrey

A: 

Try to hit your DB with a script, perhaps using JMeter to simulate concurrent hits; that way you can measure for yourself how much load you can handle. Your DB itself could also be the bottleneck - which one is it? I would prefer PostgreSQL for heavy load, as Yahoo and Skype do.
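For a crude single-session ballpark without extra tooling, you could also loop the proc from the question and time it (a sketch; JMeter or similar is still needed to measure concurrent load):

declare @i int, @ID bigint, @start datetime
set @i = 0
set @start = getdate()
while @i < 10000
begin
    exec GetNewID @ID = @ID output
    set @i = @i + 1
end
-- 10,000 IDs generated; the elapsed milliseconds give a rough IDs-per-second figure
select datediff(ms, @start, getdate()) as elapsed_ms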

+1  A: 

I try to use GUID PKs for all tables except small lookup tables. The GUID concept ensures that the identity of an object can safely be created in memory, without a roundtrip to the database, and the object saved later without changing that identity.

When you need a "human readable" ID, you can assign an auto-incremented int at save time. For partitioning, you could also create the BIGINTs later, for many users in one shot, via a scheduled database job; a sketch of this layout follows.
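A minimal sketch of that layout (table and column names are illustrative, not from the original post):

create table Users
(
    UserGuid uniqueidentifier not null primary key, -- identity can be created in memory before the first save
    UserNumber bigint null,                         -- human-readable ID, assigned later in batches by the scheduled job
    UserName nvarchar(100) not null
)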

Mischa
Actually, if you use GUIDs and the table is clustered on them, your performance will be horrible, as the records will be inserted at random spots in the clustered index and cause a lot of page splits. I recommend against doing this.
Strommy
How's your performance? GUIDs as the primary key (and thus, by default, the clustering key) are terrible for performance. Check out Kim Tripp's excellent blog posts on the topic: http://sqlskills.com/BLOGS/KIMBERLY/category/Indexes.aspx
marc_s
Yes, but you could create the PK non-clustered and set a clustered index on the BIGINT column.
Mischa
+2  A: 

GUIDs seem to be a natural choice - and if you really must, you could probably argue for using one as the PRIMARY KEY of the table - the single value that uniquely identifies the row in the database.

What I'd strongly recommend not to do is use the GUID column as the clustering key, which SQL Server does by default, unless you specifically tell it not to.

As Kimberly Tripp - the Queen of Indexing - and others have stated a great many times - a GUID as the clustering key isn't optimal, since due to its randomness, it will lead to massive page and index fragmentation and to generally bad performance.
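If you want to see the effect for yourself, SQL Server 2005 and up expose fragmentation through a DMF; a minimal sketch (the table name is a placeholder):

select index_id, avg_fragmentation_in_percent
from sys.dm_db_index_physical_stats(db_id(), object_id('Users'), null, null, 'LIMITED')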

Yes, I know - there's newsequentialid() in SQL Server 2005 and up - but even that is not truly and fully sequential and thus also suffers from the same problems as the GUID - just a bit less prominently so.
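As an aside, newsequentialid() can only be used in a DEFAULT constraint - you can't call it directly in a query. A minimal illustration (hypothetical names):

create table DemoGuids
(
    RowGuid uniqueidentifier not null default newsequentialid(),
    Payload nvarchar(100) not null
)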

Then there's another issue to consider: the clustering key on a table will be added to each and every entry on each and every non-clustered index on your table as well - thus you really want to make sure it's as small as possible. Typically, an INT, which allows for 2+ billion rows, should be sufficient for the vast majority of tables - and compared to a GUID as the clustering key, you can save yourself hundreds of megabytes of storage on disk and in server memory.

So to sum it up: unless you have a really good reason, I would always recommend an INT IDENTITY field as the primary / clustered key on your table, along the lines of the sketch below.
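A minimal sketch of that recommendation (table and column names are illustrative):

create table Users
(
    UserID int identity(1,1) not null,
    UserName nvarchar(100) not null,
    constraint PK_Users primary key clustered (UserID)
)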

Marc

marc_s
What do you mean when you say that newsequentialid() "is not truly and fully sequential"?
Kyralessa
+1 for Kim Tripp's post on GUIDs. It's a good thing to know about physical implementation...
Strommy
@kyralessa: it's sequential for a while - then there are jumps in the sequence, and then it might be sequential for a while again. I was expecting that if I insert 10'000 rows, those would have truly sequential new GUIDs, which wasn't the case.
marc_s
+1  A: 

Do you want a primary key, for business reasons, or a clustered key, for storage concerns? See stackoverflow.com/questions/1151625/int-vs-unique-identifier-for-id-field-in-database for a more elaborate post on the topic of PK vs. clustered key.

You really have to elaborate on why you can't use IDENTITY. Generating the IDs manually, and especially on the server with an extra roundtrip and an update just to generate each ID for the insert, won't scale. You'd be lucky to reach the low 100s per second. The problem is not just the roundtrip and update time, but primarily the interaction of the ID-generation update with insert batching: the insert batching transaction will serialize ID generation. The workaround is to separate the ID generation into its own session so it can autocommit, but then the insert batching is pointless because the ID generation is not batched: it has to wait for a log flush after each ID generated in order to commit. Compared to this, uuids will run circles around your manual ID generation. But uuids are a horrible choice for a clustered key because of fragmentation.

Remus Rusanu
A: 

An idea that requires serious testing: try creating (inserting) new rows in batches - say 1,000 (10,000? 1M?) at a time. You could have a master (aka bottleneck) table listing the next one to use, or you might have a query that does something like

 select min(id) from Users where name = ''  -- claim the lowest pre-created empty row (table name illustrative)

Generate a fresh batch of empty rows in the morning, every hour, or whenever you're down to a certain number of free ones. This only addresses the issue of generating new IDs, but if that's the main bottleneck it might help. A related variation, handing out blocks of IDs instead of pre-creating rows, is sketched below.
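A minimal sketch of such a block allocator, as a variation on the proc from the question (HelperIDTable as above; @BlockSize is a hypothetical parameter):

create proc GetIDBlock @BlockSize int, @FirstID bigint output
as
begin
    -- one roundtrip reserves the whole range @FirstID .. @FirstID + @BlockSize - 1
    update HelperIDTable set @FirstID = id, id = id + @BlockSize
end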

A table partitioning option: Assuming a bigint ID column, how are you defining the partition? If you are allowing for 1G rows per day, you could set up the new partition in the evening (day 1 = 1,000,000,000 through 1,999,999,999; day 2 = 2,000,000,000 through 2,999,999,999; etc.) and then swap it in when it's ready. You are, of course, limited to 1,000 partitions, so with bigints you'll run out of partitions before you run out of IDs. A sketch of such a setup follows.
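A minimal sketch of that partitioning setup (boundary values and names are illustrative; everything on one filegroup for simplicity):

create partition function pfUsersByID (bigint)
as range left for values (1999999999, 2999999999, 3999999999)

create partition scheme psUsersByID
as partition pfUsersByID all to ([PRIMARY])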

Philip Kelley
Mostly a thought exercise here, probably not that much help.
Philip Kelley