We have some code in which we need to maintain our own identity (PK) column in SQL. We have a table into which we bulk insert data, but we add data to related tables before the bulk insert is done, so we cannot use an IDENTITY column and find out the value up front.

The current code selects the MAX value of the field and increments it by 1. Although there is a highly unlikely chance that two instances of our application will be running at the same time, it is still not thread-safe (not to mention that it goes to the database every time).

I am using the ADO.NET entity model. How would I go about 'reserving' a range of IDs to use, grabbing a new block when that range runs out, while guaranteeing that the same range will never be handed out twice?
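For illustration, the kind of block reservation being asked about can be sketched in Python (all names here are hypothetical; the real implementation would make one atomic round trip to the database per block):

```python
import itertools
import threading

class BlockAllocator:
    """Hands out IDs from a reserved block; fetches a new block when exhausted."""

    def __init__(self, reserve_block, block_size=100):
        self._reserve_block = reserve_block  # returns the first ID of a fresh block
        self._block_size = block_size
        self._next = 0
        self._end = 0                        # exclusive upper bound of current block
        self._lock = threading.Lock()

    def next_id(self):
        with self._lock:
            if self._next >= self._end:      # current block used up
                start = self._reserve_block(self._block_size)
                self._next, self._end = start, start + self._block_size
            nid = self._next
            self._next += 1
            return nid

# Simulated "database" that atomically reserves contiguous blocks of IDs.
_counter = itertools.count(1)
def fake_reserve(size):
    # In reality this would be a single atomic round trip to the server.
    first = next(_counter)
    for _ in range(size - 1):
        next(_counter)
    return first

alloc = BlockAllocator(fake_reserve, block_size=3)
ids = [alloc.next_id() for _ in range(5)]
print(ids)  # [1, 2, 3, 4, 5]
```

Only one database call happens per block, and two allocators backed by the same atomic reservation can never receive overlapping ranges.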

+2  A: 

If it's viable to change the structure of the table, then perhaps use a uniqueidentifier for the PK instead along with newid() [SQL] or Guid.NewGuid() [C#] in your row generation code.

From the Guid.NewGuid() documentation:

There is a very low probability that the value of the new Guid is all zeroes or equal to any other Guid.
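The same idea outside of C#/SQL, as a sketch using Python's standard uuid module: keys can be minted entirely on the client, with no coordination between clients.

```python
import uuid

# Client-side key generation: each client can mint keys independently,
# and collisions between version-4 UUIDs are astronomically unlikely.
row_id = uuid.uuid4()
print(row_id)
```

Because the key is known before any round trip, child rows can be built in memory with their foreign keys already set.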

Reddog
I'm no DBA, but it seems like since this is the PK and would have a clustered index, it would be an awful way of maintaining identity as it would guarantee page splits.
esac
Perhaps see the conclusions in http://www.sql-server-performance.com/articles/per/guid_performance_p2.aspx. Your statement about page splits is merely that they are increased by the fact that your key is larger in size. But you should also note that the natural ordering of your inserted rows will no longer be preserved, which is not necessarily a bad thing. Also, per the comments above on your question, you should more likely fix your inserts so that dependent rows are inserted second, preserving foreign key integrity at all times.
Reddog
This is IMHO, the only real way of resolving this issue. It has the added benefit that you can assign keys in code prior to sending the data to the server for bulk insert.
Chris Lively
A: 

Two clients could reserve the same block of IDs.

There is no solution short of serializing your inserts by locking.

See Locking Hints in MSDN.
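A minimal sketch of the point, using an in-memory stand-in for the table: if two clients both read MAX before either inserts, they compute the same ID. Making the whole read-increment-insert a single critical section (the role a locking hint plays in the database) removes the race:

```python
import threading

table = []                   # stand-in for the real table
lock = threading.Lock()      # stand-in for a database lock via locking hints

def insert_row_serialized():
    with lock:               # entire read-increment-insert is atomic
        new_id = (max(table) if table else 0) + 1   # SELECT MAX(id) + 1
        table.append(new_id)                        # INSERT
    return new_id

threads = [threading.Thread(target=insert_row_serialized) for _ in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert len(set(table)) == 50   # no duplicate IDs once inserts are serialized
```

Without the `with lock:` block, concurrent threads would intermittently append duplicate IDs, which is exactly the race in the MAX+1 approach.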

Bill Karwin
+1  A: 
  • Use a more universal unique identifier data type such as UNIQUEIDENTIFIER (UUID) instead of INTEGER. In this case you can simply create it on the client side, pass it to SQL, and not have to worry about it. The disadvantage is, of course, the size of this field.
  • Create a simple table in the database, CREATE TABLE ID_GEN (ID INTEGER IDENTITY), and use it as a factory to give you identifiers. Ideally you would create a stored procedure (or function) to which you pass the number of identifiers you need. The stored procedure would then insert that number of (empty) rows into the ID_GEN table and return all the new IDs, which you can use in your code. Obviously, your original tables would no longer have the IDENTITY.
  • Create your own variation of the ID_GEN factory above.
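The factory idea can be sketched as follows (a Python stand-in for the stored procedure; the real version would INSERT n rows into ID_GEN and return the generated IDENTITY values):

```python
class IdGenTable:
    """Stand-in for the ID_GEN table: each inserted row gets the next IDENTITY value."""

    def __init__(self):
        self._rows = 0

    def reserve(self, n):
        # The stored procedure inserts n empty rows and returns their new IDs.
        start = self._rows + 1
        self._rows += n
        return list(range(start, start + n))

factory = IdGenTable()
batch1 = factory.reserve(3)   # IDs handed to one client
batch2 = factory.reserve(3)   # IDs handed to another client
assert not set(batch1) & set(batch2)   # ranges never overlap
```

Because the IDENTITY column in ID_GEN is the single source of new values, two clients can never be handed the same range, no matter how their requests interleave.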

I would choose simplicity (UUID) if you are not constrained otherwise.

van
The factory method is what I ended up using. It has the advantage of keeping the key size smaller than a guid, and ordered.
esac
A: 

If you have a lot of child tables, you might not want to change the PK. Plus, the integer fields are likely to perform better in joins. But you could still add a GUID field and populate it in the bulk insert with pre-generated values. Then you could leave the identity insert alone (it is almost always a bad idea to turn it off) and use the GUID values you pre-generated to get back the identity values you just inserted, for the insert into the child tables.

If you use a regular set-based insert (one with a SELECT clause instead of a VALUES clause) rather than a bulk insert, you can use the OUTPUT clause to get the identities back for the inserted rows, if you are using SQL Server 2008.

HLGEM
A: 

The Hi/Lo algorithm may be of interest to you:

http://stackoverflow.com/questions/282099/whats-the-hi-lo-algorithm
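As a rough sketch of Hi/Lo: the database hands out a 'hi' value once per round trip, and the client combines it with a local 'lo' counter, so each round trip reserves max_lo keys (names here are illustrative, not from any particular ORM):

```python
import itertools

class HiLoGenerator:
    """Hi/Lo: id = hi * max_lo + lo, with one database round trip per 'hi'."""

    def __init__(self, next_hi, max_lo=10):
        self._next_hi = next_hi   # callable standing in for the DB's hi-sequence
        self._max_lo = max_lo
        self._hi = None
        self._lo = max_lo         # force a fetch on first use

    def next_id(self):
        if self._lo >= self._max_lo:
            self._hi = self._next_hi()   # one round trip reserves max_lo keys
            self._lo = 0
        self._lo += 1
        return self._hi * self._max_lo + self._lo

hi_sequence = itertools.count(0)
gen = HiLoGenerator(lambda: next(hi_sequence), max_lo=10)
ids = [gen.next_id() for _ in range(12)]
print(ids)  # [1, 2, ..., 10, 11, 12]
```

Since every client receives a distinct 'hi', the arithmetic guarantees that ranges from different clients never collide, which is exactly the "reserve a block" behavior asked about in the question.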

Bryan Watts
A: 

The most general solution is to generate client-side identifiers that never overlap with the database's identifiers (usually negative values are used), and then to replace them with the identifiers generated by the database upon insert.

This approach is safe in applications where many users insert data simultaneously. Apart from GUIDs, other approaches are not multiuser-safe.

But if you have the rare case where an entity's primary key must be known before the entity is saved to the database, and it is impossible to use a GUID, you may use an identifier generation algorithm that prevents identifiers from overlapping. The simplest is to assign a unique identifier prefix to each connected client and prepend it to each identifier that client generates.
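A sketch of the prefix idea, assuming each client is assigned a small unique number at connection time (the bit widths below are illustrative): the prefix occupies the high bits of the key, so two clients can never generate the same value.

```python
CLIENT_BITS = 16   # assumption: at most 2**16 concurrently connected clients

def make_id(client_no, local_counter):
    """Combine a per-client prefix with a local counter into one 64-bit key."""
    return (client_no << (64 - CLIENT_BITS)) | local_counter

a = make_id(client_no=1, local_counter=42)
b = make_id(client_no=2, local_counter=42)
assert a != b   # same local counter, different clients -> different keys
```

The trade-off is that keys are no longer dense or ordered by insert time, and the client-number assignment itself must be coordinated somewhere (e.g. a tiny table in the database).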

If you are using ADO.NET Entity Framework, you probably should not have to worry about identifier generation: EF generates identifiers by itself; just mark the primary key of the entity as IsDbGenerated=true.

Strictly speaking, Entity Framework, like other ORMs, does not require an identifier for objects that have not yet been saved to the database; an object reference is enough to operate correctly with new entities. The actual primary key value is required only when updating/deleting an entity, and when updating/deleting/inserting an entity that references the new entity, i.e. in cases where the actual primary key value is about to be written to the database. If an entity is new, it is impossible to save other entities that reference it until the new entity itself has been saved, and ORMs maintain a specific order of saving entities that takes the reference map into account.

STO
+1  A: 

Why are you using ADO.NET Entity Framework to do what sounds like ETL work? (See the critique of the ADO.NET Entity Framework, and of ORM in general, below. It is rant free.)

Why use ints at all? Using a uniqueidentifier would solve the "multiple instances of the application running" issue.

Using a uniqueidentifier as a column default will be slower than using an int IDENTITY... it takes more time to generate a guid than an int. A guid is also larger (16 bytes) than an int (4 bytes). Try this first and, if it results in acceptable performance, run with it.

If the delay introduced by generating a guid on each row insert is unacceptable, create guids in bulk (or on another server) and cache them in a table.

Sample TSQL code:

    CREATE TABLE testinsert (
        date_generated datetime NOT NULL DEFAULT GETDATE(),
        guid uniqueidentifier NOT NULL,
        TheValue nvarchar(255) NULL
    )
    GO

    CREATE TABLE guids (
        guid uniqueidentifier NOT NULL DEFAULT newid(),
        used bit NOT NULL DEFAULT 0,
        date_generated datetime NOT NULL DEFAULT GETDATE(),
        date_used datetime NULL
    )
    GO

    CREATE PROCEDURE GetGuid
        @guid uniqueidentifier OUTPUT
    AS
    BEGIN
        SET NOCOUNT ON
        DECLARE @return int = 0

        BEGIN TRY
            BEGIN TRANSACTION
            SELECT TOP 1 @guid = guid FROM guids WHERE used = 0

            IF @guid IS NOT NULL
                UPDATE guids
                SET used = 1, date_used = GETDATE()
                WHERE guid = @guid
            ELSE
            BEGIN
                SET @return = -1
                PRINT 'GetGuid Error: No Unused guids are available'
            END
            COMMIT TRANSACTION
        END TRY
        BEGIN CATCH
            SET @return = ERROR_NUMBER() -- some error occurred
            SET @guid = NULL
            PRINT 'GetGuid Error: ' + CAST(ERROR_NUMBER() as varchar)
                + CHAR(13) + CHAR(10) + ERROR_MESSAGE()
            ROLLBACK
        END CATCH

        RETURN @return
    END
    GO

    CREATE PROCEDURE InsertIntoTestInsert
        @TheValue nvarchar(255)
    AS
    BEGIN
        SET NOCOUNT ON
        DECLARE @return int = 0

        DECLARE @guid uniqueidentifier
        DECLARE @getguid_return int

        EXEC @getguid_return = GetGuid @guid OUTPUT

        IF @getguid_return = 0
        BEGIN
            INSERT INTO testinsert(guid, TheValue) VALUES (@guid, @TheValue)
        END
        ELSE
            SET @return = -1

        RETURN @return
    END
    GO

    -- generate the guids
    INSERT INTO guids(used) VALUES (0)
    INSERT INTO guids(used) VALUES (0)

    -- insert data through the stored proc
    EXEC InsertIntoTestInsert N'Foo 1'
    EXEC InsertIntoTestInsert N'Foo 2'
    EXEC InsertIntoTestInsert N'Foo 3' -- will fail, only two guids were created

    -- look at the inserted data
    SELECT * FROM testinsert

    -- look at the guids table
    SELECT * FROM guids

The fun question is... how do you map this to ADO.Net's Entity Framework?

This is a classic problem that started in the early days of ORM (Object Relational Mapping).

If you use relational-database best practices (never allow direct access to base tables, only allow data manipulation through views and stored procedures), then you add headcount (someone capable and willing to write not only the database schema, but also all the views and stored procedures that form the API) and introduce delay (the time to actually write this stuff) to the project.

So everyone cuts this and people write queries directly against a normalized database, which they don't understand... thus the need for ORM, in this case, the ADO.NET Entity Framework.

ORM scares the heck out of me. I've seen ORM tools generate horribly inefficient queries which bring otherwise performant database servers to their knees. What was gained in programmer productivity was lost in end-user waiting and DBA frustration.

Marc Jellinek
I agree about the scariness of ORM. Some of the existing code uses the ADO.NET Entity Model and has done exactly what you said: slowed the system to a crawl. But I have also written larger-scale applications using it, paid attention to profiling (Visual Studio + SQL Profiler), looked at the generated queries, and fixed them to be efficient.
esac