views: 608

answers: 6

I'm considering an optimisation in a particularly heavy part of my code. Its task is to insert statistical data into a table. This data is read a fair amount by other programs; otherwise I would consider using SQL bulk inserts etc.

So my question is...

Is it OK to try to insert some data, knowing that it might (not too often) throw a SqlException for a duplicate row?

Is the performance hit of an exception worse than checking each row prior to insertion?

A: 

Yes, I would always check; for example, for an email address or IP address.

If you're worried about performance, use a stored procedure to handle the logic when there is a duplicate.

Elijah Glover
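
A minimal sketch of what such a stored procedure might look like, assuming SQL Server and a hypothetical dbo.Stats table keyed on (StatDate, SourceId); the names and types are illustrative only:

    CREATE PROCEDURE dbo.InsertStatIfNew
        @StatDate date,
        @SourceId int,
        @Value    decimal(18, 4)
    AS
    BEGIN
        SET NOCOUNT ON;

        -- Insert only when this key is not already present.
        IF NOT EXISTS (SELECT 1 FROM dbo.Stats
                       WHERE StatDate = @StatDate AND SourceId = @SourceId)
            INSERT INTO dbo.Stats (StatDate, SourceId, Value)
            VALUES (@StatDate, @SourceId, @Value);
    END;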
+1  A: 

I think it is better to use a stored procedure, and use an IF:

IF (SELECT COUNT(*) FROM X WHERE Y = Z) = 0 INSERT INTO X (Y) VALUES ('XX') ....

and you can add an ELSE condition...

NOT EXISTS is (should be) always faster than COUNT.
Hosam Aly
Yes, not only will an EXISTS test be quicker than COUNT(*); on a large table COUNT(*) will be slow even if the column is indexed
Kristen
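
Following the comments, the same kind of check can be written with NOT EXISTS as a single statement; this sketch reuses the hypothetical dbo.Stats table and parameters from the sketch further up:

    -- Same idea as the IF above, but as one statement. EXISTS stops at the
    -- first matching row, whereas COUNT(*) has to count them all.
    INSERT INTO dbo.Stats (StatDate, SourceId, Value)
    SELECT @StatDate, @SourceId, @Value
    WHERE NOT EXISTS (SELECT 1 FROM dbo.Stats
                      WHERE StatDate = @StatDate AND SourceId = @SourceId);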
+4  A: 

First, my advice is to err on the side of correctness, not speed. When you finish your project and profiling shows that you lose significant time checking that rows exist before inserting them, only then optimize it.

Second, I think every RDBMS has syntax for inserting and skipping rows that would be duplicates, so this shouldn't be a problem in the first place. I try to avoid exceptions as part of the normal application flow and leave them for truly exceptional cases. That is, don't count on exceptions in the DB to work around logic in your code. Maintain as much consistency on your end (in code), and let DB exceptions indicate only true bugs.

Assaf Lavie
Some DBMS platforms have a MERGE command that does this, but not all do.
ConcernedOfTunbridgeWells
Thanks for the EXISTS tip
Harry
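
As one of the comments notes, some platforms have a MERGE statement for exactly this kind of insert-or-skip. A minimal sketch, assuming SQL Server 2008+ and the hypothetical dbo.Stats table used above:

    MERGE dbo.Stats AS target
    USING (SELECT @StatDate AS StatDate, @SourceId AS SourceId, @Value AS Value) AS source
        ON  target.StatDate = source.StatDate
        AND target.SourceId = source.SourceId
    WHEN NOT MATCHED THEN
        INSERT (StatDate, SourceId, Value)
        VALUES (source.StatDate, source.SourceId, source.Value);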
+5  A: 

At a purely performance level, it is faster to just insert the data and handle the error, particularly if the error is infrequent. Running a SELECT query to check for a duplicate, processing that result, and then inserting only if no duplicate was found will be considerably slower than inserting and handling an occasional error. If it's raising an exception, that will be a bit slower, since exception handling is slow in most languages, but handling an exception will still be many times faster than an extra SQL query in any language.
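
If the handling is done on the database side rather than in application code, a sketch of this "just insert and handle the error" approach might look like the following. It assumes SQL Server, where duplicate keys raise error 2627 (primary key / unique constraint) or 2601 (unique index); the table, columns, and parameters are hypothetical, and THROW requires SQL Server 2012+:

    BEGIN TRY
        INSERT INTO dbo.Stats (StatDate, SourceId, Value)
        VALUES (@StatDate, @SourceId, @Value);
    END TRY
    BEGIN CATCH
        -- Ignore duplicate-key errors only; re-raise anything else.
        IF ERROR_NUMBER() NOT IN (2601, 2627)
            THROW;
    END CATCH;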

As Assaf said, there is also usually a way to explicitly handle duplicates so you might be able to avoid the error altogether. This would increase performance further and allow you to be explicit that you're handling duplicates in a particular way.

Whether or not to use stored procedures is up to you - it can help with performance at the expense of putting more logic in your database. That's a decision you have to make. I've had bad experiences with this, but it depends on the RDBMS and the language you're using.

wuputah
A: 

This might not be very relevant to the question here.

But I worked on a project where, before dropping a table programmatically, the code used to check whether the table existed.

When we started to optimize the code, we changed it to handle the specific exception rather than checking for the existence of the table before dropping it.

And we saw a considerable time improvement, as we were using the existence check quite a few times in a process flow.

With this line of thought, I assume it is less expensive to handle the exception than to run another database query for the check.
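
In T-SQL terms, the two versions might look roughly like this; a sketch only, with a hypothetical table name, and using error 3701 (SQL Server's "cannot drop the table because it does not exist") and THROW (SQL Server 2012+):

    -- Version 1: check first, then drop.
    IF OBJECT_ID('dbo.WorkTable', 'U') IS NOT NULL
        DROP TABLE dbo.WorkTable;

    -- Version 2: just drop it and ignore the "does not exist" error.
    BEGIN TRY
        DROP TABLE dbo.WorkTable;
    END TRY
    BEGIN CATCH
        IF ERROR_NUMBER() <> 3701
            THROW;  -- re-raise anything other than "table does not exist"
    END CATCH;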

Biswanath
A: 

I don't see that a stored procedure will speed up a singleton insert unless there is other processing involved. There may be other benefits though, such as keeping all the logic in one place.

Personally I would INSERT and trap the error if there are duplicates etc.

If you do an IF NOT EXISTS check and then conditionally INSERT, there is an interval during which the transaction needs to hold a lock, which may increase blocking on the table.
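
To illustrate: for check-then-insert to be safe against another session inserting the same key between the check and the INSERT, the check typically has to take and hold a lock, for example with UPDLOCK/HOLDLOCK hints in SQL Server. A sketch, reusing the hypothetical names from above:

    -- The hints hold a lock on the checked key range until the transaction ends,
    -- closing the race window but increasing the chance of blocking on a busy table.
    INSERT INTO dbo.Stats (StatDate, SourceId, Value)
    SELECT @StatDate, @SourceId, @Value
    WHERE NOT EXISTS (SELECT 1 FROM dbo.Stats WITH (UPDLOCK, HOLDLOCK)
                      WHERE StatDate = @StatDate AND SourceId = @SourceId);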

For a busy insert table it is worth checking how fragmented the table becomes. If you are inserting with a clustered PK on an IDENTITY / auto-number column then all inserts are at one end (and the index can have a fill factor of 100%), but if the inserts are random throughout the clustered index then you may get issues from page splits and so on.
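
One way to check fragmentation on SQL Server 2005+ is the sys.dm_db_index_physical_stats DMV; a sketch, with the hypothetical table name used earlier:

    -- Fragmentation and page counts for all indexes on the table.
    SELECT index_id, avg_fragmentation_in_percent, page_count
    FROM   sys.dm_db_index_physical_stats(
               DB_ID(), OBJECT_ID('dbo.Stats'), NULL, NULL, 'LIMITED');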

A related issue is that the statistics on the table are likely to go out of date quickly, and that may affect the performance of queries with cached query plans etc.

Kristen