When writing database queries in something like T-SQL or PL/SQL, we often have a choice between iterating over rows with a cursor to accomplish the task, or crafting a single SQL statement that does the same job all at once.

Also, we have the choice of simply pulling a large set of data back into our application and then processing it row by row, with C# or Java or PHP or whatever.

Why is it better to use set-based queries? What is the theory behind this choice? What is a good example of a cursor-based solution and its relational equivalent?

+8  A: 

The main reason that I'm aware of is that set-based operations can be optimised by the engine by running them across multiple threads. For example, think of a quicksort - you can separate the list you're sorting into multiple "chunks" and sort each chunk separately in its own thread. SQL engines can do similar things with huge amounts of data in one set-based query.

When you perform cursor-based operations, the engine can only run them sequentially, and the operation has to be single-threaded.
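For instance, in SQL Server a plain set-based aggregate like the sketch below is eligible for a parallel plan; the MAXDOP hint only caps how many processors the optimizer may use (the table and column names are borrowed from the example further down this page, purely for illustration):

-- The optimizer is free to parallelise this scan + aggregate across schedulers;
-- OPTION (MAXDOP 8) merely caps the degree of parallelism it may choose
SELECT AreaCode, COUNT(*) AS Customers
FROM Customer
GROUP BY AreaCode
OPTION (MAXDOP 8)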

Matt Hamilton
A: 

The idea behind preferring to do the work in queries is that the database engine can optimize the query by reformulating it. That's also why you'd want to run EXPLAIN on your query: to see what the db is actually doing (e.g. taking advantage of indices, table sizes, and sometimes even knowledge about the distribution of values in columns).
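In SQL Server, for example, the rough equivalent of EXPLAIN is a showplan option (a minimal sketch; the query itself is just a placeholder):

SET SHOWPLAN_TEXT ON
GO
SELECT PhoneNumber FROM Customer WHERE AreaCode IS NULL
GO
SET SHOWPLAN_TEXT OFF
GO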

That said, to get good performance in your actual concrete case, you may have to bend or break rules.

Oh, another reason might be constraints: Incrementing a unique column by one might be okay if constraints are checked after all the updates, but generates a collision if done one-by-one.
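A minimal T-SQL sketch of that collision, assuming a hypothetical table T with a unique column n currently holding the values 1 and 2:

-- One set-based statement: uniqueness is checked against the statement's
-- end state, so shifting every value up by one succeeds
UPDATE T SET n = n + 1

-- One-by-one: this tries to turn 1 into 2 while the old 2 still exists,
-- so the unique constraint rejects it with a duplicate key error
UPDATE T SET n = n + 1 WHERE n = 1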

Anders Eurenius
A: 

Set-based is done in one operation; a cursor is as many operations as there are rows in the cursor's rowset.

SQLMenace
+4  A: 

Set-based queries are (usually) faster because:

  1. They give the query optimizer more information to work with
  2. They can batch reads from disk
  3. There's less logging involved for rollbacks, transaction logs, etc.
  4. Fewer locks are taken, which decreases overhead
  5. Set-based logic is the focus of RDBMSs, so they've been heavily optimized for it (often at the expense of procedural performance)

Pulling data out to the middle tier to process it can be useful, though, because it takes the processing load off the DB server (which is the hardest thing to scale, and is normally doing other things as well). Also, you normally don't have the same overheads (or benefits) in the middle tier. Things like transaction logging, built-in locking and blocking, etc. are sometimes necessary and useful, and other times just a waste of resources.

A simple example (T-SQL) of a cursor with procedural logic vs. the set-based equivalent, assigning an area code based on the telephone exchange:

--Cursor
DECLARE @phoneNumber char(7)
DECLARE @exchange char(3), @areaCode char(3)

-- Note: FAST_FORWARD would imply READ_ONLY and break WHERE CURRENT OF below,
-- so the cursor is declared as an updatable FORWARD_ONLY cursor instead
DECLARE c CURSOR LOCAL FORWARD_ONLY FOR
   SELECT PhoneNumber FROM Customer WHERE AreaCode IS NULL
OPEN c
FETCH NEXT FROM c INTO @phoneNumber
WHILE @@FETCH_STATUS = 0 BEGIN
   SELECT @exchange = LEFT(@phoneNumber, 3)

   -- Reset first: SELECT @var = ... leaves the variable unchanged when no
   -- row matches, which would silently reuse the previous row's area code
   SET @areaCode = NULL
   SELECT @areaCode = AreaCode
   FROM AreaCode_Exchange
   WHERE Exchange = @exchange

   IF @areaCode IS NOT NULL BEGIN
       UPDATE Customer SET AreaCode = @areaCode
       WHERE CURRENT OF c
   END
   FETCH NEXT FROM c INTO @phoneNumber
END
CLOSE c
DEALLOCATE c

--Set
UPDATE Customer SET
    AreaCode = AreaCode_Exchange.AreaCode
FROM Customer
JOIN AreaCode_Exchange ON
    LEFT(Customer.PhoneNumber, 3) = AreaCode_Exchange.Exchange
WHERE
    Customer.AreaCode IS NULL
Mark Brackett
+2  A: 

I think the real answer is, like most things in programming, that it depends on which one is better. Generally, a set-based language is going to be more efficient, because that is what it was designed to do. There are two places where a cursor is at an advantage:

  1. You are updating a large data set in a database where locking rows is not acceptable (during production hours, maybe). A set-based update has a possibility of locking a table for several seconds (or minutes), whereas a well-written cursor does not. The cursor can meander through the rows, updating one at a time, and you don't have to worry about affecting anything else.

  2. The advantage of using SQL is that the bulk of the optimization work is handled by the database engine in most circumstances; with the enterprise-class db engines, the designers have gone to painstaking lengths to make sure the system is efficient at handling data. The drawback is that SQL is a set-based language: you have to be able to define a set of data to use it. Although this sounds easy, in some circumstances it is not. A query can be so complex that the engine's internal optimizer can't create an effective execution plan, and then your super-powerful box with 32 processors executes the query on a single thread because it doesn't know how to do anything else. You waste processor time on the database server, of which there is generally only one, as opposed to multiple application servers (so, back to reason 1, you run into resource contention with the other things that need to run on the database server). With a row-based language (C#, PHP, Java, etc.), you have more control over what happens: you can retrieve a data set and force it to be processed the way you want (e.g. separate the data set out to run on multiple threads). Most of the time it still won't be as efficient as running it on the database engine, because it still has to access the engine to update each row, but when you have to do 1000+ calculations to update a row (and, let's say, you have a million rows), a database server can start to have problems.

Kevin
+7  A: 

In addition to the above "let the DBMS do the work" (which is a great solution), there are a couple other good reasons to leave the query in the DBMS:

  • It's (subjectively) easier to read. When looking at the code later, would you rather try and parse a complex stored procedure (or client-side code) with loops and things, or would you rather look at a concise SQL statement?
  • It avoids network round trips. Why shove all that data to the client and then shove more back? Why thrash the network if you don't need to?
  • It's wasteful. Your DBMS and app server(s) will need to buffer some/all of that data to work on it. If you don't have infinite memory you'll likely page out other data; why kick out possibly important things from memory to buffer a result set that is mostly useless?
  • Why wouldn't you? You bought (or are otherwise using) a highly reliable, very fast DBMS. Why wouldn't you use it?
Matt Rogish
I agree with Matt. Reading some of Joe Celko's books also helps when making some of these decisions.
Saif Khan
You forgot to mention the query optimization and the declarative nature of SQL; cursors and other row-based approaches make you define exactly how to retrieve/process the data, whereas SQL queries only define what to do - the RDBMS is then free to come up with the best plan based on the statistics (for example, depending on the statistics, an index seek might be a worse or better approach than an index scan; the RDBMS can make that distinction, row-based approaches cannot...)
Unreason
+1  A: 

You wanted some real-life examples. My company had a cursor that took over 40 minutes to process 30,000 records (and there were times when I needed to update over 200,000 records). It took 45 seconds to do the same task without the cursor. In another case I removed a cursor and cut the processing time from over 24 hours to less than a minute. One was an insert that used the values clause instead of a select, and the other was an update that used variables instead of a join. A good rule of thumb is that if it is an insert, update, or delete, you should look for a set-based way to perform the task.
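The set-based shape of those rewrites looks something like this sketch (one INSERT ... SELECT in place of a cursor issuing one INSERT ... VALUES per row; the table names are made up for illustration):

-- One statement inserts every qualifying row at once
INSERT INTO CustomerArchive (CustomerID, PhoneNumber, AreaCode)
SELECT CustomerID, PhoneNumber, AreaCode
FROM Customer
WHERE AreaCode IS NOT NULL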

Cursors have their uses (or the code wouldn't be there in the first place), but they should be extremely rare when querying a relational database (except in Oracle, which is optimized to use them). One place where they can be faster is when doing calculations based on the value of the preceding record (running totals). But even that should be tested.
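For comparison, a running total can also be done set-based with a windowed aggregate (a sketch; it assumes SQL Server 2012 or later, and Orders is an illustrative table):

-- Running total without a cursor: the window frame accumulates
-- Amount in OrderID order
SELECT OrderID,
       Amount,
       SUM(Amount) OVER (ORDER BY OrderID
                         ROWS UNBOUNDED PRECEDING) AS RunningTotal
FROM Orders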

Another limited case for a cursor is batch processing. If you try to do too much at once in set-based fashion, you can lock the table against other users. If you have a truly large set, it may be best to break it up into smaller set-based inserts, updates, or deletes that will not hold the lock too long, and then run through the sets using a cursor.
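One way to sketch that chunking (here with a plain WHILE loop rather than a cursor; the table name, filter, and batch size are illustrative):

WHILE 1 = 1
BEGIN
    -- Each iteration is itself a small set-based delete, so no single
    -- statement holds its locks for very long
    DELETE TOP (5000) FROM AuditLog
    WHERE LoggedAt < '20090101'

    IF @@ROWCOUNT = 0 BREAK  -- nothing left to delete
END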

A third use of a cursor is to run system stored procs through a group of input values. Since this is limited to a generally small set and no one should mess with the system procs, this is an acceptable thing for an administrator to do. I do not recommend doing the same thing with a user-created stored proc in order to process a large batch and reuse code. It is better to write a set-based version that will be a better performer, as performance should trump code reuse in most cases.
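A sketch of that administrative pattern, feeding each table name to a system proc (sp_spaceused here) one call at a time:

DECLARE @name sysname
DECLARE t CURSOR LOCAL FAST_FORWARD FOR
    SELECT name FROM sys.tables
OPEN t
FETCH NEXT FROM t INTO @name
WHILE @@FETCH_STATUS = 0
BEGIN
    EXEC sp_spaceused @name   -- one proc call per input value
    FETCH NEXT FROM t INTO @name
END
CLOSE t
DEALLOCATE t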

HLGEM
Wanted to add this link: http://wiki.lessthandot.com/index.php/Cursors_and_How_to_Avoid_Them
HLGEM
+1  A: 

I think it comes down to using the database as it was designed to be used. Relational database servers are specifically developed and optimized to respond best to questions expressed in set logic.

Functionally, the penalty for cursors will vary hugely from product to product. Some (most?) RDBMSs are built at least partially on top of ISAM engines. If the question is appropriate, and the veneer thin enough, it might in fact be as efficient to use a cursor. But that's one of the things you should become intimately familiar with, in terms of your brand of DBMS, before trying it.

le dorfier
+1  A: 

As has been said, the database is optimized for set operations. Engineers literally sat down and debugged/tuned that database for long periods of time; the chances of you out-optimizing them are pretty slim. There are all sorts of fun tricks you can play with if you have a set of data to work with, like batching disk reads/writes together, caching, and multi-threading. Also, some operations have a high overhead cost, but if you apply them to a bunch of data at once, the cost per piece of data is low. If you are only working one row at a time, a lot of these methods and operations just can't happen.

For example, just look at the way the database joins. By looking at explain plans you can see several ways of doing joins. Most likely, with a cursor you go row by row through one table and then select the values you need from another table; basically it's like a nested loop, only without the tightness of the loop (which is most likely compiled into machine language and super-optimized). SQL Server on its own has a whole bunch of ways of joining. If the rows are sorted, it will use some type of merge algorithm; if one table is small, it may turn that table into a hash lookup table and do the join by performing O(1) lookups from the other table into the lookup table. There are a number of join strategies that many DBMSs have that will beat your looking up values from one table in a cursor.

Just look at the example of creating a hash lookup table. If you are joining two tables, one of length n and one of length m, where m is the smaller table, building the lookup table is probably m operations. Each lookup should be constant time, so that is n operations. So basically the cost of a hash join is around m (setup) + n (lookups). If you do it yourself, assuming no lookups/indexes, then for each of the n rows you will have to search m records (on average, that equates to m/2 searches). So basically the number of operations goes from m + n (joining a bunch of records at once) to m * n / 2 (doing lookups through a cursor); with m = 1,000 and n = 1,000,000, that is roughly a million operations versus 500 million. These figures are simplifications, too: depending on the cursor type, fetching each row of a cursor may be the same as doing another select against the first table.

Locks also kill you. If you have cursors on a table, you are locking up rows (in SQL Server this is less severe for STATIC and FORWARD_ONLY cursors... but the majority of cursor code I see just opens a cursor without specifying any of these options). If you do the operation in a set, the rows will still be locked up, but for a lesser amount of time. Also, the optimizer can see what you are doing, and it may decide it is more efficient to lock the whole table instead of a bunch of rows or pages. But if you go line by line, the optimizer has no idea.
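Specifying those options is a one-line change at declaration time (a sketch: STATIC copies the result set to tempdb when the cursor is opened, so later fetches read the copy instead of holding locks on the base table):

-- Read-only snapshot cursor; fetches touch the tempdb copy, not Customer
DECLARE c CURSOR LOCAL STATIC READ_ONLY FOR
    SELECT PhoneNumber FROM Customer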

The other thing is, I have heard that in Oracle's case cursor operations are super-optimized, so the penalty for cursors versus set-based operations is nowhere near as large in Oracle as it is in SQL Server. I'm not an Oracle expert, so I can't say for sure; but more than one Oracle person has told me that cursors are way more efficient in Oracle. So if you sacrificed your firstborn son for Oracle you may not have to worry about cursors - consult your local highly paid Oracle DBA :)

Cervo
A: 

The REAL answer is: go get one of E.F. Codd's books and brush up on relational algebra. Then get a good book on Big O notation. After nearly two decades in IT this is, IMHO, one of the big tragedies of the modern MIS or CS degree: very few actually study computation. You know... the "compute" part of "computer"? Structured Query Language (and all its supersets) is merely a practical application of relational algebra. Yes, RDBMSs have optimized memory management and read/write, but the same could be said for procedural languages. As I read it, the original question is not about the IDE or the software, but rather about the efficiency of one method of computation vs. another.

Even a quick familiarization with Big O notation will begin to shed light on why, when dealing with sets of data, iteration is more expensive than a declarative statement.

A: 

Simply put, in most cases, it's faster/easier to let the database do it for you.

The database's purpose in life is to store/retrieve/manipulate data in set formats and to be really fast. Your VB.NET/ASP.NET code is likely nowhere near as fast as a dedicated database engine. Leveraging this is a wise use of resources.

sheepsimulator