views: 351
answers: 6

Hello,

I'm looking for any suggestions to optimize the following PROC SQL statement from a SAS program. The two tables involved contain around 5 million records each and the runtime is about 46 hours.

The statement updates a "new" version of the "old" table. It flags a column when a "PK_ID" was listed in the "old" table without a value for "3RD_ID" and "CODE", but that same "PK_ID" is now listed in the "new" table WITH a value for "3RD_ID" and "CODE".

Thanks for any suggestions... (The code really is formatted below! For some reason my spaces aren't showing as indents...)

PROC SQL _METHOD;  
 UPDATE NEW_TABLE AS N  
   SET NEW_2ND_ID=(SELECT 2ND_ID FROM OLD_TABLE AS O  
                WHERE N.PK_ID=O.PK_ID  
                  AND N.2ND_ID<>O.2ND_ID  
                  AND O.3RD_ID IS NULL  
                  AND O.CODE IS NULL  
                  AND N.3RD_ID IS NOT NULL  
                  AND N.CODE IS NOT NULL  
                  AND N.2ND_ID IS NOT NULL)  
        WHERE N.3RD_ID IS NOT NULL  
          AND N.PK_ID IS NOT NULL  
          AND N.CODE IS NOT NULL  
          AND N.2ND_ID IS NOT NULL;  
QUIT;   
A: 

Right now this is a nested (correlated) subquery, so the inner SELECT will be fired once for every record that matches the WHERE condition.

However, I would recommend going for a SQL UPDATE with a JOIN. An explanation can be found here: http://bytes.com/topic/oracle/answers/65819-sql-update-join-syntax.

Once you have the UPDATE with a JOIN in place, apply proper indexes.

Also note that indexes which include 2ND_ID should not be in place during the update; disable (or drop) them and rebuild them afterwards, since this can be a mass data update.
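If the tables are native SAS data sets, the index part of this advice translates to dropping and re-creating the index, since SAS has no disable/rebuild. A minimal PROC SQL sketch, with illustrative names only (second_id stands in for 2ND_ID, which would not be a valid SAS variable name):

proc sql;
   /* drop the simple index on the column being mass-updated */
   drop index second_id from new_table;

   /* ... run the rewritten update here ... */

   /* re-create the index once the update has finished */
   create index second_id on new_table(second_id);
quit;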

Nitin Midha
Not every DBMS treats correlated subqueries as individual queries. SQL Server 2005 and up is pretty good about converting most correlated subqueries to regular joins.
Emtucifor
A: 
UPDATE (
    SELECT O.2ND_ID AS OLD_2ND_ID, N.2ND_ID AS NEW_2ND_ID
    FROM OLD_TABLE O
    INNER JOIN NEW_TABLE N ON O.PK_ID = N.PK_ID
    WHERE N.2ND_ID <> O.2ND_ID  
    AND O.3RD_ID IS NULL  
    AND O.CODE IS NULL  
    AND N.3RD_ID IS NOT NULL  
    AND N.CODE IS NOT NULL  
    AND N.2ND_ID IS NOT NULL
) t 
SET t.NEW_2ND_ID = t.OLD_2ND_ID
RedFilter
Will only work if `pk_id` is covered by a `PRIMARY KEY` or a `UNIQUE` index in `old_table`.
Quassnoi
@Quassnoi: If that is a problem, then the initial poster's query was wrong, too.
Emtucifor
@Emtucifor: the original query would work if the `pk_id` values were unique, even if the column were not declared as `UNIQUE`. This query requires that `old_table.pk_id` is declared as `UNIQUE`, otherwise it will fail with infamous `ORA-01779`.
Quassnoi
@Quassnoi: Thanks for explaining! I guess I was ignorant. I use Oracle seldom but have run into this exact problem there, and after much research I never found a good way to do it (I think I eventually used one of the array-like data types in Oracle). For what it's worth, in SQL Server you can do a non-unique update no problem, but the poster's original query would fail if the subquery ever returned more than one inner row for any outer row.
Emtucifor
@Emtucifor: the goal of the key preservation is to make sure that each row in the target table will be returned at most once.
Quassnoi
@Quassnoi, yes I understand it, but sometimes it either doesn't matter which row is used for the update, or the update IS unique but it can't be proven to the server easily.
Emtucifor
+1  A: 

Don't use UPDATE. Create a similar new table and use an INSERT INTO (fields) SELECT fields FROM both tables.

  • Drop indexes before running the query.
  • Drop triggers before running the query.

Something like this, where NEW_TABLE_V2 stands for the newly created copy:

insert into NEW_TABLE_V2 (field1, field2, NEW_2ND_ID)  
select field1, field2, (SELECT 2ND_ID FROM OLD_TABLE....) from NEW_TABLE
  • Recreate indexes after query finished.
  • Recreate triggers after query finished.

(In the end you will replace your existing table with this new table.)
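In the question's PROC SQL, a fuller sketch of this rebuild-instead-of-update idea could look like the following. It swaps the correlated subquery for a left join (as suggested in the other answers), uses illustrative names (second_id / third_id / new_second_id stand in for 2ND_ID / 3RD_ID / NEW_2ND_ID, which would not be valid SAS names), and assumes pk_id is unique in old_table so the join cannot multiply rows:

proc sql;
   /* build the replacement table in one pass instead of updating in place */
   create table work.new_table_v2 as
   select n.pk_id,
          n.third_id,
          n.code,
          n.second_id,
          /* take the old value only where the old row had no third_id/code;
             otherwise keep whatever new_second_id already holds */
          coalesce(o.second_id, n.new_second_id) as new_second_id
   from new_table as n
        left join old_table as o
          on  n.pk_id = o.pk_id
          and o.third_id is null
          and o.code is null
          and n.second_id ne o.second_id;
   /* carry over any remaining NOT NULL conditions from the original query
      into the ON clause so the semantics match exactly */
quit;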

Pentium10
I see you are posting comments on other proposed answers but haven't on mine so far. Have you tried the solution I proposed?
Pentium10
I have already used part of your solution: I created a sub-set table with the contents needed from the "old" table. I will look at creating an insert if I am unable to find a way to optimize the update.
Andy
Please consider that `update` is slow because of lock usage and I/O seeks, compared to a `select all` / `insert all`. Sometimes it's better to flow the data into a separate table.
Pentium10
+3  A: 

I am not familiar with the variant of SQL you're using. However, whether you get better performance or not, you should be using ANSI join syntax. Here is what it would look like in T-SQL; modify it for your system:

UPDATE N
SET N.2ND_ID = O.2ND_ID
FROM
   NEW_TABLE AS N
   INNER JOIN OLD_TABLE AS O ON N.PK_ID = O.PK_ID
WHERE
   N.2ND_ID <> O.2ND_ID
   AND N.3RD_ID IS NOT NULL  
   AND O.3RD_ID IS NULL  
   AND N.CODE IS NOT NULL  
   AND O.CODE IS NULL

Note that the extra conditions I removed aren't necessary; for example, N.2ND_ID <> O.2ND_ID already guarantees that neither of those two columns is null.

However, on two 5-million row tables you're going to get abysmal performance. Here are some ideas to speed it up. I bet that you can get this down to under an hour with the right combination of strategies.

  1. Split the update into batches (small pieces, looping through the entire set). While this sounds counter to the normal database advice of "don't loop, use sets" it really isn't: you're just using smaller sets, not looping at the row level. The best way to batch an update like this is to "walk the clustered index." I'm not sure if that term makes sense in the DBMS you're using, but essentially it means to choose the chunks you update during each loop based on the order they will be found in the table object that you're updating. PK_ID sounds like it is the candidate to use, but if the raw table data is not ordered by this column then it will become more complicated. In T-SQL a batching loop might look like this:

    DECLARE
       @ID int,
       @Count int
    
    
    SET @ID = 1
    SET @Count = 1
    
    
    WHILE @Count > 0 BEGIN
       UPDATE N
       SET N.2ND_ID = O.2ND_ID
       FROM
          NEW_TABLE AS N
          INNER JOIN OLD_TABLE AS O ON N.PK_ID = O.PK_ID
       WHERE
          N.2ND_ID <> O.2ND_ID
          AND N.3RD_ID IS NOT NULL  
          AND O.3RD_ID IS NULL  
          AND N.CODE IS NOT NULL  
          AND O.CODE IS NULL
          AND N.PK_ID BETWEEN @ID AND @ID + 4999
       SET @Count = @@RowCount
       SET @ID = @ID + 5000
    END
    

    This example assumes that your PK_ID column is densely packed, so that each update truly hits about 5000 rows. If that's not the case, switch to a method using TOP 5000 and either output the updated PK_IDs into a table, or find the @StartID and @EndID for the next update in one step, then perform it.

    In my experience, good batch sizes tend to be between 1000 and 20000 rows. In MS SQL Server, the sweet spot seems to be just below the number that forces a switch from a seek to a scan (because eventually the db engine assumes that a single scan is cheaper than a multitude of seeks, though it is often wrong when dealing with 5-million-row tables).

  2. Select the IDs and data to be updated into a working/temp table first, then join to that. The idea is to take the hit of the huge scan with a simple INSERT statement, then add indexes to the temp table and perform the update without needing a complicated WHERE clause. Once the table contains only the rows to be updated and the columns required, not only can the WHERE clause lose most of its conditions, but the temp table has many fewer rows and many more rows per page (because it has no extraneous columns), which will greatly improve performance. This can even be done in stages where a "shadow" of the New table is created, then a "shadow" of the Old table, then the join between them, and finally a join back to the New table to update it. While this sounds like a lot of work, I think you will be surprised at the totally-crazy speed of completion this can offer. (A rough sketch of this staging idea follows after this list.)

  3. Anything you can do to convert the read from the old table to be seeks instead of scans will help. Anything you can do to reduce the amount of disk used to hold temp data (such as gigantic hash tables for 5 million rows) will help.
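
Translated into the question's PROC SQL, the staging idea in point 2 might look roughly like this. The names are illustrative (work.stage and fixed_second_id are made up; second_id / new_second_id stand in for 2ND_ID / NEW_2ND_ID, which would not be valid SAS names), and it assumes pk_id is unique in old_table:

proc sql;
   /* take the one big scan up front: a small, narrow table of only the rows to fix */
   create table work.stage as
   select n.pk_id, o.second_id as fixed_second_id
   from new_table as n
        inner join old_table as o
          on n.pk_id = o.pk_id
   where n.second_id ne o.second_id
     and o.third_id is null     and o.code is null
     and n.third_id is not null and n.code is not null
     and n.second_id is not null;

   /* index the staging table so the update below does keyed lookups */
   create index pk_id on work.stage(pk_id);

   /* the update now only has to touch rows present in the small staging table */
   update new_table as n
      set new_second_id = (select fixed_second_id
                           from work.stage as s
                           where s.pk_id = n.pk_id)
    where n.pk_id in (select pk_id from work.stage);
quit;

Because work.stage holds only the keys and the replacement value, the correlated lookup in the UPDATE hits a small indexed table instead of the 5-million-row OLD_TABLE.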

Emtucifor
+1 for batch approach (partitioning). Updating very large sets may create huge transactional/locking overhead and/or I/O activity.
Arvo
+1 also for the partitioning. A single transaction of 5 million rows is a nightmare for the database engine. Most probably the memory buffers fill up and the changes are flushed to disk before knowing whether the transaction will be committed or rolled back.
Lluis Martinez
For some strange reason SAS Proc SQL doesn't allow joins with updates. Such a huge handicap.
Ville Koskinen
@Ville Koskinen: Are you sure it doesn't work with some slightly different syntax than I gave here?
Emtucifor
@Emtucifor, unfortunately so; see e.g. http://bit.ly/bnYXgU. However, if the data is in an RDBMS, you can of course pass the SQL through to the database (in which case your suggestions are very valid). And if the data is native SAS, there are good non-SQL methods for updating data.
Ville Koskinen
@Ville Koskinen: Thanks for clarifying.
Emtucifor
A: 

You can also try putting the new table on a different device than the old one, to take advantage of parallel I/O. If you can convince the DBA, of course.

Lluis Martinez
+1  A: 

All of the answers so far are firmly oriented toward the SQL part of your question but neglect the SAS part to some extent. I would strongly recommend trying a data step UPDATE/MODIFY/MERGE instead of PROC SQL for this kind of update. It should be possible to sort both tables and apply logic similar to your SQL to ensure that the correct rows/columns are updated.

I've seen similar kinds of updates run in a matter of minutes on 20 million or more rows.
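
A minimal sketch of what the sort-and-merge version could look like, with illustrative names (second_id / third_id / new_second_id stand in for the question's 2ND_ID / 3RD_ID / NEW_2ND_ID, which would not be valid SAS variable names; old_slim and fixed_second_id are made up), assuming pk_id matches at most one qualifying row in OLD_TABLE:

/* keep only the old rows that can supply a value, and only the columns needed */
data old_slim;
   set old_table(keep=pk_id second_id third_id code
                 rename=(second_id=fixed_second_id));
   if missing(third_id) and missing(code);
   keep pk_id fixed_second_id;
run;

proc sort data=old_slim;              by pk_id; run;
proc sort data=new_table out=new_srt; by pk_id; run;

/* match-merge by pk_id and update the flag column where the conditions hold */
data new_fixed;
   merge new_srt(in=in_new) old_slim(in=in_old);
   by pk_id;
   if in_new;                        /* keep every row of the new table */
   if in_old and not missing(third_id) and not missing(code)
      and not missing(second_id) and second_id ne fixed_second_id
      then new_second_id = fixed_second_id;
   drop fixed_second_id;
run;

Because both inputs are sorted once and then read sequentially, this avoids the per-row correlated lookup that makes the PROC SQL version so slow.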

Also, check out http://runsubmit.com , a SAS specific stackoverflow style site, for more SAS specific answers.

Disclosure: I'm a SAS employee. I have nothing to do with runsubmit, which is independently run.

Rog
Thank you. I finally bit the bullet and broke this waaaaaaaaaay down into data sets as small as possible, then a merge, and finally an update. It is running smoothly now. Unfortunately, it is having to do these types of "extra" data steps that I dislike about using SAS. My code is now over a screen long to perform a simple update statement. If I had to do it all over again, I would just use composite hashes throughout the whole program.
Andy
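For reference, one possible shape of the data step hash lookup Andy alludes to, as a hedged sketch (work.old_keep and fixed_second_id are hypothetical and would be built like old_slim in the merge sketch above; second_id / new_second_id stand in for the question's 2ND_ID / NEW_2ND_ID; it assumes the needed slice of OLD_TABLE fits in memory with pk_id unique):

data new_fixed;
   if _n_ = 1 then do;
      if 0 then set work.old_keep;            /* put the lookup variables into the PDV */
      declare hash h(dataset: "work.old_keep");
      h.defineKey("pk_id");
      h.defineData("fixed_second_id");
      h.defineDone();
   end;
   set new_table;
   /* overwrite the flag column only when a matching old row supplies a different value */
   if h.find() = 0 and not missing(third_id) and not missing(code)
      and not missing(second_id) and second_id ne fixed_second_id
      then new_second_id = fixed_second_id;
   drop fixed_second_id;
run;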