We need to perform the following operation in our database:

There is a table A with a column B_ID that is a foreign key to table B. Many rows in table A share the same B_ID value, and we want to fix this by cloning the corresponding rows in B and redirecting the rows in A to the clones.

All this is relatively simple, and we have already written a script that solves it by iterating over a cursor and calling a stored procedure that clones the row in table B. The problem is that both A and B are huge, and there is also a huge number of groups within table A pointing to the same row in B.

What we end up with (after a couple of minutes of execution) is a full transaction log and a crash. We have even tried dividing the work into batches of reasonable size and running them one by one, but this also eventually fills up the log.

Apart from somehow cleaning up the log, is there some way to handle bulk inserts/updates of data in SQL Server that would be faster and not blow up the log at all?

A: 

If you are moving from a many-to-one (many A to one B) relationship to a one-to-one (one A to one B), then it seems to me that the simplest route would be to create the fields in A to support this, then do a simple update on A to copy the values from B into it.

This way you get rid of B altogether, and you can perform the change in one update query. Something like:

update A SET
  col1 = B.col1,
  col2 = B.col2
from tableA A
inner join tableB B on (B.ID = A.B_ID)
Chris Lively
Oh, that's not possible actually; table B is referenced by many other tables as well.
Thomas Wanner
@Thomas: Ok, the need for the cursor makes sense now.
Chris Lively
See Mitchel's answer.
Chris Lively
+1  A: 

If you can take the operation offline you could change the recovery model of the database, make your changes, then change the recovery model back.

Overall, though, the transaction log is there to protect you and to allow for rollback, and it will grow as you do deletes and other changes because it tracks them.

NOTE: if you use this approach, be sure to have a darn good backup first....
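
For example, a minimal sketch of the recovery-model switch (the database name and backup paths are placeholders):

```sql
-- Take a full backup first, then minimize logging for the bulk work
BACKUP DATABASE MyDb TO DISK = 'D:\Backups\MyDb.bak';

ALTER DATABASE MyDb SET RECOVERY SIMPLE;  -- or BULK_LOGGED

-- ... run the cloning / redirecting batches here ...

ALTER DATABASE MyDb SET RECOVERY FULL;
-- Take a full (or differential) backup afterwards to restart the log backup chain
BACKUP DATABASE MyDb TO DISK = 'D:\Backups\MyDb_after.bak';
```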

Mitchel Sellers
And do not do this unless you can put the database into single user mode.
HLGEM
Upvoting this as the only reasonable way to approach the problem because the cursor is apparently necessary due to duplicating not just B but other related tables as well.
Chris Lively
It is a reasonable way, but it is not the *only* reasonable way - it's completely conceivable to use some small temp tables instead of cursors.
Aaronaught
+1  A: 

I can't conceive of why you would want to do this. What's wrong with the current one-to-many relationship? Aren't you now going to have much larger tables to perform all your work against?

However, given that you want to do this: first, are you taking transaction log backups, and how frequently? If it is less frequently than every fifteen minutes, change that. When you back up the log, the log gets truncated; if you do not back up the log, it grows until you run out of space. Also, perhaps the growth increment you specify for the log is too small. Increasing that might help you out as well.
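
For example, a log backup along these lines (database name and path are placeholders) could run every fifteen minutes from a SQL Agent job; each backup truncates the inactive portion of the log:

```sql
BACKUP LOG MyDb TO DISK = 'D:\LogBackups\MyDb_log.trn';
```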

You can try doing the work in SSIS but I don't know if that will help the logging issue really. It will help improve performance in doing the task though.

HLGEM
+1  A: 

I'm not sure how this will work on lots of rows, but give it a try:

DECLARE @TableA table (RowID int, B_ID int)
INSERT INTO @TableA VALUES (1,1)
INSERT INTO @TableA VALUES (2,1) --need to copy
INSERT INTO @TableA VALUES (3,2)
INSERT INTO @TableA VALUES (4,2) --need to copy
INSERT INTO @TableA VALUES (5,2) --need to copy
INSERT INTO @TableA VALUES (6,1) --need to copy
INSERT INTO @TableA VALUES (7,3)
INSERT INTO @TableA VALUES (8,3) --need to copy
DECLARE @TableB table (B_ID int, BValues varchar(10))
INSERT INTO @TableB VALUES (1,'one')
INSERT INTO @TableB VALUES (2,'two')
INSERT INTO @TableB VALUES (3,'three')

DECLARE @Max_B_ID int
SELECT @Max_B_ID=MAX(B_ID) FROM @TableB

--if you are using IDENTITY, turn them off here
INSERT INTO @TableB 
        (B_ID, BValues)
        --possibly capture the data to eliminate duplication??
        --OUTPUT INSERTED.tableID, INSERTED.datavalue
        --INTO @y 
    SELECT
        dt.NewRowID, dt.BValues
        FROM (SELECT 
                  RowID, a.B_ID
                      ,@Max_B_ID+ROW_NUMBER() OVER(order by a.B_ID) AS NewRowID,b.BValues
                  FROM (SELECT
                            RowID, B_ID
                            FROM (SELECT 
                                      RowID, a.B_ID, ROW_NUMBER() OVER(PARTITION by a.B_ID order by a.B_ID) AS RowNumber
                                      FROM @TableA a
                                 ) dt
                            WHERE dt.RowNumber>1
                       )a
                      INNER JOIN @TableB  b ON a.B_ID=b.B_ID
             ) dt


UPDATE aa
    SET B_ID=NewRowID
    FROM @TableA   aa
        INNER JOIN (SELECT
                        dt.NewRowID, dt.BValues,dt.RowID
                        FROM (SELECT 
                                  RowID, a.B_ID
                                      ,@Max_B_ID+ROW_NUMBER() OVER(order by a.B_ID) AS NewRowID,b.BValues
                                  FROM (SELECT
                                            RowID, B_ID
                                            FROM (SELECT 
                                                      RowID, a.B_ID, ROW_NUMBER() OVER(PARTITION by a.B_ID order by a.B_ID) AS RowNumber
                                                      FROM @TableA a
                                                 ) dt
                                            WHERE dt.RowNumber>1
                                       )a
                                      INNER JOIN @TableB  b ON a.B_ID=b.B_ID
                             ) dt
                   ) dt2 ON aa.RowID=dt2.RowID

SELECT * FROM @TableA
SELECT * FROM @TableB

OUTPUT:

RowID       B_ID
----------- -------
1           1
2           4
3           2
4           6
5           7
6           5
7           3
8           8

(8 row(s) affected)

B_ID        BValues
----------- -------
1           one
2           two
3           three
4           one
5           one
6           two
7           two
8           three

(8 row(s) affected)
KM
A: 

This is what I do:

Create a query that returns the data from the two tables (A, B) exactly as it needs to be in the final table (C) and put that into the ExtractData.sql file:

select
    A.id,
    A.xxx,
    A.yyy,
    B.*
from
   A

   JOIN B
     on B.id = A.id

Then in a cmd window, execute this command to extract the data to a file:

sqlcmd.exe -S [Server] -U [user] -P [pass] -d [dbname] -i ExtractData.sql -s "|" -h -1 -W -o ExtractData.dat

To avoid filling your logs, try setting the DB recovery model to SIMPLE before the insert:

ALTER DATABASE [database name] SET RECOVERY SIMPLE

Then do a TRUNCATE TABLE C (if you need to clean out old data; truncation is minimally logged, unlike deletes).

Then in a cmd window, execute this command to bulk load the data into table C:

bcp.exe dbname.dbo.C in ExtractData.dat -S [Server] -U [user] -P [pass] -t "|" -e ExtractData.err -r \n -c

Error records will show up in the ExtractData.err file, so if you need to adjust the schema of table C you can adjust / truncate / re-load the extracted data so you don't need to run the query every time.

Then set the recovery model back to FULL after you are done:

ALTER DATABASE [database name] SET RECOVERY FULL
Ron Savage
+1  A: 

Here's another way to do this in a batch (no cursors). @KM's answer looks like it should work, but it seems a little slow/scary to me, with lots of locking and scans involved; if you restrict the working set to only the new rows, it should be pretty fast.

Here's the setup script for the test data:

CREATE TABLE Colors
(
    ColorID int NOT NULL IDENTITY(1, 1) PRIMARY KEY,
    ColorName varchar(50) NOT NULL
)

CREATE TABLE Markers
(
    MarkerID int NOT NULL IDENTITY(1, 1) PRIMARY KEY,
    MarkerName varchar(50) NOT NULL,
    ColorID int NOT NULL,
    CONSTRAINT FK_Markers_Colors FOREIGN KEY (ColorID)
        REFERENCES Colors (ColorID)
)

INSERT Colors (ColorName) VALUES ('Red')
INSERT Colors (ColorName) VALUES ('Green')
INSERT Colors (ColorName) VALUES ('Blue')

INSERT Markers (MarkerName, ColorID) VALUES ('Test1', 1)
INSERT Markers (MarkerName, ColorID) VALUES ('Test2', 1)
INSERT Markers (MarkerName, ColorID) VALUES ('Test3', 1)
INSERT Markers (MarkerName, ColorID) VALUES ('Test4', 2)
INSERT Markers (MarkerName, ColorID) VALUES ('Test5', 2)
INSERT Markers (MarkerName, ColorID) VALUES ('Test6', 3)
INSERT Markers (MarkerName, ColorID) VALUES ('Test7', 3)

So we have a 1:Many and we want to make this a 1:1. To do this, first queue up a list of updates (we'll index this over some other set of unique columns to speed up merging later):

CREATE TABLE #NewColors
(
    MarkerID int NOT NULL,
    ColorName varchar(50) NOT NULL,
    Seq int NOT NULL,
    CONSTRAINT PK_#NewColors PRIMARY KEY (MarkerID)
)

CREATE INDEX IX_#NewColors
ON #NewColors (ColorName, Seq);

WITH Refs AS
(
    SELECT
        MarkerID,
        ColorID,
        ROW_NUMBER() OVER (PARTITION BY ColorID ORDER BY (SELECT 1)) AS Seq
    FROM Markers
)
INSERT #NewColors (MarkerID, ColorName, Seq)
SELECT r.MarkerID, c.ColorName, r.Seq - 1
FROM Refs r
INNER JOIN Colors c
    ON c.ColorID = r.ColorID
WHERE r.Seq > 1

The result will have one row for every marker that needs to get a new colour. Then insert the new colours and capture the full output:

DECLARE @InsertedColors TABLE
(
    ColorID int NOT NULL PRIMARY KEY,
    ColorName varchar(50) NOT NULL
)

INSERT Colors (ColorName)
OUTPUT inserted.ColorID, inserted.ColorName
INTO @InsertedColors
    SELECT ColorName
    FROM #NewColors nc;

And finally merge it (here's where that extra index on the temp table comes in handy):

WITH InsertedColorSeq AS
(
    SELECT
        ColorID, ColorName,
        ROW_NUMBER() OVER (PARTITION BY ColorName ORDER BY ColorID) AS Seq
    FROM @InsertedColors
),
Updates AS
(
    SELECT nc.MarkerID, ic.ColorID AS NewColorID
    FROM #NewColors nc
    INNER JOIN InsertedColorSeq ic
    ON ic.ColorName = nc.ColorName
    AND ic.Seq = nc.Seq
)
MERGE Markers m
USING Updates u
    ON m.MarkerID = u.MarkerID
WHEN MATCHED THEN
    UPDATE SET m.ColorID = u.NewColorID;

DROP TABLE #NewColors

This should be very efficient because it only ever has to query the production tables once. Everything else will be operating on the relatively small data in the temp tables.

Test the results:

SELECT m.MarkerID, m.MarkerName, c.ColorID, c.ColorName
FROM Markers m
INNER JOIN Colors c
    ON c.ColorID = m.ColorID

Here's our output:

MarkerID     MarkerName   ColorID   ColorName
1            Test1        1         Red
2            Test2        6         Red
3            Test3        7         Red
4            Test4        2         Green
5            Test5        5         Green
6            Test6        3         Blue
7            Test7        4         Blue

This should be what you want, right? No cursors, no serious ugliness. If it chews up too much memory or tempdb space then you can replace the temp table / table variable with an indexed physical staging table. Even with several million rows, there's no way this should fill up the transaction log and crash.

Aaronaught
I'd be scared to make large #temp tables; it depends on how large they are, but the OP never really says. I can't imagine running this in anything other than single user mode, so locking will not be an issue. Also, I don't think MERGE is available in SQL Server 2005.
KM
@KM: Ah, I missed that this is SQL 2005. You don't really need `MERGE` semantics here, it can be replaced with an `UPDATE FROM`. OP says that the production tables are 1M rows so I imagine that the temp tables would be far smaller; if that's not the case, then I did highlight the possibility of using physical staging tables instead. For 1M rows this shouldn't be a big deal at all as long as the indexing is good, maybe a couple of minutes at most to run the query. Anyway, it's obvious that we don't have all the info, I just provided an alternative version that *might* be more efficient.
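
A sketch (untested) of that `UPDATE FROM` replacement for the final `MERGE` step, using the same `@InsertedColors` and `#NewColors` from the answer above:

```sql
WITH InsertedColorSeq AS
(
    SELECT
        ColorID, ColorName,
        ROW_NUMBER() OVER (PARTITION BY ColorName ORDER BY ColorID) AS Seq
    FROM @InsertedColors
)
UPDATE m
SET m.ColorID = ic.ColorID
FROM Markers m
INNER JOIN #NewColors nc
    ON nc.MarkerID = m.MarkerID
INNER JOIN InsertedColorSeq ic
    ON ic.ColorName = nc.ColorName
    AND ic.Seq = nc.Seq;
```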
Aaronaught