views: 748
answers: 5

I have a couple of large tables (188m and 144m rows) I need to populate from views, but each view contains a few hundred million rows (pulling together pseudo-dimensionally modelled data into a flat form). The key on each table is a composite of columns totalling over 50 bytes. If the data were in tables, I could consider using sp_rename to swap in a newly built table, but that isn't really an option here.

If I do a single INSERT operation, the process uses a huge amount of transaction log space, typically filling it up and prompting a bunch of hassle with the DBAs. (And yes, this is probably a job the DBAs should handle/design/architect.)

I can use SSIS and stream the data into the destination table with batch commits (but this does require the data to be transmitted over the network, since we are not allowed to run SSIS packages on the server).

Are there any options other than dividing the process into multiple INSERT operations, using some kind of key to distribute the rows into different batches, and doing a loop?

+1  A: 

You could partition your data and insert it in a cursor loop. That would be nearly the same as SSIS batch inserting, but it runs on your server.

declare @year int, @month int;
declare batch_cursor cursor for
    select distinct YEAR(DateCol), MONTH(DateCol) from whatever;
open batch_cursor;
fetch next from batch_cursor into @year, @month;
while @@FETCH_STATUS = 0
begin
    insert into yourtable (...)
    select * from whatever
    where YEAR(DateCol) = @year and MONTH(DateCol) = @month;
    fetch next from batch_cursor into @year, @month;
end
close batch_cursor;
deallocate batch_cursor;
Arthur
This was ultimately the strategy I used.
Cade Roux
+1  A: 

You could use the BCP command to load the data, using the batch size parameter.

http://msdn.microsoft.com/en-us/library/ms162802.aspx

It's a two-step process:

  • BCP OUT the data from the views into text files
  • BCP IN the data from the text files into the tables, using the batch size parameter (rough sketch below)
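
Something like this, as a rough sketch (the server name, database, path, and the 10,000-row batch size are just placeholders; -c uses character format, -T a trusted connection, and -b sets the rows committed per batch):

rem step 1: BCP OUT the view into a character-format data file
bcp MyDB.dbo.HugeView out E:\load\hugeview.dat -c -S MyServer -T

rem step 2: BCP IN the data file into the table, committing every 10,000 rows
bcp MyDB.dbo.PrimaryTable in E:\load\hugeview.dat -c -b 10000 -S MyServer -T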
Raj More
13 seconds! *shakes tiny fist*
Chris McCall
How did you get the seconds part? It only shows me minutes
Raj More
@Raj: hover over the relative timestamps to get actual timestamps with seconds
Michael Haren
It changes to minutes, then hours, then days as time goes on. Pretty slick; never thought about it before.
Chris McCall
+1  A: 

This looks like a job for good ol' BCP.

Chris McCall
+1  A: 

There is no pixie dust, you know that.

Without knowing specifics about the actual schema being transferred, a generic solution would be exactly as you describe it: divide the processing into multiple inserts and keep track of the key(s). This is sort of pseudo-code T-SQL:

-- seed currentKeys with a value lower than the smallest key in <source> before the first run
create table currentKeys ([table] sysname not null primary key, [key] sql_variant not null);
go

declare @keysInserted table ([key] sql_variant);
declare @key sql_variant;
begin transaction
while (1=1)
begin
    select @key = [key] from currentKeys where [table] = '<target>';
    insert into <target> (...)
    output inserted.[key] into @keysInserted ([key])
    select top (<batchsize>) ... from <source>
    where [key] > @key
    order by [key];

    if (0 = @@rowcount)
       break;

    update currentKeys
    set [key] = (select max([key]) from @keysInserted)
    where [table] = '<target>';
    commit;
    delete from @keysInserted;
    set @key = null;
    begin transaction;
end
commit

It would get more complicated if you want to allow for parallel batches and partition the keys.

Remus Rusanu
Actually, ground-up silver bullets make excellent pixie dust
Philip Kelley
ty mr Kelley, I'll keep that in mind on my next project ;)
Remus Rusanu
+2  A: 

Does the view have ANY kind of unique identifier / candidate key? If so, you could select those rows into a working table using:

SELECT key_columns INTO dbo.temp FROM dbo.HugeView;

(If it makes sense, maybe put this table into a different database, perhaps with SIMPLE recovery model, to prevent the log activity from interfering with your primary database. This should generate much less log anyway, and you can free up the space in the other database before you resume, in case the problem is that you have inadequate disk space all around.)
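
Something like this, for example (the StagingDB name is just a placeholder):

-- separate, minimally logged database for the working table
CREATE DATABASE StagingDB;
ALTER DATABASE StagingDB SET RECOVERY SIMPLE;

-- land the key list there instead of in the primary database
SELECT key_columns INTO StagingDB.dbo.temp FROM dbo.HugeView;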

Then you can do something like this, inserting 10,000 rows at a time, and backing up the log in between:

SET NOCOUNT ON;

DECLARE
    @batchsize INT,
    @ctr INT,
    @rc INT;

SELECT
    @batchsize = 10000,
    @ctr = 0;

WHILE 1 = 1
BEGIN
    WITH x AS
    (
        SELECT key_column, rn = ROW_NUMBER() OVER (ORDER BY key_column)
        FROM dbo.temp
    )
    INSERT dbo.PrimaryTable(a, b, c, etc.)
        SELECT v.a, v.b, v.c, etc.
        FROM x
        INNER JOIN dbo.HugeView AS v
        ON v.key_column = x.key_column
        WHERE x.rn > @batchsize * @ctr
        AND x.rn <= @batchsize * (@ctr + 1);

    IF @@ROWCOUNT = 0
        BREAK;

    SET @ctr = @ctr + 1;

    BACKUP LOG PrimaryDB TO DISK = 'C:\db.bak' WITH INIT;
END

That's all off the top of my head, so don't cut/paste/run, but I think the general idea is there.

Note that if you are taking regular database and log backups, you will probably want to take a full backup afterward to start your log chain over again.
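
For example (the backup path is just a placeholder):

-- restart the log chain once the load is finished
BACKUP DATABASE PrimaryDB TO DISK = 'C:\PrimaryDB_full.bak' WITH INIT;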

Aaron Bertrand
That's funny, because before I built most of this system on the server, while I was still developing on my box, I had already divided the tables into categories by recovery objective and put them in separate schemas, in the hope that the DBAs would give me multiple databases, but it was not to be...
Cade Roux
And unfortunately the keys are all composites of over 50 bytes of columns; guess it's time to get it done...
Cade Roux