views:

145

answers:

5

I have a new project with the following setup and requirements:

My client has an MSSQL 2005 server (A) in their office. Their vendor has an MSSQL 2005 server (B) in another part of the world, which contains real-time transactional data. My client wants to load the data from (B) to (A) on a daily basis during non-office hours. They have datareader access to (B), but that is about it: the vendor will not be doing replication, log shipping, etc., and my client is solely responsible for getting their own data so that they can run their own reports/cubes.

The script I used is as follows, using distributed T-SQL and a linked server to (B):

DECLARE @sqlCommand        VARCHAR(2000)
DECLARE @LastProcessedDate DATETIME

-- run the following code for Table 1 to Table XX

SELECT @LastProcessedDate = LastProcessedDate 
  FROM [ProcessControl] 
 WHERE TableName = 'Table1'

-- the DATETIME has to be converted to a string before it can be concatenated
SET @sqlCommand = 'INSERT INTO Table1 
                   SELECT * 
                     FROM OPENQUERY(VendorsLinkedServerName, 
                          ''SELECT * 
                              FROM Table1 
                             WHERE LastModified >= '''''
                  + CONVERT(VARCHAR(23), @LastProcessedDate, 121)
                  + ''''''')'

-- parentheses are required to execute the string rather than a stored procedure
EXEC (@sqlCommand)

I did an initial trial for the 10 largest tables with one full day of data and it took 1 hour, which is too long. Also, for the test I have already removed all indexes and constraints except the primary key (which comprises 1-4 BIGINT columns) for the tables. Any suggestions on how I can speed up the load time or otherwise go about loading the data?

Edit: just to add, in case you wonder why the SELECT statement was written this way: in the above example, Table1 in (A) is in an ETL database, and the data will subsequently be compared to determine the inserts/updates/deletes for the actual reporting database in (A).

A: 

Is the performance due to network latency rather than the approach you're taking? What kind of volumes are you dealing with, etc.?

I note they won't do replication or log shipping, but could you talk them into doing some scheduled bulk exports, which could be compressed and sent across for an automated routine at the other end to do a bulk insert?

Chris W
Yes, network latency is one of the problems, but due to cost constraints my client isn't going to get a dedicated leased line :-( The volume so far (after running the test for different days) seems to range from 10-100 MB. I tried asking the vendor for scheduled exports but hit a brick wall there too.
jj
It sounds like you have very limited options then. The only other thing to suggest is trialling your query inside SSIS and playing with batch sizes etc. to see if it helps speed things up. As the network is the main issue, there's not much you can do without getting a bit-level replication tool in place, but since you're not allowed replication or log shipping I'd guess you're not going to be able to try something like Double Take.
Chris W
A: 

Try the query locally (or try having them run the query locally) and see how long it takes. As Chris indicated, it could be network latency. Do you have any ability to roll out an SSIS package on the vendor's side? If so, you could extract and compress the data, ship it across via an FTP task or another mechanism, and unzip/insert the data on your end.
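
On the receiving end, the insert step could look something like this (a rough sketch only; the file path, delimiters and options are assumptions about whatever export format the vendor can produce):

-- Hypothetical load of an extracted/uncompressed flat file into the ETL table.
-- TABLOCK and a batch size help keep logging overhead down during the bulk load.
BULK INSERT dbo.Table1
FROM 'D:\Incoming\Table1.dat'
WITH (
    FIELDTERMINATOR = '|',
    ROWTERMINATOR   = '\n',
    TABLOCK,
    BATCHSIZE = 10000
)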

Aaron Alton
A: 

You may want to do some tests to determine where the slowdown is, as Chris W is suggesting.

For example, do the query and dump the information to a file, and time it.

Time just running the query, ignoring any data you get back, to see how long the transfer itself takes.

Then you know the time to transfer and what would happen if you take your database out of the loop.

Then you can determine the best course of action.

You may also want to do lots of individual queries.

So, time how long it takes to transfer from one large table.

Then do 5 queries of large tables and see what happens.

If possible, you may want to see if you can make several simultaneous connections and download the data quickly, dump it into an empty database, then pay the price for indexes by copying from your local copy to your database.

But, all of this is pointless until you have some numbers to see how long the operations are taking.
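
For example, a rough way to split the timing using the linked server from the question (the date literal is just a placeholder, and the temp table keeps local indexes out of the first measurement):

SET STATISTICS TIME ON

-- Measure the pull on its own: land the rows in an unindexed local temp table.
SELECT *
  INTO #Table1_staging
  FROM OPENQUERY(VendorsLinkedServerName,
       'SELECT * FROM Table1 WHERE LastModified >= ''2009-01-01''')

-- Then measure the purely local insert, with the network out of the loop.
INSERT INTO Table1
SELECT * FROM #Table1_staging

SET STATISTICS TIME OFF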

James Black
+1  A: 

Unfortunately I'm guessing that the biggest slowdown is just network latency, and there's not much you can do about that.

I do have one idea, though. Try adding an ORDER BY clause to the SELECT statement that matches the primary key (clustered index) of the target table. That could cut down on the need to re-order the table during inserts.
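
As a sketch only (the key column names are placeholders for whatever Table1's primary key actually is):

-- Sort on the outer SELECT so rows arrive in clustered-key order for the insert.
INSERT INTO Table1
SELECT *
  FROM OPENQUERY(VendorsLinkedServerName,
       'SELECT * FROM Table1 WHERE LastModified >= ''2009-01-01''')
 ORDER BY KeyCol1, KeyCol2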

Also, just how many tables are there, and how much time do you have? If you've done the 10 largest in an hour, you may find there's a 90/10 rule at play such that all the other tables combined still take less time than those 10.

Joel Coehoorn
+1  A: 

Sounds like you're looking for a one-way (download only) synchronisation. For best reliability I would ask the vendor (B) to add a ROWVERSION column; it's a little safer than a DateTime or DateTimeOffset.
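
On the vendor's side that could be as small a change as something like this (a sketch; the column name is a placeholder, and on SQL 2005 the type is still called TIMESTAMP):

-- Hypothetical change on (B): a rowversion/timestamp column the sync queries can anchor on.
-- SQL Server maintains the value automatically on every insert and update.
ALTER TABLE dbo.Table1 ADD Version TIMESTAMP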

As for your query I've done something like the following:

INSERT INTO dbo.Table1
(
    Field1,
    Field2,
    Field3
)
SELECT
    T1.Field1,
    T1.Field2,
    T1.Field3
FROM [LinkedServer].[DatabaseName].[dbo].[Table1] T1
WHERE T1.Version > @LastAnchor

You can skip the whole ETL phase if your schemas are the same by having a CreatedVersion and UpdatedVersion, with Tombstone rows for deletes if needed. Sync Framework simplifies a lot of this for you, although you can use the concepts to bake your own relatively easily. Rules are as follows:

-- get inserts in dependency order
INSERT INTO ...
SELECT ...
FROM ...
WHERE CreatedVersion > @LastAnchor
-- get updates in dependency order
UPDATE [dbo].[Table1]
SET ...
FROM [LinkedServer].[DatabaseName].[dbo].[Table1] T1
WHERE [dbo].[Table1].[PK] = T1.[PK]
    AND T1.CreatedVersion <= @LastAnchor
    AND T1.UpdatedVersion > @LastAnchor
-- get deletes (if you need them)
DELETE T
FROM [dbo].[Table1] T
JOIN [LinkedServer].[DatabaseName].[dbo].[Table1_Tombstone] T1
    ON T.[PK] = T1.[PK]
    AND T1.DeletedVersion > @LastAnchor

For all these queries to perform well the CreatedVersion, UpdatedVersion and DeletedVersion columns should be indexed.
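
For example (a sketch; the index names and exact tables are assumptions about the vendor's schema):

-- Hypothetical supporting indexes on the vendor side for the anchor-range scans.
CREATE INDEX IX_Table1_CreatedVersion ON dbo.Table1 (CreatedVersion)
CREATE INDEX IX_Table1_UpdatedVersion ON dbo.Table1 (UpdatedVersion)
CREATE INDEX IX_Table1_Tombstone_DeletedVersion ON dbo.Table1_Tombstone (DeletedVersion)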

All the above logic works for DateTime or RowVersion; it's just that RowVersion is more accurate, and there are some issues around in-progress transactions that SQL 2005 SP2 solves. Basically, in SQL 2005 SP2 and SQL 2008 you set your max anchor to MIN_ACTIVE_ROWVERSION() - 1 and query for everything in between. Look up the MSDN notes for more information on why.
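
A minimal sketch of the anchor bookkeeping (the ProcessControl-style table and column names are placeholders, and the version value is handled as BIGINT to keep the arithmetic simple):

DECLARE @MaxAnchor BIGINT

-- Everything at or below this value is committed, so in-flight transactions
-- can't cause rows to be skipped between syncs.
SELECT @MaxAnchor = CONVERT(BIGINT, MIN_ACTIVE_ROWVERSION()) - 1

-- ... run the insert/update/delete queries above, bounding them
--     with @LastAnchor and @MaxAnchor ...

-- Remember the high-water mark for the next run.
UPDATE ProcessControl
   SET LastAnchor = @MaxAnchor
 WHERE TableName  = 'Table1'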

Unlike some, I would strongly recommend against making the UpdatedVersion a clustered index, as that will involve constant re-sorting of the data on the pages when updates are done; if you recommend that to the vendor you will look like a fool.

One advantage of using Sync Framework is that you can use WCF to do your data calls and do smaller syncs at regular intervals instead of one massive sync at the end of each day. This would require the vendor to provide, or at least host, the WCF service that provides database access. If you prefer, you can still use a linked server with Sync Framework while still supporting smaller, more frequent syncs.

Timothy Walters