I need to quickly implement a read-only database containing data pulled from two identically structured live databases.

The live dbs are actually company dbs from a Dynamics accounting system so I'm happy for any Dynamics specific advice but this is mostly a SQL question. It's a fairly old version of Dynamics from before Great Plains was acquired by Microsoft. This is on SQL Server 2000.

We have reports and applications which access the Dynamics data. These apps are designed to look at one company db. Now we need to add another. It's appropriate that most of these reports and apps see combined data. They don't really care which company an order or invoice exists in. They only look at a small number of the tables.

It seems to me that the simplest solution is to create a reports only db with combined data. Preferably, we need an efficient way to update this db with changes several times a day.

I'm a developer, not a db expert but here's my plan:

Create the combined reporting db with the required tables initially with the same table structure as the live dbs.

All Dynamics tables seem to have an int identity column called DEX_ROW_ID. I'm not sure what it's used for (it's not indexed), but it seems like the obvious generic way to uniquely identify rows. On the reporting db I will change it to a normal int (not an identity). I will create a unique index on DEX_ROW_ID in all dbs.

Dynamics does not have timestamps so I will add a timestamp column to tables in the live dbs and a corresponding binary(8) column in the reporting db. I'm assuming and hoping that Dynamics won't be upset by the additional index and column.
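The live-side changes described above could be sketched roughly as follows. This assumes a table named MyTable in Live1, and the index name is made up for illustration. It does not show whether Dynamics tolerates the extra column. Statements that depend on the exact column count, such as an INSERT without a column list, may be affected, so a test copy is the safer place to try it first.

```sql
USE Live1
GO

-- A timestamp column is maintained by SQL Server itself: its value changes
-- every time the row is inserted or updated. Only one is allowed per table.
ALTER TABLE dbo.MyTable ADD TS timestamp
GO

-- Hypothetical index name. DEX_ROW_ID is an identity column, so its values
-- should already be unique within one company db.
CREATE UNIQUE INDEX IX_MyTable_DEX_ROW_ID ON dbo.MyTable (DEX_ROW_ID)
GO
```

The same two statements would then be repeated in Live2, and for every table that is copied to the reporting db.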

Add an int CompanyId column to the reporting db tables and add it to the end of any unique indexes. Most data will be naturally unique even without that, i.e., order and invoice numbers etc. will be different for the two live dbs. We may need to make some minor changes to the applications but I'm not expecting to do much other than point them to the new reporting db.
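A reporting-side table matching this plan might look like the sketch below. DocNumber and both index names are hypothetical stand-ins for the real Dynamics columns and indexes. CompanyId is placed first so that a SELECT of the form `1 AS CompanyId, *` from the live table lines up column for column.

```sql
USE Reports
GO

CREATE TABLE dbo.MyTable
(
    CompanyId   int        NOT NULL,  -- 1 = Live1, 2 = Live2
    DocNumber   char(21)   NOT NULL,  -- hypothetical business column
    DEX_ROW_ID  int        NOT NULL,  -- plain int here, not an identity
    TS          binary(8)  NOT NULL   -- receives the live timestamp value
)
GO

-- DEX_ROW_ID values from the two companies can overlap, so on the reporting
-- side it is only unique together with CompanyId.
CREATE UNIQUE INDEX IX_MyTable_Company_RowId
    ON dbo.MyTable (CompanyId, DEX_ROW_ID)
GO

-- An existing unique index with CompanyId appended at the end, as planned.
CREATE UNIQUE INDEX IX_MyTable_DocNumber
    ON dbo.MyTable (DocNumber, CompanyId)
GO
```

Putting CompanyId first in the DEX_ROW_ID index is a judgment call here: it suits lookups that always filter on one company, such as the MAX(DEX_ROW_ID) query in the update script.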

Assuming my reporting db is called Reports, the live dbs are Live1 and Live2, the timestamp column is called TS and all dbs are on the same server ... here's my first attempt at an update script for copying the changes in one table called MyTable in Live1 to the reporting db.

USE Reports

CREATE TABLE #Changes
(
ReportId int,
LiveId int
)

/* Collect in a temp table the ids of rows which have been deleted or changed
in the live db. L.DEX_ROW_ID will be null if the row has been deleted. */

INSERT INTO #Changes
SELECT R.DEX_ROW_ID, L.DEX_ROW_ID
FROM MyTable R LEFT OUTER JOIN Live1.dbo.MyTable L ON L.DEX_ROW_ID = R.DEX_ROW_ID
WHERE R.CompanyId = 1 AND (L.DEX_ROW_ID IS NULL OR L.TS <> R.TS)

/* Delete rows that have been deleted or changed on the live db 
I wonder if using join syntax would run better than the subquery. */
DELETE FROM MyTable
WHERE CompanyId = 1 AND DEX_ROW_ID IN (SELECT ReportId FROM #Changes)

/* Recopy rows that have changed in the live db */
INSERT INTO MyTable
SELECT 1 AS CompanyId, * FROM Live1.dbo.MyTable L
WHERE L.DEX_ROW_ID IN (SELECT ReportId FROM #Changes WHERE LiveId IS NOT NULL)

/* Copy the rows that are new in the live db */
INSERT INTO MyTable
SELECT 1 AS CompanyId, * FROM Live1.dbo.MyTable
WHERE DEX_ROW_ID > ISNULL((SELECT MAX(DEX_ROW_ID) FROM MyTable WHERE CompanyId = 1), 0)

Then do the same for the Live2 db. Repeat for every table in Reports. I know I should use a parameter @CompanyId instead of the literal but I can't do that for the live db name, so I might generate these dynamically with a C# program or something.
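One way to parameterize the database name without an external generator is dynamic SQL, sketched here for the "copy new rows" step only. The database name has to be concatenated into the statement text, while the company id can be passed as a real parameter through sp_executesql. The variable names are illustrative.

```sql
USE Reports
GO

DECLARE @LiveDb sysname, @CompanyId int, @Sql nvarchar(4000)
SET @LiveDb = N'Live1'
SET @CompanyId = 1

-- QUOTENAME brackets the database name so that it is always treated
-- as an identifier inside the generated statement.
SET @Sql = N'INSERT INTO dbo.MyTable
SELECT @CompanyId AS CompanyId, L.*
FROM ' + QUOTENAME(@LiveDb) + N'.dbo.MyTable L
WHERE L.DEX_ROW_ID > ISNULL((SELECT MAX(R.DEX_ROW_ID)
                             FROM dbo.MyTable R
                             WHERE R.CompanyId = @CompanyId), 0)'

EXEC sp_executesql @Sql, N'@CompanyId int', @CompanyId = @CompanyId
```

A temp table such as #Changes created before the call is still visible inside the dynamic statement, so the other steps could be built the same way.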

I'm looking for any advice, suggestions or critique on what I'm doing here. I know it won't be atomic. Things could be happening on the live db while this script runs. I think we can live with that. We'll probably do a full copy either nightly or weekly when nothing is happening on the live dbs.
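The periodic full copy mentioned above could be as simple as the following sketch for one table and one company. Wrapping the two statements in a transaction is an assumption added here, not part of the original plan. It keeps the table from being left empty for that company if the insert fails part way.

```sql
USE Reports
GO

-- With XACT_ABORT ON, a run-time error rolls back the whole transaction.
SET XACT_ABORT ON

BEGIN TRANSACTION

-- TRUNCATE TABLE would remove both companies, so delete per company.
DELETE FROM dbo.MyTable WHERE CompanyId = 1

INSERT INTO dbo.MyTable
SELECT 1 AS CompanyId, * FROM Live1.dbo.MyTable

COMMIT TRANSACTION
```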

We need to favor performance over elegance or perfection. Some initial testing has the first query with the TS comparisons running at about 30 seconds for the biggest table so I'm optimistic that this is going to work but I'd also like to know if I'm missing something obvious or not seeing the forest for the trees.
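On the question raised in the script comment, the delete step can be written with join syntax as sketched below. It assumes the #Changes temp table from the script already exists in the session. Whether it actually runs faster than the IN subquery depends on the plan the optimizer picks, so comparing both with SET STATISTICS IO ON is more reliable than guessing.

```sql
-- Join form of the delete step; R is the reporting table alias.
DELETE R
FROM dbo.MyTable R
INNER JOIN #Changes C ON C.ReportId = R.DEX_ROW_ID
WHERE R.CompanyId = 1
```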

We don't really want to deal with log files on the reporting db. Can we just set that to simple recovery model and forget about logs?
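Switching the recovery model is a one-line change, shown here with a check afterwards. Note that the log is still written under simple recovery. Its space is reused after checkpoints, but a single large transaction can still make the file grow.

```sql
ALTER DATABASE Reports SET RECOVERY SIMPLE
GO

-- Confirm the current setting; this should report SIMPLE.
SELECT DATABASEPROPERTYEX('Reports', 'Recovery') AS RecoveryModel
GO
```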

Thanks

A: 

The last thing I'd want to do is write a custom update script. Try these bulletproof methods first:

  1. Let's hope your production databases are backed up. Restore those backups every night to the reporting server. You can automate restores with the RESTORE command, which will work with a file on a network server.
  2. Use SQL Server replication to push data from the live servers to the backend.
  3. Schedule a DTS package every night to import the entire production database.

This might seem like brute force. But since you're copying a 2000-era database, brute force cannot be a problem with today's hardware. As an added advantage, these methods can be supported by a sysadmin instead of a developer.

Method 1 has the added advantage of serving as backup verification. :)
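As a rough illustration of method 1, a nightly restore of one production backup onto the reporting server might look like this. The share path, target file paths and logical file names are all hypothetical. RESTORE FILELISTONLY run against the same backup file lists the real logical names.

```sql
-- Restore last night's backup of Live1 as ProdDb1 on the reporting server.
RESTORE DATABASE ProdDb1
FROM DISK = N'\\backupserver\backups\Live1.bak'       -- hypothetical path
WITH MOVE N'Live1_Data' TO N'D:\SQLData\ProdDb1.mdf', -- hypothetical names
     MOVE N'Live1_Log'  TO N'D:\SQLData\ProdDb1.ldf',
     REPLACE                                          -- overwrite old copy
```

The statement can be scheduled as a SQL Server Agent job step, with a second one for the other company db.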

Andomar
I think Backup and Restore will not work... as I understand the question, he has TWO production dbs and wants the data from both of them in ONE reporting db. If it were only one production db, I would have suggested log shipping (that's what we do to "maintain" a reporting db).
haarrrgh
Yeah but you can restore both production databases to a single reporting server. Call them ProdDb1 and ProdDb2. You can then combine them in an intermediate step (the Transform step of the Extract-Transform-Load method of business intelligence.)
Andomar
Andomar, I think the ETL step is really what I'm implementing here. I'm not sure that duplicating the live dbs first really helps much.
tetranz
As far as I know, ETL is 3 steps, and the first step is just importing the data in whatever format you can get it :-)
Andomar
+1  A: 

I think there are a couple open questions here.

  1. Do you need these reports to be near-real-time? Or is this the sort of reporting that could live with daily updates? But assume you need up-to-the-minute data.

  2. Have you considered querying the databases directly and merging the data per-report on the fly? You'll have to do a lot of reporting to duplicate the effort that's going to go into designing, creating, and supporting a real-time merged replicated database.

  3. Thirty seconds is (IMHO) unacceptable for any single query against a production database. There could be any number of tuning-related reasons for taking this long, but it at least means you're going to need serious professional SQL Server optimization resources (i.e. people). And if this is a problem for the queries for reports, it doesn't bode well for the queries to maintain a separate database for reporting.

  4. Tuck into the back of your mind the consideration that, if you need to consolidate to a single database, it's worth considering whether you should make it an OLAP database rather than a mirror. The mirror will be quicker and easier, but the OLAP would be far more flexible and powerful in the long term; and it might be well to go the whole way from the beginning.
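Point 2 above, merging on the fly, does not have to mean rewriting every report query. One possible sketch is a view in the reporting db that unions the two live tables, so existing queries only need a different database name. It assumes all three dbs are on the same server and that no table called MyTable already exists in Reports. It is an alternative to copying the data, not an addition to it.

```sql
USE Reports
GO

CREATE VIEW dbo.MyTable
AS
-- CompanyId tells the two sources apart; UNION ALL skips the
-- duplicate-removal pass that a plain UNION would add.
SELECT 1 AS CompanyId, L1.* FROM Live1.dbo.MyTable L1
UNION ALL
SELECT 2 AS CompanyId, L2.* FROM Live2.dbo.MyTable L2
GO
```

Every query then runs against the live dbs, so this trades the maintenance of an update script for extra load on production. The column list behind `*` is fixed when the view is created, so sp_refreshview would be needed after a schema change.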

le dorfier
le dorfier, 1) No, they don't really need to be real time. For most of the data, we could live with a daily update although more frequent would be nice. For the exceptions to that, we could probably do something special. It just seemed that with the right indexes etc., the update process will be fairly quick so there seemed to be no reason not to run it several times during the working day.
tetranz
And your 2). Yes, we've thought about that. Me calling it a "Reporting" db doesn't tell the whole story. A few in-house applications also need to query it. They use a mix of stored procs, embedded SQL etc. Those queries join to tables in the production Dynamics db. Modifying all those to use a UNION to merge data from both dbs is not an attractive thought. Knowing what I do about the data and the queries, I think most will work fine unchanged if they query a db with merged data, so if we go with the merged db the only change needed would be the db name.
tetranz
3). Thanks for all your thoughts. Well ... that query is doing about a million index seeks and timestamp comparisons. I'd be surprised if much could be done to speed it up on the current hardware. That is the biggest table. I can probably speed it up by making use of the fact that only recently added rows are likely to change, allowing an additional criterion to be put in the WHERE clause. 4) Your point about OLAP is acknowledged, thanks. For several reasons, it's not really an option in the immediate future.
tetranz
re: #1 For accounting systems, my experience is that daily reporting is a useful maximum granularity. Users like to be able to run the same report more than once a day, but get the same data. And if you run several summary reports, it's all based on the same underlying transactions. Otherwise two sequential summaries that don't agree start to worry people.
le dorfier
If you're interested in continuing off-line, my gmail id is ledorfier
le dorfier