views:

18

answers:

1

We have an SQL Server that gets daily imports of data files from clients. This data is interrelated and we are always scrubbing it and having to look for suspect duplicate records between these files.

Finding and tagging suspect records can get pretty complicated. We use logic that requires some field values to be the same, allows some field values to differ, and allows a range to be specified for how different certain field values can be. The only way we've found to do it is by using a cursor based process, and it places a heavy burden on the database.

So I wanted to ask if there's a more efficient way to do this. I've heard it said that there's almost always a more efficient way to replace cursors with clever JOINS. But I have to admit I'm having a lot of trouble with this one.

For a concrete example suppose we have 1 table, an "orders" table, with the following 6 fields.

order_id, customer_id product_id, quantity, sale_date, price

We want to look through the records to find suspect duplicates on the following example criteria. These get increasingly harder. 1. Records that have the same product_id, sale_date, and quantity but different customer_id's should be marked as suspect duplicates for review. 2. Records that have the same customer_id, product_id, quantity and have sale_dates within five days of each other should be marked as suspect duplicates for review 3. Records that have the same customer_id, product_id, but different quantities within 20 units, and sales dates within five days of each other should be considered suspect.

Is it possible to satisfy each one of these criteria with a single SQL Query that uses JOINS? Is this the most efficient way to do this?

A: 

If this gets much more involved, then you might be looking at a simple ETL process to do the heavy carrying for you: the load to the database should be manageable in the sense that you will be loading to your ETL environment, running tranformations/checks/comparisons and then writing your results to perhaps a staging table that outputs the stats you need. It sounds like a lot of work, but once it is setup, tweaking it is no great pain.

On the other hand, if you are looking at comparing vast amounts of data, then that might entail significant network traffic.

davek
We're looking for single SQL Statements that would run on the Server once all the data is up there in tables (temporary or otherwise).
Glenn