views:

178

answers:

2

I have a set of rows that contain duplicate entries because the data originates from multiples sources. I also have a separate reference table that indicates the priority of those data sources.

Does anyone have good tips for the most effective t-SQL to deduplicate this list?

Basically I have:

SELECT a.*, b.priority 
FROM tableA as a
JOIN tableB as b
ON b.sourceId = a.sourceId

I have been placing this into a temp table and then deleting in an odd way that I suppose could be more efficient.

DELETE ta
FROM #tmp ta
JOIN #tmp tb
ON ta.duplicateId = tb.duplicateId
WHERE ta.priority < tb.priority

Thanks for the help!

I think I see where you are going... Table A has the same columns for the two sources, but the data can differ - so they may have different prices. The challenge is that I must take the price (and all other info) from the row that comes from the source with the highest priority. To complicate matters, I do not have data for every item from ALL sources.

So, item 1 may have data from source A and B, while item 2 may only have it from source B and C. Thus, the deletion needs to happen on a per-unique item basis.

+1  A: 

I think you could do something like this:

SELECT a.*, b.priority 
FROM tableA as a
JOIN tableB as b
ON b.sourceId = a.sourceId and b.priority = (select max(priority) from tableB where b.sourceId = a.sourceId)

I can't remember if tSql will have a in scope for the subquery or not though.

Jon
It does have scope at that point - Can you explain to me how that nested select is joining to properly get the right priority?
jkelley
Sure so this query will run for every row of tableA so the result will be the highest for the sourceId of the row it's being run on it. It's like running a function on every row that's returned.
Jon
A: 

Try using ROW_NUMBER to find the ones you want to keep, and then getting rid of the rest.

...and remember to end the previous statement with a semi-colon...

with t as (
SELECT a.*, row_number() over (partition by a.sourceid order by b.priority desc) as priorityorder
FROM tableA as a
JOIN tableB as b
ON b.sourceId = a.sourceId
)
--select * from t
delete t 
where priorityorder > 1;

Rob

Rob Farley