How do I append only distinct records from a master table to another table, when the master may have duplicates. Example - I only want the distinct records in the smaller table but I need to insert/append records to what I already have in the smaller table.
You could use the distinct
keyword to filter out duplicates:
insert into AnotherTable
(col1, col2, col3)
select distinct col1, col2, col3
from MasterTable
Based on Microsoft SQL Server and its Transact-SQL. Untested as always and the target_table has the same amount of rows as the source table (otherwise use columnnames between INSERT INTO
and SELECT
INSERT INTO target_table
SELECT DISTINCT row1, row2
FROM source_table
WHERE NOT EXISTS(
SELECT row1, row2
FROM target_table)
Something like this would work for SQL Server (you don't mention what RDBMS you're using):
INSERT INTO table (col1, col2, col3)
SELECT DISTINCT t2.a, t2.b, t2.c
FROM table2 AS t2
WHERE NOT EXISTS (
SELECT 1
FROM table
WHERE table.col1 = t2.a AND table.col2 = t2.b AND table.col3 = t2.c
)
Tune where appropriate, depending on exactly what defines "distinctness" for your table.
Ignoring any concurency issues:
insert into smaller (field, ... )
select distinct field, ... from bigger
except
select field, ... from smaller;
You can also rephrase it as a join:
insert into smaller (field, ... )
select distinct b.field, ...
from bigger b
left join smaller s on s.key = b.key
where s.key is NULL
If you don't like NOT EXISTS
and EXCEPT
/MINUS
(cute, Remus!), you have also LEFT JOIN
solution:
INSERT INTO smaller(a,b)
SELECT DISTINCT master.a, master.b FROM master
LEFT JOIN smaller ON smaller.a=master.a AND smaller.b=master.b
WHERE smaller.pkey IS NULL
You don't say the scale of the problem so I'll mention something I recently helped a friend with.
He works for an insurance company that provides supplemental Dental and Vision benefits management for other insurance companies. When they get a new client they also get a new database that can have 10's of millions of records. They wanted to identify all possible dupes with the data they already had in a master database of 100's of millions of records.
The solution we came up with was to identify two distinct combinations of field values (normalized in various ways) that would indicate a high probability of a dupe. We then created a new table containing MD5 hashes of the combos plus the id of the master record they applied to. The MD5 columns were indexed. All new records would have their combo hashes computed and if either of them had a collision with the master the new record would be kicked out to an exceptions file for some human to deal with it.
The speed of this surprised the hell out of us (in a nice way) and it has had a very acceptable false-positive rate.