views:

225

answers:

6

How do I append only distinct records from a master table to another table, when the master may have duplicates. Example - I only want the distinct records in the smaller table but I need to insert/append records to what I already have in the smaller table.

A: 

You could use the distinct keyword to filter out duplicates:

insert into AnotherTable
(col1, col2, col3)
select distinct col1, col2, col3
from MasterTable
Andomar
A: 

Based on Microsoft SQL Server and its Transact-SQL. Untested as always and the target_table has the same amount of rows as the source table (otherwise use columnnames between INSERT INTO and SELECT

INSERT INTO target_table
  SELECT DISTINCT row1, row2
  FROM source_table
  WHERE NOT EXISTS(
    SELECT row1, row2
    FROM target_table)
Scoregraphic
A: 

Something like this would work for SQL Server (you don't mention what RDBMS you're using):

INSERT INTO table (col1, col2, col3)
    SELECT DISTINCT t2.a, t2.b, t2.c 
       FROM table2 AS t2
       WHERE NOT EXISTS (
          SELECT 1 
             FROM table 
             WHERE table.col1 = t2.a AND table.col2 = t2.b AND table.col3 = t2.c
       )

Tune where appropriate, depending on exactly what defines "distinctness" for your table.

Chris J
+2  A: 

Ignoring any concurency issues:

insert into smaller (field, ... )
select distinct field, ... from bigger
except
select field, ... from smaller;

You can also rephrase it as a join:

insert into smaller (field, ... )
select distinct b.field, ... 
from bigger b
left join smaller s on s.key = b.key
where s.key is NULL
Remus Rusanu
+1  A: 

If you don't like NOT EXISTS and EXCEPT/MINUS (cute, Remus!), you have also LEFT JOIN solution:

INSERT INTO smaller(a,b)
SELECT DISTINCT master.a, master.b FROM master
LEFT JOIN smaller ON smaller.a=master.a AND smaller.b=master.b
WHERE smaller.pkey IS NULL
filiprem
+1  A: 

You don't say the scale of the problem so I'll mention something I recently helped a friend with.

He works for an insurance company that provides supplemental Dental and Vision benefits management for other insurance companies. When they get a new client they also get a new database that can have 10's of millions of records. They wanted to identify all possible dupes with the data they already had in a master database of 100's of millions of records.

The solution we came up with was to identify two distinct combinations of field values (normalized in various ways) that would indicate a high probability of a dupe. We then created a new table containing MD5 hashes of the combos plus the id of the master record they applied to. The MD5 columns were indexed. All new records would have their combo hashes computed and if either of them had a collision with the master the new record would be kicked out to an exceptions file for some human to deal with it.

The speed of this surprised the hell out of us (in a nice way) and it has had a very acceptable false-positive rate.

Peter Rowell
Wow, never thought about it this way!very neat
Rippo