tags:

views:

49

answers:

1

I have the unfortunate luck of having to deal with a db that contains duplicates of particular records, I am looking for a quick way to say "get the most populated record and update the duplicates to match it".

From there I can select distinct records and get a useful set of records.

Any ideas?

It's mainly names and addresses if that helps...

Ok lots of questions asked here so i'll add little bit more:

Firstly I want to pull the most "populated" not most "popular", this means the row with the most values that are not null.

Once I have the set (which is easy because in my case the id's match) I can then populate the missing values in the other rows.

I don't want to destroy data and i only intend to update data based on an accurate match (eg by id).

My problem at the moment is figuring out which of a set of rows has the most populated fields, having said that since posting this question I have found a different way to solve my bigger problem which is what to send to a remote server however I'm still interested to know what the solution to this might be.

Sample data might look something like this ...

id   name     addr1            addr2       ect
1    fred     1 the street     Some town   ...
1    fred     null             null        null

Given a table full of matching pairs like this I want to find the pairs then grab the one with the info in it and insert those values where there is a null in the other row.

A: 

Keep in mind that you will be potentially destroying data here. Just because a row has fewer columns filled doesn't mean that it's less accurate in the columns that are filled.

I've assumed that duplicates are determined by a column called "name". You'll need to adjust based on your definition of duplicates. Also, since you didn't give any rules on how to deal with ties for "most populated" I just chose the row with the lowest id.

UPDATE
    T1
SET
    col_1 = T2.col_1,
    col_2 = T2.col_2,
    ....
FROM
    My_Table T1
INNER JOIN My_Table T2 ON
    T2.name = T1.name AND
    T2.id =
    (
        SELECT TOP 1
            T3.id
        FROM
            My_Table T3
        WHERE
            T3.name = T1.name
        ORDER BY
            CASE WHEN col_1 IS NOT NULL THEN 1 ELSE 0 END +
            CASE WHEN col_2 IS NOT NULL THEN 1 ELSE 0 END +
            ... DESC,
            id ASC
    )

EDIT: I just reread your question and you mention, "From there I can select distinct records and get a useful set of records." If that's what you really want, then don't bother updating the other rows, just select the ones that you want in the first place and leave everything else intact:

SELECT
    T1.id,
    T1.name,
    T1.col_1,
    T1.col_2,
    ...
FROM
    My_Table T1
WHERE
    T1.id =
    (
        SELECT TOP 1
            T2.id
        FROM
            My_Table T2
        WHERE
            T2.name = T1.name
        ORDER BY
            CASE WHEN T2.col_1 IS NOT NULL THEN 1 ELSE 0 END +
            CASE WHEN T2.col_2 IS NOT NULL THEN 1 ELSE 0 END +
            ... DESC,
            T2.id ASC
    )
Tom H.
The problem here of course is that this may not filter out the right results, but I see the logic behind your thinking.
Wardy