tags:

views:

763

answers:

4

I have a table with some repeated information: Id, Name, Lastname, Birth, PersonalKey, Personal Info, Direction, Source.

Where source tells me where the information came from.

The repeated information has unique id, and I need to erase the duped information. But, I have priority over some Source information that i need to be the one that stays and the other erased.

Other thing is that another Source information have some information that the one that i want to stay doesnt have so i need to refill the PersonalKey to the one thats going to stay and erase the repeated ones.

Table named Pruebas

---Id, Name, Firstname, Lastname, Birth, RFC, Source, PersonalKey---
---2,Juan,Garcia,Escobeddo,1983-08-04,GAED87393, DRV484930, 34233--
---3,Juan,Garcia,Escobedo,1987-08-04,GAED87393, FIN484930, --
---4,Juan,Garcia,Escobedo,1987-08-04,GAED87393, SA484930, --

As you see:

  • The IDs are unique
  • The name, firstname and lastname are repeated
  • The id 2 has a PersonalKey value, but 3 and 4 don't
    • I want the one with the 'FIN%' source to stay and the other ones erased, but first I need to make sure the row that remains gets the PersonalKey value (IOW, I don't want to lose the PersonalKey value).

Thanks in advance.

+1  A: 

I would run a cursor (with MySQL SP programming language, Java, Python, .NET) on this query:

select Name, Firstname, Lastname, count(1)
  from Pruebas
 group by Name, Firstname, Lastname
having count(1) > 1

Then, on the returned rows from the cursor, just do whatever you need to: check for the FIN% instance, check for PersonalKey's presence, and update accordingly.

For each row on the cursor, you can open a different cursor with:

select *
  from Pruebas
 where Name = the_Name
   and Firstname = the_Firstname
   and Lastname = the_Lastname

And now, you will have a inner cursor with all the rows you will modify. If it is the one you need, keep it and update it with the KEY value you mentioned. Otherwise, delete it.

In Oracle, you could accomplish what you want in one query, but I don't think that way you'll get the same performance you would with this approach.

Hope it helps.

Pablo Santa Cruz
All right, i need some more orientation.. I dont know if im correct, when i do the Cursor with SP(Wich i dont know but im reading about that), based on that query, how im gonna know wich id are the ones that count told me are repeated? and how can i work with the buble of repeated of a name?
Granger
Check the post again. I slightly edited it.
Pablo Santa Cruz
+1  A: 

I would do something like this:

CREATE TABLE Pruebas_new
SELECT * FROM Pruebas
GROUP BY name, firstname, lastname
having Source like 'FIN%';

It could be rewritten with a temporary table and overwriting what's in the original table if you need it to be faster, but that gets the data you need in the simplest way.

Adam Nelson
So the way i can do this is make a clone of my Prueba data base lets say Prueba2, and then run the query from prueba to prueba2?I dont understand the logic of the query, sorry.
Granger
i think this would create a table where the Source has FIN in it. Then you could just copy it back into the original Prueba table.
jimiyash
+1  A: 

The most straightforward solution I can think of is to copy the PersonalKey to other duplicate rows, and then delete all rows that don't match 'FIN%'.

UPDATE Pruebas p1 JOIN Pruebas p2
 ON (SOUNDEX(CONCAT(p1.Name, p2.Firstname, p3.Lastname)) 
   = SOUNDEX(CONCAT(p2.Name, p2.Firstname, p2.Lastname)))
SET p1.PersonalKey = p2.PersonalKey
WHERE p2.PersonalKey IS NOT NULL;

DELETE FROM Pruebas WHERE Source NOT LIKE 'FIN%';

I'm showing an approximate match expression for the join, using SOUNDEX().


I see from other comments you have left that you have a lot of variation and uncertainty. In this case, there's no way to automate the cleanup and de-duplication -- or at least automatic cleanup will be more complex and harder to get right than just doing it manually.


Re your comment that the query takes many hours: yes, it's really not expected to be efficient. The JOIN expression is not sargable -- that is, it cannot take advantage of an index. You could make it more efficient by adding an extra column to physically store the SOUNDEX() value of the name,firstname,lastname. Then create an index on that column.

But SOUNDEX() isn't guaranteed to find all possible misspellings anyway. You are facing a data cleanup task that cannot be automated fully. Any solution to data cleanup requires manual work.

Bill Karwin
Im sorry Bill Karwin, your answer look great, but i have a doubt, what about if some Complete names are homonyms? thats why i was wondering, the comparision with Name, Firstname, Lastname and also Birth.Thanks in advance!
Granger
I understand, but you said you have misspelled names in some cases. You need some way to match them up with approximate expressions. SOUNDEX() is a built-in function in MySQL.
Bill Karwin
The only other option is to manually clean up the names, and then you can use a simple equality comparison instead of approximation.
Bill Karwin
No, i think your solution could be good but im not speaking about the misspelled names, i just want the comparision also with Birth that its the same as the dup ones.That way the homonyms are going away with the Birth.Anyway you imagine?
Granger
It's certainly possible. You know the state of your data better than I do. One strategy is to write SQL to automate de-duplication of the easy cases, and then the remaining cases that require manual fixes will be fewer.
Bill Karwin
Sorry for the delay, but its been 16 hours from the query i made exactly like you and still is working.. Any idea on how to improve querys to big DB's? Or how to check the time or check if the query goes the right way?
Granger
A: 

Sorry for the delay in an answer. I was a bit busy the past couple of days.

Below is my answer based on the following assumptions:

1) You will be cleaning up name spelling issues via some other mechanism (you mentioned that you would clean it up using regular expressions in a comment to your original question).

2) A DUP set can be identified using Firstname, Lastname, and Birth (you mentioned this in a comment to your original question).

3) Firstname, Lastname, and Birth can not be NULL.

4) You cannot have more than one FIN record in a DUP set (you mentioned this in a comment to your original question).

If any of the above assumptions are not valid then my answer will have to be modified.

The following are the steps to take:

1) Update all FIN records to copy over the PersonalKey from the non-FIN record:

    UPDATE Pruebas p1
INNER JOIN Pruebas p2
        ON p1.Firstname = p2.Firstname
       AND p1.Lastname = p2.Lastname
       AND p1.Birth = p2.Birth
       SET p1.PersonalKey = p2.PersonalKey
     WHERE p1.Source like 'FIN%'
       AND p1.PersonalKey is null
       AND p2.PersonalKey is not null;

2) Delete all non-FIN records where we have a FIN record:

    DELETE p2
      FROM Pruebas p1
INNER JOIN Pruebas p2
        ON p1.Firstname = p2.Firstname
       AND p1.Lastname = p2.Lastname
       AND p1.Birth = p2.Birth
     WHERE p1.Source like 'FIN%'
       AND p2.Source not like 'FIN%';

At this point all the DUPs with a FIN record have been cleaned so that only the FIN record is left.

3) If we decide to stay with the DRV record for all other cases. We will need to copy the PersonalKey from another record to the DRV record:

    UPDATE Pruebas p1
INNER JOIN Pruebas p2
        ON p1.Firstname = p2.Firstname
       AND p1.Lastname = p2.Lastname
       AND p1.Birth = p2.Birth
       SET p1.PersonalKey = p2.PersonalKey
     WHERE p1.Source like 'DRV%'
       AND p1.PersonalKey is null
       AND p2.PersonalKey is not null;

4) Delete all non-DRV records where we have a DRV record:

    DELETE p2
      FROM Pruebas p1
INNER JOIN Pruebas p2
        ON p1.Firstname = p2.Firstname
       AND p1.Lastname = p2.Lastname
       AND p1.Birth = p2.Birth
     WHERE p1.Source like 'DRV%'
       AND p2.Source not like 'DRV%';

At this point all the DUPs with a DRV record have been cleaned so that only the DRV record is left.

If the only other record type is the SA record, then there should be no more DUPs left and we're done.

5) If we want to choose the record with the most information filled in or if we finished 3 and 4 and there is more than one record type still left causing DUPs. We need to copy the PersonalKey from any record in a DUP set that has it to any record that doesn't have it for all non-FIN records:

    UPDATE Pruebas p1
INNER JOIN Pruebas p2
        ON p1.Firstname = p2.Firstname
       AND p1.Lastname = p2.Lastname
       AND p1.Birth = p2.Birth
       SET p1.PersonalKey = p2.PersonalKey
     WHERE p1.Source not like 'FIN%'
       AND p1.PersonalKey is null
       AND p2.PersonalKey is not null;

6) Delete all records except for the one with the most information (as defined by the info_score computed column):

    DELETE p5
      FROM Pruebas p5
INNER JOIN (SELECT p3.Firstname
                 , p3.Lastname
                 , p3.Birth
                 , MIN(p3.Id) AS min_id
              FROM Pruebas p3
        INNER JOIN (SELECT p1.Firstname
                         , p1.Lastname
                         , p1.Birth
                         , count(*) AS c
                         , MAX((p1.Name is not null) + (p1.RFC is not null) + (p1.Source is not null) + (p1.PersonalKey is not null)) AS info_score
                      FROM Pruebas p1
                  GROUP BY p1.Firstname
                         , p1.Lastname
                         , p1.Birth 
                    HAVING count(*) > 1) p2
                ON p3.Firstname = p2.Firstname
               AND p3.Lastname = p2.Lastname
               AND p3.Birth = p2.Birth
               AND ((p3.Name is not null) + (p3.RFC is not null) + (p3.Source is not null) + (p3.PersonalKey is not null)) = p2.info_score
          GROUP BY p3.Firstname
                 , p3.Lastname
                 , p3.Birth) p4
        ON p4.Firstname = p5.Firstname
       AND p4.Lastname = p5.Lastname
       AND p4.Birth = p5.Birth
       AND p4.min_id <> p5.Id;

At this point all the DUPs have been collapsed with the PersonalKey saved if it was available and FIN records were saved if they existed otherwise either the DRV record or the record with the most information was saved.

Let me know if you have any questions with any of the above.

Hope it helps,

-Dipin

Dipin
Same as to Bill Karwin. Sorry for the delay, but its been 16 hours from the query i made exactly like you and still is working.. Any idea on how to improve querys to big DB's? Or how to check the time or check if the query goes the right way?
Granger
It's *not* the same as Bill's answer. His answer would remove all non-FIN records even if you had no FIN record for that DUP. How big is the table? What indexes do you have on it? My query should be more efficient than Bill's due to the where clause differences. PS, I hope you're running it in Test.
Dipin