views:

290

answers:

4

I have few thousands of records with few 100 fields in a MySQL Table.

Some records are duplicates and are marked as such. Now while I can simply delete the dupes, I want to retain any other possible valuable non-null data which is not present in the original version of the record. Hope I made sense.

For instance :

a b c d e f key dupe
--------------------
1 d c f k l 1   x
2 g   h   j 1    
3 i   h u u 2
4 u r     t 2   x

From the above sample table, the desired output is :

a b c d e f key dupe
--------------------
2 g c h k j 1
3 i r h u u 2

If you look at it closely, the duplicate is determined by using the key (it is the same for 2 records, so the one that has an 'x' for dupe field is the one to be deleted by retaining some of the fields from the dupe (like c, e values for key 1).

Please let me know if you need more info about this puzzling problem.

Thanks a tonne!

p.s : If it is not possible using MySQL, a PERL/Python script sample would be awesome! Thanks!

+1  A: 

Using code you can merge the lines and then delete the dupe.

if you want to stay inside sql code you need to run a stored procedue on every column of the table that update the source row only if it's null and the destination row is not null.

iterating on all the dupes pairs using some kind of cursor.

This is a general Idea, I hope someone here can help you with more specific code...

Dani
Merging sounds like a plan. Python may do it but I am a beginner in Python and am not so sure how to apply Python Merge to my problem. Thanks for the quick reply :)
ThinkCode
+2  A: 

If I understood you properly, you need 1) to merge 2 records 2) delete marked records (it is straightforward).

To accomplish the first task you can do use something like

UPDATE table1 t1 
INNER JOIN table1 t2 ON (t1.key = t2.key AND t2.dupe='x')
SET t1.b= IFNULL(t1.b,t2.b), t1.c=IFNULL(t1.c,t2.c), ...etc
WHERE t1.dupe IS NULL

P.S. This query assumes that any record has 0 or 1 duplicate; if you have more, the query needs to be modified.

a1ex07
The problem is there are few hundred columns. I am trying to avoid an inefficient update statement on all of those. Am hoping a better solution exists. Thank you.
ThinkCode
A: 

The answer very depend on what you actually want to do. Do you want inplace edit duplicate records, or you want create temporary table with desired result. If you want inplace edit records, i'm think you better use python to perform several sequental queryes one like a1ex07 say and then simple 'delete from ... where dupe = 'x' '. If you can use temporary table, you can use one (not very) simple mysql 'INSERT INTO ... SELECT ', merging dupe records with IFNULL or CAOLESCE and actual igonoring records with dupe = 'x'

Alexey Sviridov
Temp table is not a problem. The only problem is I don't want to update all the hundred columns manually.
ThinkCode
ok, if you don't have temporary problem then you may use something like that 'insert into tempTable (x) select coalesce (x1, (select x2 from sourceTable s2 where s2.keu = s1.key where s2.dupe = 'x') from sourceTable s1 where s1.dupe <> 'x'But if if you don't have any limitations, i'm recommend to write simple python script with 2-3 queries.
Alexey Sviridov
A Python script is very much appreciated. I have no limitations!I am trying to come up with a Python script which traverses through a list of dupes and non-dupes by checking if a field is null... Please help :)
ThinkCode
A: 
import string, os, sys
import csv
import MySQLdb
import pickle
from EncryptedFile import EncryptedFile

enc = EncryptedFile( os.getenv("HOME") + '/.py-encrypted-file')
user = enc.getValue("user")
pw = enc.getValue("pw")

db = MySQLdb.connect(host="127.0.0.1", user=user, passwd=pw,db=user)

cursor = db.cursor()
cursor2 = db.cursor()

cursor.execute("select * from delThisTable")
rows = cursor.fetchall()
data = dict()
for row in rows:
    key, primary = row[0], row[1]
    if key not in data:
        data[key] = list(row[:-1])
    else:
        for i in range(len(row)-1):
            if data[key][i] is None or (not primary and row[i] is not None):
                data[key][i] = row[i]

Answer by Messa in this thread...

ThinkCode