views:

546

answers:

4

I have a table where I save all row-changes that have ever occurred. The problem is that in the beginning of the application there was a bug that made a bunch of copies of every row.

The table looks something like this:

copies
|ID |CID |DATA
| 1 | 1  |  DA
| 2 | 2  |  DO
| 2 | 3  |  DO (copy of CID 2)
| 1 | 4  |  DA (copy of CID 1)
| 2 | 5  |  DA
| 1 | 6  |  DA (copy of CID 1)
| 2 | 7  |  DO

CID is UNIQUE in table copies.

What I want is to remove all the duplicates of DATA GROUP BY ID that is after one another sorted by CID.

As you can see in the table, CID 2 and 3 are the same and they are after one another. I would want to remove CID 3. The same with CID 4 and CID 6; they have no ID 1 between them and are copies of CID 1.

After duplicates removal, I would like the table to look like this:

copies
|ID |CID |DATA
| 1 | 1  |  DA
| 2 | 2  |  DO
| 2 | 5  |  DA
| 2 | 7  |  DO

Any suggestions? :)

I think my question was badly asked because the answer everybody seems to think is the best gives this result:

ID   | DATA | DATA | DATA | DATA | DATA |     DATA |        CID          |
                                                   |Expected |  Quassnoi |
1809 |    1 |    0 |    1 |    0 |    0 |     NULL |  252227 |    252227 |
1809 |    1 |    0 |    1 |    1 |    0 |     NULL |  381530 |    381530 |
1809 |    1 |    0 |    1 |    0 |    0 |     NULL |  438158 | (missing) |
1809 |    1 |    0 |    1 |    0 | 1535 | 20090113 |  581418 |    581418 |
1809 |    1 |    1 |    1 |    0 | 1535 | 20090113 |  581421 |    581421 |

CID 252227 AND CID 438158 are duplicates but because CID 381530 comes between them; I want to keep this one. It's only duplicates that are directly after one another when ordering by CID and ID.

+1  A: 

You can use a count in a subquery for this:

delete from copies
where
    (select count(*) from copies s where s.id = copies.id 
                                   and s.data = copies.data 
                                   and s.cid > copies.cid) > 0
Eric
I don't think this works correctly at all. It doesn't take into account whether or not the DATA value between successive rows ordered by ID and CID.
Jonathan Leffler
It actually does. I just mislabeled `data` as `name`. My bad. Fixed!
Eric
+5  A: 
DELETE   c.*
FROM     copies c
JOIN     (
         SELECT  id, data, MIN(copies) AS minc
         FROM    copies
         GROUP BY
                 id, data
         ) q
ON       c.id = q.id
         AND c.data = q.data
         AND c.cid <> q.minc

Update:

DELETE  c.*
FROM    (
        SELECT  cid
        FROM    (
                SELECT  cid,
                        COALESCE(data1 = @data1 AND data2 = @data2, FALSE) AS dup,
                        @data1 := data1,
                        @data2 := data2
                FROM    (
                        SELECT  @data1 := NULL,
                                @data2 := NULL
                        ) vars, copies ci
                ORDER BY
                        id, cid
                ) qi
        WHERE   dup
        ) q
JOIN    copies c
ON      c.cid = q.cid

This solution empoys MySQL session variables.

There is a pure ANSI solution that would use NOT EXISTS, however, it would be slow due to the way MySQL optimizer works (it won't use range access method in a correlated subquery).

See this article in my blog for performance details for quite a close task:

Quassnoi
Thanks for the suggestion but this deletes:| 2 | 7 | DOIts not all rows that are the same that should be deleted just the ones that comes directly after oneanother grouped by ID.
Filip Palm
To do this properly, you also have to have a criterion that there is no row with a different data value and the same ID value and with a CID between the 'current row' and the 'earlier row with the same data value'. That's fiddly, at best.
Jonathan Leffler
Hi Quassnoi, i've tried thisone and it deletes nearly everything i have in the table. How should the query look like if i have more then one DATA columns?
Filip Palm
`@Filip`: see the post update
Quassnoi
Nice, thank you very much Quassnoi. At a first glance it seems to get it right. I will run some tests to verify!
Filip Palm
+1  A: 
// EDITED for @Jonathan Leffler comment
//$sql = "SELECT ID,CID,DATA FROM copies ORDER BY CID, ID";
$sql = "SELECT ID,CID,DATA FROM copies ORDER BY ID, CID";
$result = mysql_query($sql, $link); 
$data = "";
$id = "";
while ($row = mysql_fetch_row($result)){ 
       if (($row[0]!=$id) && ($row[2]!=$data) && ($id!="")){
            $sql2 = "DELETE FROM copies WHERE CID=".$row[1];
            $res = mysql_query($sql2, $link); 
       }
       $id=$row[0];
       $data=$row[2];
}
andres descalzo
Really? You've got three pure SQL approaches here, and you're suggesting using an interpreted language kicking off single `delete` transactions? I sincerely hope that's not how you solve all of your database problems. Databases are about thinking in sets!
Eric
1º is for the consultation is a process at once. 2º there is a PHP tag, so it's okay to include a solution with this language. something else?
andres descalzo
@Eric the task has to be done once for cleanup purposes, so I'd think that speed isn't a primary concern here. That makes it a valid answer, if not the best answer.
quillbreaker
1º? 2º? What are the degree marks for? (I don't mean that to be snide...is that the way that some languages write ordinal numbers?)
Beska
@andres: I think your cursor-based solution is likely to be the best way to do this (of those shown so far) - but I would expect to order the data with 'ORDER BY ID, CID'. See also my comment to the main question.
Jonathan Leffler
@Beska: yes, 1º is equivalent to 'first' in some languages.
Jonathan Leffler
@Jonathan Leffler, yes, by analyzing the code, the correct order would "ID, CID, thanks
andres descalzo
A: 

delete from copies c where c.cid in (select max(cid) as max_cid, count(*) as num from copies where num > 1 group by id, data)

lg
This will only delete the last duplicate, not all duplicates.
Quassnoi
Yes, you are right, but you can run query again until all duplicates are deleted. Your solution is more elegant (I already voted your answer)!
lg