What's the best way to delete duplicate records in a MySQL database using Rails or MySQL queries?

+3  A: 

What you can do is copy the distinct records into a new table by:

 CREATE TABLE NewTable AS SELECT DISTINCT * FROM MyTable;
TStamper
Although this won't catch records which have the same data but different unique IDs?
cliff.meyers
That is why I told him in his comment that "you need a key field (ID) to look for to make the record be distinct".
TStamper
Missed your comment up there :)
cliff.meyers
+1  A: 

This question is about SQL in general, but I hope it helps.

Ólafur Waage
Oh cool, somehow I notice that question when I was searching.
nan
+1  A: 

New to SQL :-) This is a classic question, often asked in interviews :-) I don't know whether it'll work in MySQL, but it works in most databases:

> create table t(
>     a char(2),
>     b char(2),
>     c smallint )

> select a,b,c,count(*) from t
> group by a,b,c
> having count(*) > 1
a  b  c
-- -- ------ -----------
(0 rows affected)

> insert into t values ("aa","bb",1)
(1 row affected)

> insert into t values ("aa","bb",1)
(1 row affected)

> insert into t values ("aa","bc",1)
(1 row affected)

> select a,b,c,count(*) from t group by a,b,c having count(*) > 1
a  b  c 
-- -- ------ -----------
aa bb      1           2
(1 row affected)
The formatting is lost - but I'm pretty sure that any programmer will be able to figure it out.
I've corrected the formatting. Even though you answered how to 'select' the records where nan wanted to 'delete' them, it's much easier to read now, and the answer is related to the underlying problem anyway.
Olaf
A: 

I had to do this recently on Oracle, but the steps would have been the same on MySQL. It was a lot of data, at least compared to what I'm used to working with, so my process to de-dup was comparatively heavyweight. I'm including it here in case someone else comes along with a similar problem.

My duplicate records had different IDs, different updated_at times, possibly different updated_by IDs, but all other columns the same. I wanted to keep the most recently updated of any duplicate set.

I used a combination of Rails logic and SQL to get it done.

Step one: run a rake script to identify the IDs of the duplicate records, using model logic. IDs go in a text file.

Step two: create a temporary table with one column, the IDs to delete, loaded from the text file.

Step three: create another temporary table with all the records I'm going to delete (just in case!).

CREATE TABLE temp_duplicate_models 
  AS (SELECT * FROM models 
  WHERE id IN (SELECT * FROM temp_duplicate_ids));

Step four: actual deleting.

DELETE FROM models WHERE id IN (SELECT * FROM temp_duplicate_ids);
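
For step one, the ID selection could look roughly like the sketch below. It is not the rake script described above, only a minimal plain-Ruby illustration of the same rule (ignore id, updated_at and updated_by when comparing, keep the most recently updated copy). The rows are hashes standing in for model records, and the name and email columns are made up for the example.

```ruby
# Columns that may differ between copies of the same logical record
# (names taken from the description above).
IGNORED_COLUMNS = %i[id updated_at updated_by].freeze

# Returns the IDs of every row that has a more recently updated duplicate.
# `records` is an array of hashes standing in for model rows (an assumption).
def duplicate_ids_to_delete(records)
  records
    .group_by { |row| row.reject { |column, _| IGNORED_COLUMNS.include?(column) } }
    .values
    .select { |group| group.size > 1 }
    .flat_map do |group|
      newest = group.max_by { |row| row[:updated_at] }
      group.reject { |row| row.equal?(newest) }.map { |row| row[:id] }
    end
end

# Hypothetical sample data: rows 1 and 2 are duplicates, row 2 is newer.
rows = [
  { id: 1, name: 'Ann', email: 'ann@example.com', updated_at: Time.utc(2009, 1, 1), updated_by: 7 },
  { id: 2, name: 'Ann', email: 'ann@example.com', updated_at: Time.utc(2009, 2, 1), updated_by: 8 },
  { id: 3, name: 'Bob', email: 'bob@example.com', updated_at: Time.utc(2009, 1, 15), updated_by: 7 }
]

ids_to_delete = duplicate_ids_to_delete(rows)
puts ids_to_delete
```

The resulting list is what would be written to the text file and loaded into temp_duplicate_ids in step two.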
Sarah Mei
+1  A: 

Well, if it's a small table, you can do this from the Rails console:

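class ActiveRecord::Base
  # All attributes except the primary key, used to compare records by content.
  def non_id_attributes
    atts = attributes
    atts.delete('id')
    atts
  end
end

# Group records by content and keep only the groups with more than one member.
duplicate_groups = YourClass.all.group_by { |element| element.non_id_attributes }.values.select { |group| group.size > 1 }
# Keep the first record of each group; everything after it is redundant.
redundant_elements = duplicate_groups.map { |group| group.drop(1) }.flatten
redundant_elements.each(&:destroy)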
Maximiliano Guzman
Exactly what I was looking for! Rails migration finished! Cheers
bjeanes
A: 

You can use:

http://lenniedevilliers.blogspot.com/2008/10/weekly-code-find-duplicates-in-sql.html

to get the duplicates and then just delete them via Ruby code or SQL code (I would do it in SQL code, but that's up to you :-)

Lennie
A: 

Here's another idea in no particular language:

rs = `select a, b, count(*) as c from entries group by 1, 2 having c > 1`
rs.each do |a, b, c|
  `delete from entries where a=#{a} and b=#{b} limit #{c - 1}`
end

Edit:

Kudos to Olaf for that "having" hint :)
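
One thing the pseudocode glosses over is that a and b need quoting before they go into the DELETE. Here is a rough Ruby sketch of just the statement-building part, with no database involved. The quote helper is a hypothetical stand-in written for this example; in a Rails app the connection's own quoting (ActiveRecord::Base.connection.quote) would be the safer choice.

```ruby
# Hypothetical stand-in for the database adapter's quoting. Integers pass
# through; strings get backslashes and single quotes escaped.
def quote(value)
  return value.to_s if value.is_a?(Integer)

  "'" + value.to_s.gsub('\\') { '\\\\' }.gsub("'") { "''" } + "'"
end

# Builds one DELETE per duplicate group, removing all copies but one.
# `groups` holds [a, b, count] rows, shaped like the result of the
# GROUP BY ... HAVING query above.
def delete_statements(groups)
  groups.select { |_, _, count| count > 1 }.map do |a, b, count|
    "DELETE FROM entries WHERE a = #{quote(a)} AND b = #{quote(b)} LIMIT #{count - 1}"
  end
end

statements = delete_statements([['aa', 'bb', 2], ["o'neil", 'x', 3], ['aa', 'bc', 1]])
puts statements
```

Note that DELETE ... LIMIT is MySQL syntax, so this only carries over to databases that support it.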

krukid
A: 

If your table has a PK (or you can easily give it one), you can specify any number of columns in the table that must be equal (to qualify a row as a duplicate) with the following query (it may look a bit messy, but it works):

DELETE FROM table WHERE pk_id IN(
   SELECT DISTINCT t3.pk_id FROM (
       SELECT t1.* FROM table AS t1 INNER JOIN (
           SELECT col1, col2, col3, col4, COUNT(*) FROM table
           GROUP BY col1, col2, col3, col4 HAVING COUNT(*)>1) AS t2
       ON t1.col1 = t2.col1 AND t1.col2 = t2.col2 AND t1.col3 = t2.col3 AND
       t1.col4 = t2.col4)
   AS t3, (
       SELECT t1.* FROM table AS t1 INNER JOIN (
           SELECT col1, col2, col3, col4, COUNT(*) FROM table
           GROUP BY col1, col2, col3, col4 HAVING COUNT(*)>1) AS t2
       ON t1.col1 = t2.col1 AND t1.col2 = t2.col2 AND t1.col3 = t2.col3 AND
       t1.col4 = t2.col4)
   AS t4
   WHERE t3.col1 = t4.col1 AND t3.col2 = t4.col2 AND t3.col3 = t4.col3 AND
   t3.col4 = t4.col4 AND t3.pk_id > t4.pk_id
)

This will leave the first record entered into the database, deleting the 'newest' duplicates. If you want to keep the last record, switch the > to <.

Raymond Williams
A: 

In MySQL, when I ran something like

delete from A where IDA in (select IDA from A )

MySQL said something like "you can't use the same table in the select part of the delete operation."

I just had to delete some duplicate records, and I succeeded with a PHP script like this:

<?php
...
$res = hacer_sql("SELECT MIN(IDESTUDIANTE) as IDTODELETE 
FROM `estudiante` group by `LASTNAME`,`FIRSTNAME`,`CI`,`PHONE`
HAVING COUNT(*) > 1");
while ( $reg = mysql_fetch_assoc($res) ) {
   hacer_sql("delete from estudiante where IDESTUDIANTE = {$reg['IDTODELETE']}");
}
?>
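
Worth noting about the loop above: each run deletes only the lowest ID per duplicate group, so a group with three or more copies needs more than one run. A minimal Ruby sketch of that behaviour follows, using an in-memory array instead of the estudiante table. The sample rows and method names are invented for the illustration.

```ruby
# One pass of the approach above: find the lowest ID in every group that
# still has more than one row, and delete exactly those rows.
def delete_one_per_group(rows)
  ids = rows.group_by { |row| row.reject { |column, _| column == :id } }
            .values
            .select { |group| group.size > 1 }
            .map { |group| group.map { |row| row[:id] }.min }
  rows.reject { |row| ids.include?(row[:id]) }
end

# Repeat passes until one of them removes nothing.
# Returns the surviving rows and the number of passes that deleted something.
def dedupe(rows)
  passes = 0
  loop do
    remaining = delete_one_per_group(rows)
    break if remaining.size == rows.size

    rows = remaining
    passes += 1
  end
  [rows, passes]
end

# Hypothetical sample data: three copies of one student, one unique student.
students = [
  { id: 1, lastname: 'Perez', firstname: 'Ana' },
  { id: 2, lastname: 'Perez', firstname: 'Ana' },
  { id: 3, lastname: 'Perez', firstname: 'Ana' },
  { id: 4, lastname: 'Lopez', firstname: 'Luis' }
]

remaining, passes = dedupe(students)
puts "#{remaining.size} rows left after #{passes} passes"
```

Because the lowest ID goes first each time, the copy with the highest ID is the one that survives.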
A: 
ravindra bhosale
A: 

Hi,

I use ALTER IGNORE TABLE to add a unique index; rows that duplicate an existing value in the indexed column are dropped:

ALTER IGNORE TABLE jos_city ADD UNIQUE INDEX(city);