How do I remove rows that are duplicates by a specific field, so that only one row from each set of duplicates is left?

I have a table:

    CREATE TABLE table1 (field1 varchar(255), field2 varchar(255));

I'd like to purge table1 of any unnecessary copies of rows, leaving a random row from each set of rows with the same field1.

UPD: Please post MySQL-compatible commands.

+1  A: 

In MySQL:

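    CREATE TABLE `new_table` LIKE `table1`;
    -- Relies on MySQL's non-standard GROUP BY extension; this is rejected
    -- when the ONLY_FULL_GROUP_BY SQL mode is enabled (the default since 5.7.5).
    INSERT INTO `new_table` ( SELECT * FROM `table1` GROUP BY field1 );
    DROP TABLE `table1`;
    RENAME TABLE `new_table` TO `table1`;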

This keeps an arbitrary row from each group rather than a truly "random" one, which is fine if you don't care which copy survives.

If you have more fields that need to be unique in combination with the rest, add them to the GROUP BY clause.
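
On servers where `ONLY_FULL_GROUP_BY` is enabled, the `SELECT *` form is rejected. A minimal sketch that avoids this, assuming the table has only the two columns from the question, is to aggregate the non-grouped column explicitly. `MIN(field2)` is just one possible way to pick the survivor, not something this answer prescribes.

```sql
-- Copy the structure, then keep one row per field1 value.
CREATE TABLE new_table LIKE table1;

INSERT INTO new_table (field1, field2)
SELECT field1, MIN(field2)   -- assumption: any surviving field2 is acceptable
FROM table1
GROUP BY field1;

-- Swap the de-duplicated copy in for the original.
DROP TABLE table1;
RENAME TABLE new_table TO table1;
```

With more than one extra column, separate `MIN()` calls could combine values from different rows, so this shortcut only suits the two-column case.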

EDIT: Reverted to old answer

Fragsworth
@Fragsworth: Whoops, #1093 - You can't specify target table 'table1' for update in FROM clause
Alex
+1  A: 

Working off Fragsworth's answer, I'd:

  1. Create a new table: NEW_TABLE
  2. Define field1 as the primary key
  3. Insert rows into NEW_TABLE from the old table, using INSERT IGNORE so that duplicates are skipped instead of aborting the statement
  4. Drop the old table
  5. Rename NEW_TABLE to whatever the old table was called

The primary key would stop rows with the same field1 value from being inserted, and would be better overall for later queries.
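
The steps above could be sketched roughly as follows, assuming the two-column table from the question. The column definitions are copied from the question's `CREATE TABLE`; the `NOT NULL` is implied by the primary key.

```sql
-- Steps 1 and 2: new table with field1 as the primary key.
CREATE TABLE new_table (
    field1 varchar(255) NOT NULL,
    field2 varchar(255),
    PRIMARY KEY (field1)
);

-- Step 3: IGNORE turns duplicate-key errors into warnings,
-- so the first row seen for each field1 is kept and the rest are skipped.
INSERT IGNORE INTO new_table (field1, field2)
SELECT field1, field2
FROM table1;

-- Steps 4 and 5: replace the old table with the de-duplicated one.
DROP TABLE table1;
RENAME TABLE new_table TO table1;
```

Two caveats are worth checking before relying on this: rows whose field1 is NULL cannot be stored under a primary key, and a 255-character key may exceed the index length limit on some older InnoDB configurations with multi-byte character sets.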

OMG Ponies
A: 

This should do it (untested, in SQL Server):

    SELECT field1, field2
    INTO #temp
    FROM 
       (SELECT ROW_NUMBER() OVER (PARTITION BY field1 ORDER BY NEWID()) AS __ROW, *
        FROM table1) x
    WHERE x.__ROW = 1;

    DELETE table1;

    INSERT table1 
    SELECT field1, field2
    FROM #temp;
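
For readers on MySQL, the same idea might be expressed as below. This is only a sketch and assumes MySQL 8.0 or later, where window functions are available. The names `tmp_dedup`, `ranked` and `rn` are made up for illustration.

```sql
-- Pick one random row per field1 value into a temporary table.
CREATE TEMPORARY TABLE tmp_dedup AS
SELECT field1, field2
FROM (
    SELECT field1, field2,
           ROW_NUMBER() OVER (PARTITION BY field1 ORDER BY RAND()) AS rn
    FROM table1
) AS ranked
WHERE rn = 1;

-- Empty the original table and reload it from the de-duplicated copy.
DELETE FROM table1;

INSERT INTO table1 (field1, field2)
SELECT field1, field2
FROM tmp_dedup;

DROP TEMPORARY TABLE tmp_dedup;
```

On a transactional engine it would be reasonable to wrap the `DELETE` and `INSERT` in a single transaction so the table is never observed empty.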
Dave Markle
The OP updated the question to say this needs to be for MySQL
OMG Ponies
A: 

Make a new table OR add a unique key, self join, and delete all but the minimum key


New table:

So you could make a new table without dups. I imagine you thought of this already.

    CREATE TABLE new_test (field1 INTEGER, field2 INTEGER);
    INSERT INTO new_test(field1,field2) SELECT DISTINCT field1,field2 FROM test;
    DROP TABLE test;
    RENAME TABLE new_test TO test;

If you had a unique key, you could do a self join and identify the rows to delete as those whose key is greater than the minimum for their group. If you didn't have such a key, you could make one:

Make unique key:

    ALTER TABLE t2 ADD COLUMN (pk INTEGER NOT NULL AUTO_INCREMENT, PRIMARY KEY(pk));

Anyway, now you can do a self join and keep MIN(pk):

Self-join and delete dups:

    DELETE dups.* FROM t2 AS dups
        INNER JOIN (
            SELECT field1,field2,MIN(pk) as MPK FROM t2
            GROUP BY field1,field2 HAVING COUNT(*) > 1 ) AS keep
        ON keep.field1=dups.field1
           AND keep.field2=dups.field2
           AND keep.MPK <> dups.pk;
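
Applied to the table from the question, where rows count as duplicates when only field1 matches, the same self-join approach might look like the sketch below. The `pk` column and the `min_pk` alias are illustrative names, and the column is dropped again at the end on the assumption that it is not wanted permanently.

```sql
-- Add a temporary surrogate key so individual rows can be told apart.
ALTER TABLE table1
    ADD COLUMN pk INT NOT NULL AUTO_INCREMENT PRIMARY KEY;

-- Delete every row whose pk is larger than the smallest pk for its field1.
DELETE dups
FROM table1 AS dups
INNER JOIN (
    SELECT field1, MIN(pk) AS min_pk
    FROM table1
    GROUP BY field1
) AS keep
    ON keep.field1 = dups.field1
   AND dups.pk > keep.min_pk;

-- Remove the helper column once the duplicates are gone.
ALTER TABLE table1 DROP COLUMN pk;
```

Rows where field1 is NULL never satisfy the equality in the join, so they are left untouched by this statement.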
DigitalRoss
-1 for the eye-hurting font
Andomar
Ok, made it smaller
DigitalRoss
+3  A: 

The simplest way is to use the MySQL-specific ALTER IGNORE command. It is unintuitive to delete rows by creating an index, but it works very well. The IGNORE keyword means that when the unique index is created, only one row from each set of duplicates is kept and the rest are deleted. Leaving the index in place also prevents any future duplicates; if you do not want that behaviour, just drop the index after creating it. Note that ALTER IGNORE was deprecated in MySQL 5.6.17 and removed in 5.7.4, so this only works on older servers.

    ALTER IGNORE TABLE table1 ADD UNIQUE INDEX indexname (field1, field2);
RedFilter
+1 Read this only after posting the exact same thing:)
Andomar
A: 

You can use MySQL's ALTER IGNORE syntax for that. The following command will remove any duplicates, leaving one arbitrary row for each value of field1:

    alter ignore table table1 add unique index index1 (field1);

It would be wise to keep the index in place, so new duplicates cannot be added. But if you'd like, you can remove the index with:

    alter table table1 drop index index1;
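
On a server where the IGNORE form is not accepted, adding the unique index without it fails for as long as duplicates remain. A small sketch of how one might check first, reusing the `index1` name from above (the `copies` alias is illustrative):

```sql
-- List every field1 value that still occurs more than once.
SELECT field1, COUNT(*) AS copies
FROM table1
GROUP BY field1
HAVING COUNT(*) > 1;

-- Once the query above returns no rows, a plain unique index can be added
-- to keep new duplicates out.
ALTER TABLE table1 ADD UNIQUE INDEX index1 (field1);
```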
Andomar