views: 1118

answers: 5
Hi, I have a problem where I have to add a unique constraint to an existing table. This is fine except that the table has millions of rows already, and many of the rows violate the unique constraint I need to add.

What is the fastest approach to removing the offending rows? I have an SQL statement that finds the duplicates and deletes them, but it is taking forever to run. Is there another way to solve this problem? Maybe backing up the table, then restoring it after the constraint is added?

Thanks

+4  A: 

for example:

CREATE TABLE tmp ...
INSERT INTO tmp SELECT DISTINCT * FROM t;
DROP TABLE t;
ALTER TABLE tmp RENAME TO t;
just somebody
Can you make it distinct for a group of columns? Maybe "SELECT DISTINCT (t.a, t.b, t.c), * FROM t"?
gjrwebber
DISTINCT ON (a, b, c): http://www.postgresql.org/docs/8.2/interactive/sql-select.html
just somebody
Did the trick. Thanks.
gjrwebber
easier to type: `CREATE TABLE tmp AS SELECT ...;`. Then you don't even need to figure out what the layout of `tmp` is. :)
Randal Schwartz
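
Combining the suggestions from this comment thread, a rough sketch (here a, b, c stand in for whichever columns define a duplicate):

-- keeps one arbitrary row per (a, b, c); extend the ORDER BY to control which row wins
CREATE TABLE tmp AS
    SELECT DISTINCT ON (a, b, c) * FROM t ORDER BY a, b, c;
DROP TABLE t;
ALTER TABLE tmp RENAME TO t;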
+1  A: 
-- keep only the row with the highest id in each duplicate group, delete the rest
DELETE FROM tbl
    WHERE id NOT IN
    (SELECT  MAX(s.id)
        FROM     tbl AS s
        GROUP BY    s.this_thing, s.that_thing);
Secko
That is what I am currently doing, but it is taking a very long time to run.
gjrwebber
+1  A: 

First, you need to decide which of your "duplicates" you will keep. If all columns are equal, you can delete any of them. But perhaps you want to keep only the most recent, or the one satisfying some other criterion?

The fastest way depends on your answer to the question above, and also on the percentage of duplicates in the table. If you are throwing away 50% of your rows, you're better off doing CREATE TABLE ... AS SELECT DISTINCT ... FROM ...; if you are deleting only 1% of the rows, DELETE is better.

Also, for maintenance operations like this, it's generally good to set work_mem to a good chunk of your RAM: run EXPLAIN, check the number N of sorts/hashes, and set work_mem to roughly your RAM / 2 / N. Using lots of RAM is good for speed, as long as you have only one concurrent connection...

peufeu
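
As a rough illustration of that tuning advice (the 1GB figure assumes, say, 4GB of RAM and a plan with two sorts; adjust for your machine):

-- RAM / 2 / N, e.g. 4GB / 2 / 2 sorts = 1GB; SET only affects the current session
SET work_mem = '1GB';
CREATE TABLE t_dedup AS SELECT DISTINCT * FROM t;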
A: 

This function removes duplicate rows from any table while keeping its indexes in place.

Usage: select remove_duplicates('mytable');

---
--- remove_duplicates(tablename) removes duplicate records from a table (convert from set to unique set)
---
CREATE OR REPLACE FUNCTION remove_duplicates(text) RETURNS void AS $$
DECLARE
  tablename ALIAS FOR $1;
BEGIN
  EXECUTE 'CREATE TEMPORARY TABLE _DISTINCT_' || tablename || ' AS (SELECT DISTINCT * FROM ' || tablename || ');';
  EXECUTE 'DELETE FROM ' || tablename || ';';
  EXECUTE 'INSERT INTO ' || tablename || ' (SELECT * FROM _DISTINCT_' || tablename || ');';
  EXECUTE 'DROP TABLE _DISTINCT_' || tablename || ';';
  RETURN;
END;
$$ LANGUAGE plpgsql;
Ole Tange
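
A caveat: the table name is spliced directly into the dynamic SQL, so a mixed-case or otherwise unusual name would break it. One possible guard (a sketch only, shown for the first statement) is quote_ident():

  EXECUTE 'CREATE TEMPORARY TABLE ' || quote_ident('_distinct_' || tablename)
       || ' AS (SELECT DISTINCT * FROM ' || quote_ident(tablename) || ');';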
+1  A: 

I'm working with PostgreSQL 8.4. When I ran the proposed code, I found that it was not actually removing the duplicates. After running some tests, I found that adding "DISTINCT ON (duplicate_column_name)" and "ORDER BY duplicate_column_name" did the trick. I'm no SQL guru; I found this in the PostgreSQL 8.4 SELECT...DISTINCT doc.

CREATE OR REPLACE FUNCTION remove_duplicates(text, text) RETURNS void AS $$
DECLARE
  tablename ALIAS FOR $1;
  duplicate_column ALIAS FOR $2;
BEGIN
  EXECUTE 'CREATE TEMPORARY TABLE _DISTINCT_' || tablename || ' AS (SELECT DISTINCT ON (' || duplicate_column || ') * FROM ' || tablename || ' ORDER BY ' || duplicate_column || ' ASC);';
  EXECUTE 'DELETE FROM ' || tablename || ';';
  EXECUTE 'INSERT INTO ' || tablename || ' (SELECT * FROM _DISTINCT_' || tablename || ');';
  EXECUTE 'DROP TABLE _DISTINCT_' || tablename || ';';
  RETURN;
END;
$$ LANGUAGE plpgsql;
CM
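
Usage presumably mirrors the original, with the column that defines a duplicate passed as the second argument (mycolumn here is just a placeholder):

select remove_duplicates('mytable', 'mycolumn');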