views: 1118

answers: 5
Hi, I have a problem where I have to add a unique constraint to an existing table. This is fine except that the table has millions of rows already, and many of the rows violate the unique constraint I need to add.

What is the fastest approach to removing the offending rows? I have an SQL statement that finds the duplicates and deletes them, but it is taking forever to run. Is there another way to solve this problem? Maybe backing up the table, then restoring it after the constraint is added?

Thanks

+4  A: 

for example:

CREATE TABLE tmp ...
INSERT INTO tmp SELECT DISTINCT * FROM t;
DROP TABLE t;
ALTER TABLE tmp RENAME TO t;
just somebody
Can you make it distinct for a group of columns? Maybe "SELECT DISTINCT (t.a, t.b, t.c), * FROM t"?
gjrwebber
DISTINCT ON (a, b, c): http://www.postgresql.org/docs/8.2/interactive/sql-select.html
just somebody
Did the trick. Thanks.
gjrwebber
easier to type: `CREATE TABLE tmp AS SELECT ...;`. Then you don't even need to figure out what the layout of `tmp` is. :)
Randal Schwartz
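
Combining the suggestions from this comment thread, a rough sketch (here a, b, c stand in for whichever columns define a duplicate):

-- keeps one arbitrary row per (a, b, c); extend the ORDER BY to control which row wins
CREATE TABLE tmp AS
    SELECT DISTINCT ON (a, b, c) * FROM t ORDER BY a, b, c;
DROP TABLE t;
ALTER TABLE tmp RENAME TO t;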
+1  A: 
-- keep only the row with the highest id in each duplicate group, delete the rest
DELETE FROM tbl
    WHERE id NOT IN
    (SELECT  MAX(s.id)
        FROM     tbl AS s
        GROUP BY    s.this_thing, s.that_thing);
Secko
That is what I am currently doing, but it is taking a very long time to run.
gjrwebber
+1  A: 

First, you need to decide which of your "duplicates" you will keep. If all columns are equal, you can delete any of them. But perhaps you want to keep only the most recent, or the one satisfying some other criterion?

The fastest way depends on your answer to the question above, and also on the percentage of duplicates in the table. If you are throwing away 50% of your rows, you're better off doing CREATE TABLE ... AS SELECT DISTINCT ... FROM ...; if you are deleting only 1% of the rows, DELETE is better.

Also, for maintenance operations like this, it's generally good to set work_mem to a good chunk of your RAM: run EXPLAIN, check the number N of sorts/hashes, and set work_mem to roughly your RAM / 2 / N. Using lots of RAM is good for speed, as long as you have only one concurrent connection...

peufeu
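
As a rough illustration of that tuning advice (the 1GB figure assumes, say, 4GB of RAM and a plan with two sorts; adjust for your machine):

-- RAM / 2 / N, e.g. 4GB / 2 / 2 sorts = 1GB; SET only affects the current session
SET work_mem = '1GB';
CREATE TABLE t_dedup AS SELECT DISTINCT * FROM t;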
A: 

This function removes duplicate rows from any table while keeping its indexes in place.

Usage: select remove_duplicates('mytable');

---
--- remove_duplicates(tablename) removes duplicate records from a table (convert from set to unique set)
---
CREATE OR REPLACE FUNCTION remove_duplicates(text) RETURNS void AS $$
DECLARE
  tablename ALIAS FOR $1;
BEGIN
  EXECUTE 'CREATE TEMPORARY TABLE _DISTINCT_' || tablename || ' AS (SELECT DISTINCT * FROM ' || tablename || ');';
  EXECUTE 'DELETE FROM ' || tablename || ';';
  EXECUTE 'INSERT INTO ' || tablename || ' (SELECT * FROM _DISTINCT_' || tablename || ');';
  EXECUTE 'DROP TABLE _DISTINCT_' || tablename || ';';
  RETURN;
END;
$$ LANGUAGE plpgsql;
Ole Tange
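
A caveat: the table name is spliced directly into the dynamic SQL, so a mixed-case or otherwise unusual name would break it. One possible guard (a sketch only, shown for the first statement) is quote_ident():

  EXECUTE 'CREATE TEMPORARY TABLE ' || quote_ident('_distinct_' || tablename)
       || ' AS (SELECT DISTINCT * FROM ' || quote_ident(tablename) || ');';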
+1  A: 

I'm working with PostgreSQL 8.4. When I ran the proposed code, I found that it was not actually removing the duplicates. After running some tests, I found that adding "DISTINCT ON (duplicate_column_name)" and "ORDER BY duplicate_column_name" did the trick. I'm no SQL guru; I found this in the PostgreSQL 8.4 SELECT...DISTINCT doc.

CREATE OR REPLACE FUNCTION remove_duplicates(text, text) RETURNS void AS $$
DECLARE
  tablename ALIAS FOR $1;
  duplicate_column ALIAS FOR $2;
BEGIN
  EXECUTE 'CREATE TEMPORARY TABLE _DISTINCT_' || tablename || ' AS (SELECT DISTINCT ON (' || duplicate_column || ') * FROM ' || tablename || ' ORDER BY ' || duplicate_column || ' ASC);';
  EXECUTE 'DELETE FROM ' || tablename || ';';
  EXECUTE 'INSERT INTO ' || tablename || ' (SELECT * FROM _DISTINCT_' || tablename || ');';
  EXECUTE 'DROP TABLE _DISTINCT_' || tablename || ';';
  RETURN;
END;
$$ LANGUAGE plpgsql;
CM
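
Usage presumably mirrors the original, with the column that defines a duplicate passed as the second argument (mycolumn here is just a placeholder):

select remove_duplicates('mytable', 'mycolumn');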