I already have 80 million records inserted into a table, but need to ensure a few columns are jointly unique. However, the columns already contain non-unique data, so ALTER TABLE doesn't work.

I'd like either a query that will let me easily delete records that are non-unique, while keeping one of them, or one that will allow me to load the data from the current table into a new one, while filtering for uniqueness.

+4  A: 

The query you're looking for is:

select distinct on (my_unique_1, my_unique_2) * from my_table;

This returns one row for each combination of the columns listed in distinct on: the first row of each group. Without order by, however, there is no reliable order in which rows are returned, so you cannot tell which row will be the first one. For that reason distinct on is rarely used without order by.

Combined with order by, you can choose which row comes first in each group (this keeps the row with the greatest last_update_date):

 select distinct on (my_unique_1, my_unique_2) * 
 from my_table order by my_unique_1, my_unique_2, last_update_date desc;

Now you can select this into a new table:

 create table my_new_table as
 select distinct on (my_unique_1, my_unique_2) * 
 from my_table order by my_unique_1, my_unique_2, last_update_date desc;
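
Note that create table as copies only the data, not indexes, constraints or defaults, so the unique constraint still has to be added on the new table. A rough sketch of how that and the swap could look, assuming the names my_table_uniq and my_table_old (both made up for illustration) and that nothing else writes to my_table in the meantime:

```sql
-- Sketch only: my_table_uniq and my_table_old are hypothetical names.
begin;

-- Enforce joint uniqueness on the deduplicated copy.
alter table my_new_table
    add constraint my_table_uniq unique (my_unique_1, my_unique_2);

-- Swap the tables; keep the old one around until the result is verified.
alter table my_table rename to my_table_old;
alter table my_new_table rename to my_table;

commit;
```

Any other indexes, defaults or foreign keys from the original table would need to be recreated the same way.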

Or you can use it for delete, assuming row_id is a primary key:

 delete from my_table where row_id not in (
     select distinct on (my_unique_1, my_unique_2) row_id 
     from my_table order by my_unique_1, my_unique_2, last_update_date desc);
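
After the delete you will probably want to confirm that no duplicates remain and then add the constraint. A minimal sketch, where my_table_uniq is just an illustrative constraint name:

```sql
-- Should return no rows once the duplicates are gone.
select my_unique_1, my_unique_2, count(*)
from my_table
group by my_unique_1, my_unique_2
having count(*) > 1;

-- Sketch only: my_table_uniq is a hypothetical constraint name.
alter table my_table
    add constraint my_table_uniq unique (my_unique_1, my_unique_2);
```

If the first query still returns rows, the alter table will fail in the same way it did before.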
Konrad Garus
+1 DISTINCT ON is a very handy PostgreSQL feature
leonbloy
About "the first row": Without an ORDER BY, there is no way to tell which row will come back first, so the "first row" is a misleading term as you may not always get the same result. A DISTINCT ON is pretty much useless without an ORDER BY clause.
Matthew Wood
Thanks, updated to make this more explicit.
Konrad Garus
I read about distinct, but I tried using it with `Limit 1000` as well, just to check the output. It was taking forever, but I assume that's because I had to remove indexes temporarily to insert more data quickly. Thanks for the clear example, but I'm confused about the `my_unique` columns after `distinct on`. The docs say that those should be expressions, so does including the columns as expressions just make sure they are present in the record? I ask because I actually need to make sure those columns are not just present, but jointly unique.
ehsanul