ansaurus

Question

MySQL remove duplicates from big database quick

Answer 1

+4 A:

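If you can create a new table, do so with a unique key on the text1 + text2 fields. Then insert into the new table, ignoring duplicate-key errors with the INSERT IGNORE syntax (my_tbl_new below is a placeholder name for the new table):

insert ignore into my_tbl_new
select * from my_tbl order by text3 desc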

I think the order by text3 desc will put the NULLs last, but double check that.
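
A minimal sketch of this approach, run from Python against an in-memory SQLite database so that it is self-contained. SQLite's INSERT OR IGNORE stands in for MySQL's INSERT IGNORE, and the table name my_tbl_new and the sample rows are assumptions made for illustration. SQLite sorts NULLs last under ORDER BY ... DESC, and the MySQL manual describes the same ordering, but it is still worth checking on your own server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE my_tbl (id INTEGER PRIMARY KEY, text1 TEXT, text2 TEXT, text3 TEXT);
    INSERT INTO my_tbl VALUES
        (1, 'aaa', 'bbb', NULL),
        (2, 'aaa', 'bbb', 'ccc'),
        (3, 'ddd', 'eee', NULL),
        (4, 'ddd', 'eee', NULL);

    -- Hypothetical target table with the unique key on text1 + text2.
    CREATE TABLE my_tbl_new (id INTEGER PRIMARY KEY, text1 TEXT, text2 TEXT, text3 TEXT,
                             UNIQUE (text1, text2));

    -- SQLite spelling of MySQL's INSERT IGNORE: rows that would violate
    -- the unique key are skipped, so the first row seen per pair wins.
    INSERT OR IGNORE INTO my_tbl_new
        SELECT * FROM my_tbl ORDER BY text3 DESC;
""")

survivors = conn.execute(
    "SELECT id, text1, text2, text3 FROM my_tbl_new ORDER BY text1"
).fetchall()
for row in survivors:
    print(row)
```

For the (aaa, bbb) pair the row with the non-NULL text3 is kept. For a pair where every text3 is NULL, which of the duplicates survives is not specified.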

Indexes on all those columns could help a lot, but creating them now could be pretty slow.

Scott Saunders 2009-10-30 20:08:26

It will put nulls last, but it doesn't satisfy the request, which was "keep the first one that does not have a null in text3". To do this you will need to order by ID ASC and add a WHERE text3 IS NOT NULL to your statement.

Kevin Peno 2009-10-30 20:28:46

That's a good point. However, that requirement contradicts his sample output: 2 | aaa | bbb | NULL. Perhaps he'll tell us what he really wants.

Scott Saunders 2009-10-30 20:40:09

I reread his request. It appears that he doesn't care so long as, if there is a non-null, non-nulls are kept. So your example would suit well. :)

Kevin Peno 2009-10-30 21:09:15

Thanks, it's working. With 1.2 million rows it took almost 3 hours, an estimated 4000 rows written per minute. It keeps the duplicate with the biggest text3 value, which matches my database logic.

bizzz 2009-10-31 10:50:08

Answer 2

A:

DELETE FROM dups
WHERE id NOT IN(
    SELECT id FROM (
        SELECT DISTINCT id, text1, text2
            FROM dups
        GROUP BY text1, text2
        ORDER BY text3 DESC
    ) as tmp
)

This queries all records, groups by the distinguishing fields and orders by text3 descending (the intent being to pick a record whose text3 is not null). Then we select the ids from that result (these are the good ids, which won't be deleted) and delete all ids that AREN'T in it.

Any query like this affecting the entire table will be slow. You just need to run it and let it roll out so you can prevent it in the future.

After you have done this "fix" I would apply UNIQUE INDEX (text1, text2) to that table, to prevent the possibility of duplicates in the future.

If you want to go the "create a new table and replace the old one" route, you could use the innermost select statement to create your insert statement.

MySQL specific (assumes new table is named my_tbl2 and has exactly the same structure):

INSERT INTO my_tbl2
    SELECT DISTINCT id, text1, text2, text3
        FROM dups
    GROUP BY text1, text2
    ORDER BY text3 DESC

See MySQL INSERT ... SELECT for more information.

Kevin Peno 2009-10-30 20:15:12

Sorry, both of your suggestions delete duplicates, but they don't choose the right text3 value to survive (NULLs remain even when there are NOT NULL alternatives).

bizzz 2009-10-31 11:06:35

Answer 3

A:

I don't have much experience with MySQL. If it has analytic functions try:

delete from my_tbl
 where id in (
     select id 
       from (select id, row_number()
                            over (partition by text1, text2 order by text3 desc) as rn
               from my_tbl
               /* optional: where text1 like 'a%'  */
             ) as t2
       where rn > 1
     )

The optional where clause means you'll have to run it multiple times, once for each letter, etc. Create an index on text1?

Before running this, confirm that "text3 desc" will sort nulls last in MySQL.
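
MySQL added window functions such as ROW_NUMBER() in version 8.0; earlier servers reject the OVER clause. As a way to check the logic without a MySQL server, here is a sketch that runs the same delete from Python against an in-memory SQLite database (window functions need SQLite 3.25 or later). The sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE my_tbl (id INTEGER PRIMARY KEY, text1 TEXT, text2 TEXT, text3 TEXT);
    INSERT INTO my_tbl VALUES
        (1, 'aaa', 'bbb', NULL),
        (2, 'aaa', 'bbb', 'ccc'),
        (3, 'aaa', 'bbb', 'ddd'),
        (4, 'xxx', 'yyy', NULL);

    -- Number the rows inside each (text1, text2) group, largest text3 first
    -- (NULLs sort last under DESC in SQLite), then delete everything except
    -- the row numbered 1 in its group.
    DELETE FROM my_tbl
     WHERE id IN (
         SELECT id
           FROM (SELECT id, row_number()
                            OVER (PARTITION BY text1, text2 ORDER BY text3 DESC) AS rn
                   FROM my_tbl) AS t2
          WHERE rn > 1);
""")

remaining = conn.execute("SELECT id, text3 FROM my_tbl ORDER BY id").fetchall()
print(remaining)
```

Only one row per (text1, text2) pair is left, and for the duplicated pair it is the one with the largest non-NULL text3.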

2009-10-30 20:59:29

Sorry, Error Code : 1064 near '(partition by...'

bizzz 2009-10-31 11:10:47

I guess MySQL doesn't have analytic functions. I'll try again later.

2009-11-02 21:43:51

Can you run:

create table dups as
SELECT text1, text2
     , max(case when text3 is null then 1 else 0 end) as has_null3
     , max(case when text3 is not null then 1 else 0 end) as has_not_null3
     , min(case when text3 is not null then id else null end) as pref_id
  FROM my_tbl
 GROUP BY text1, text2
having count(*) > 1

This will give us the list of duplicated text1/text2 pairs and some of the "preferred" ids. If it takes too long, and it probably will, add "where text1 like 'a%'" or something like that.
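
To see what that grouping query produces, here is a small sketch run from Python against an in-memory SQLite database; the sample rows are invented, and each CASE expression is written with its closing END. pref_id comes out as the smallest id in the group that has a non-NULL text3, or NULL when the group has none.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE my_tbl (id INTEGER PRIMARY KEY, text1 TEXT, text2 TEXT, text3 TEXT);
    INSERT INTO my_tbl VALUES
        (1, 'aaa', 'bbb', NULL),
        (2, 'aaa', 'bbb', 'ccc'),
        (3, 'ddd', 'eee', NULL),
        (4, 'ddd', 'eee', NULL),
        (5, 'fff', 'ggg', 'hhh');

    -- One summary row per duplicated (text1, text2) pair.
    CREATE TABLE dups AS
    SELECT text1, text2
         , max(CASE WHEN text3 IS NULL THEN 1 ELSE 0 END) AS has_null3
         , max(CASE WHEN text3 IS NOT NULL THEN 1 ELSE 0 END) AS has_not_null3
         , min(CASE WHEN text3 IS NOT NULL THEN id ELSE NULL END) AS pref_id
      FROM my_tbl
     GROUP BY text1, text2
    HAVING count(*) > 1;
""")

dup_summary = conn.execute(
    "SELECT text1, text2, has_null3, has_not_null3, pref_id FROM dups ORDER BY text1"
).fetchall()
for row in dup_summary:
    print(row)
```

The unduplicated (fff, ggg) pair is filtered out by the HAVING clause, and the all-NULL (ddd, eee) pair has no preferred id.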

2009-11-02 22:14:46

Answer 4

+3 A:

I believe this will do it, using on duplicate key + ifnull():

create table tmp like yourtable;

alter table tmp add unique (text1, text2);

insert into tmp select * from yourtable 
    on duplicate key update text3=ifnull(text3, values(text3));

rename table yourtable to deleteme, tmp to yourtable;

drop table deleteme;

Should be much faster than anything that requires group by or distinct or a subquery, or even order by. This doesn't even require a filesort, which is going to kill performance on a large temporary table. Will still require a full scan over the original table, but there's no avoiding that.
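
A rough way to try the idea without a MySQL server is the sketch below, run from Python against an in-memory SQLite database. It is an analogue, not the MySQL statement itself: SQLite (3.24 or later) spells the upsert as ON CONFLICT ... DO UPDATE, uses excluded.text3 where MySQL uses values(text3), and needs a WHERE clause on the SELECT to parse. The sample rows and the index name are assumptions, and the rename/drop steps are left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE yourtable (id INTEGER PRIMARY KEY, text1 TEXT, text2 TEXT, text3 TEXT);
    INSERT INTO yourtable VALUES
        (1, 'aaa', 'bbb', NULL),
        (2, 'aaa', 'bbb', 'ccc'),
        (3, 'aaa', 'bbb', 'zzz'),
        (4, 'ddd', 'eee', NULL);

    -- SQLite has no CREATE TABLE ... LIKE, so the copy is spelled out.
    CREATE TABLE tmp (id INTEGER PRIMARY KEY, text1 TEXT, text2 TEXT, text3 TEXT);
    CREATE UNIQUE INDEX tmp_text1_text2 ON tmp (text1, text2);

    -- On a duplicate (text1, text2), keep the stored text3 if it is not NULL,
    -- otherwise take text3 from the incoming row.
    INSERT INTO tmp SELECT * FROM yourtable WHERE 1
        ON CONFLICT (text1, text2) DO UPDATE SET text3 = ifnull(text3, excluded.text3);
""")

deduped = conn.execute(
    "SELECT id, text1, text2, text3 FROM tmp ORDER BY text1"
).fetchall()
for row in deduped:
    print(row)
```

One thing the sketch makes visible: the surviving row keeps the id and other columns of the first duplicate inserted, while its text3 may have been filled in from a later duplicate.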

ʞɔıu 2009-10-30 21:26:47

Thanks, it works! 1.2 mil rows became 0.6 mil in 60 minutes, so that's around 10000 rows written per minute. Thanks for the clear explanation too! :)

bizzz 2009-10-31 10:55:38
