ansaurus

Question

Remove duplicate records/objects uniquely identified by multiple attributes

Answer 1

A:

Sometimes you just need to roll up your sleeves and write some serious SQL to kill off all the ones you don't want. This is easy if it's a one-shot thing, and not too hard to roll into a Rake task you can fire on demand.

For instance, to select all the distinct status records, it is reasonable to use something like the following:

SELECT id FROM hero_statuses GROUP BY user_id, hero_type, recordable_id

Given that these are the sufficiently unique records in your set, you can go about removing all the ones you don't want:

DELETE FROM hero_statuses WHERE id NOT IN (SELECT id FROM (SELECT id FROM hero_statuses GROUP BY user_id, hero_type, recordable_id) AS keepers)

As with any operation that involves DELETE FROM, I hope you don't just fire this off on your production data without the usual precautions of backing things up.
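To make the effect of that DELETE concrete, here is a minimal Ruby sketch of the same keep-one-per-group logic on in-memory rows. The sample data and the duplicate_ids helper are made up for illustration, and it assumes you want to keep the lowest id in each group, which the MySQL query above does not actually guarantee.

```ruby
# Hypothetical in-memory rows standing in for the hero_statuses table.
rows = [
  { id: 1, user_id: 10, hero_type: "mayor", recordable_id: 5 },
  { id: 2, user_id: 10, hero_type: "mayor", recordable_id: 5 }, # duplicate of 1
  { id: 3, user_id: 11, hero_type: "mayor", recordable_id: 5 },
  { id: 4, user_id: 10, hero_type: "mayor", recordable_id: 5 }  # duplicate of 1
]

# The columns that together identify a record.
KEY = [:user_id, :hero_type, :recordable_id]

# Returns the ids of every row except the lowest-id row in each key group.
# (Helper name is an assumption for this sketch, not part of any library.)
def duplicate_ids(rows, key)
  rows.group_by { |row| row.values_at(*key) }
      .values
      .flat_map { |group| group.map { |row| row[:id] }.sort.drop(1) }
end

ids_to_delete = duplicate_ids(rows, KEY)
puts ids_to_delete.inspect
```

Checking the id list like this before running the real DELETE is a cheap sanity test on a copy of the data.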

As to how to prevent this in the future, if the combination of these columns has to be unique, create a unique index on them:

add_index :hero_statuses, [ :user_id, :hero_type, :recordable_id ], :unique => true

This will generate ActiveRecord exceptions when you attempt to introduce a duplicate record. One benefit of a unique index is that you can make use of the "INSERT IGNORE INTO..." or "INSERT ... ON DUPLICATE KEY ..." features to recover from potential duplications.
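As a rough illustration of the two behaviours, here is a toy in-memory model in plain Ruby. It is not ActiveRecord or MySQL: the UniqueIndexedTable class, DuplicateKeyError, and the sample values are all invented for this sketch. A plain insert raises on a duplicate key, while the ignore variant quietly skips it.

```ruby
# Hypothetical error class, standing in for the exception a real
# database adapter would raise on a unique index violation.
class DuplicateKeyError < StandardError; end

# Toy table that enforces uniqueness over a set of key columns.
class UniqueIndexedTable
  def initialize(key_columns)
    @key_columns = key_columns
    @rows_by_key = {}
  end

  # Like a plain INSERT: raises if the composite key already exists.
  def insert(row)
    key = row.values_at(*@key_columns)
    if @rows_by_key.key?(key)
      raise DuplicateKeyError, "duplicate entry for #{key.inspect}"
    end
    @rows_by_key[key] = row
  end

  # Like INSERT IGNORE: returns true if stored, false if skipped.
  def insert_ignore(row)
    insert(row)
    true
  rescue DuplicateKeyError
    false
  end

  def size
    @rows_by_key.size
  end
end

table = UniqueIndexedTable.new([:user_id, :hero_type, :recordable_id])
status = { user_id: 10, hero_type: "mayor", recordable_id: 5 }

table.insert(status)
skipped = !table.insert_ignore(status.dup) # same key, so it is skipped
puts "rows: #{table.size}, duplicate skipped: #{skipped}"
```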

tadman 2010-05-07 20:38:24

AFAIK, standard SQL doesn't allow selecting columns which you don't have in your GROUP clause. Where does this work except in MySQL?

Mladen Jablanović 2010-05-07 21:37:32

Could I do something like this in the model for validation? validates_uniqueness_of :user_id, :scope => :recordable_type, :hero_type, :recordable_id. It would have to work so that multiple users can have the same hero_status, but an individual user cannot have more than one occurrence of the hero_status.

keruilin 2010-05-07 22:09:56

It was a MySQL question so I wasn't too worried about the GROUP BY columns not being present in the query. Another RDBMS may not support that, but I believe this is being embraced by other platforms.

tadman 2010-05-10 14:44:12

As far as validation scopes go, don't forget to enclose them in square brackets and it should work: :scope => [ :x, :y, :z ]

tadman 2010-05-10 14:45:30

Answer 2

A:

For an SQL-only approach, I would use this query (I'm assuming the ids are unique):

DELETE FROM HeroStatus WHERE id IN
(SELECT hs.id FROM 
   (SELECT user_id, recordable_type, hero_type, recordable_id, MAX(created_at) AS created_at
     FROM HeroStatus
     GROUP BY user_id, recordable_type, hero_type, recordable_id
     HAVING COUNT(id) > 1) AS del 
      INNER JOIN HeroStatus AS hs ON
      hs.user_id = del.user_id AND hs.recordable_type = del.recordable_type 
       AND hs.hero_type = del.hero_type AND hs.recordable_id = del.recordable_id 
       AND hs.created_at = del.created_at)

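A bit of a monster! The query finds all duplicates using the natural key (user_id, recordable_type, hero_type, recordable_id) and, within each group of duplicates, selects the largest created_at value (the most recently created row). It then finds the ids of those rows (by joining back to the main table) and deletes the rows with those ids. Each run removes only the newest row from each group, so if a group holds more than two duplicates, rerun the query until it affects no rows.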

(Please try this on a copy of the table first and verify you get the results you want! :-)
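If it helps to see the logic outside SQL, here is a small Ruby sketch of what the query selects for deletion, run against made-up in-memory rows. The newest_duplicate_ids helper and the sample data are assumptions for illustration only.

```ruby
# Hypothetical rows standing in for the HeroStatus table.
rows = [
  { id: 1, user_id: 10, recordable_type: "Post", hero_type: "mayor",
    recordable_id: 5, created_at: Time.utc(2010, 5, 1) },
  { id: 2, user_id: 10, recordable_type: "Post", hero_type: "mayor",
    recordable_id: 5, created_at: Time.utc(2010, 5, 3) },
  { id: 3, user_id: 11, recordable_type: "Post", hero_type: "mayor",
    recordable_id: 5, created_at: Time.utc(2010, 5, 2) },
  { id: 4, user_id: 10, recordable_type: "Post", hero_type: "mayor",
    recordable_id: 5, created_at: Time.utc(2010, 5, 2) }
]

NATURAL_KEY = [:user_id, :recordable_type, :hero_type, :recordable_id]

# For every key group with more than one row (HAVING COUNT > 1),
# return the ids of the rows carrying that group's latest created_at.
def newest_duplicate_ids(rows, key)
  rows.group_by { |row| row.values_at(*key) }
      .values
      .select { |group| group.size > 1 }
      .flat_map do |group|
        newest = group.map { |row| row[:created_at] }.max
        group.select { |row| row[:created_at] == newest }
             .map { |row| row[:id] }
      end
end

# One pass removes only the newest row of each duplicated group,
# so repeat until nothing is left to delete.
remaining = rows.dup
loop do
  ids = newest_duplicate_ids(remaining, NATURAL_KEY)
  break if ids.empty?
  remaining.reject! { |row| ids.include?(row[:id]) }
end

puts remaining.map { |row| row[:id] }.inspect
```

Note that this keeps the oldest row of each group. If you would rather keep the newest, the same shape works with min in place of max.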

To prevent this happening in future, add a unique index or constraint over the columns user_id, recordable_type, hero_type, recordable_id. E.g.

ALTER TABLE HeroStatus 
ADD UNIQUE (user_id, recordable_type, hero_type, recordable_id)

EDIT:

You add (and remove) this index within a migration like this:

add_index(:HeroStatus, [:user_id, :recordable_type, :hero_type, :recordable_id], :unique => true)
remove_index(:HeroStatus, :column => [:user_id, :recordable_type, :hero_type, :recordable_id], :unique => true)

Or, if you want to explicitly name it:

add_index(:HeroStatus, [:user_id, :recordable_type, :hero_type, :recordable_id], :unique => true, :name => :my_unique_index)
remove_index(:HeroStatus, :name => :my_unique_index)

mdma 2010-05-07 20:50:52

How do I give this a name so that I can remove it in the rollback migration?

keruilin 2010-05-07 22:18:21

I'm sorry I can't answer that - I don't know Ruby. Perhaps someone else can step in?

mdma 2010-05-07 22:55:00

I updated the answer.

Mladen Jablanović 2010-05-08 06:24:31
