I have a model called HeroStatus with the following attributes:

  • id
  • user_id
  • recordable_type
  • hero_type (can be NULL!)
  • recordable_id
  • created_at

There are over 100 hero_statuses, and a user can have many hero_statuses, but can't have the same hero_status more than once.

A user's hero_status is uniquely identified by the combination of recordable_type + hero_type + recordable_id. What I'm trying to say essentially is that there can't be a duplicate hero_status for a specific user.

Unfortunately, I didn't have a validation in place to ensure this, so I ended up with some duplicate hero_statuses for users after I made some code changes. For example:

user_id = 18
recordable_type = 'Evil'
hero_type = 'Halitosis'
recordable_id = 1
created_at = '2010-05-03 18:30:30'

user_id = 18
recordable_type = 'Evil'
hero_type = 'Halitosis'
recordable_id = 1
created_at = '2009-03-03 15:30:00'

user_id = 18
recordable_type = 'Good'
hero_type = 'Hugs'
recordable_id = 1
created_at = '2009-02-03 12:30:00'

user_id = 18
recordable_type = 'Good'
hero_type = NULL
recordable_id = 2
created_at = '2009-12-03 08:30:00'

(The last two are obviously not dups. The first two are.) So what I want to do is get rid of the duplicate hero_status. Which one? The one with the most recent date.

I have three questions:

  1. How do I remove the duplicates using a SQL-only approach?

  2. How do I remove the duplicates using a pure Ruby solution? Something similar to this: http://stackoverflow.com/questions/2790004/removing-duplicate-objects.

  3. How do I put a validation in place to prevent duplicate entries in the future?

A: 

Sometimes you need to just roll up your sleeves and do some serious SQL to kill off all the ones you don't want. This is easy if it's a one shot thing, and not too hard to roll into a Rake task you can fire on demand.

For instance, to select all the distinct status records, it is reasonable to use something like the following:

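SELECT id FROM hero_statuses GROUP BY user_id, recordable_type, hero_type, recordable_id

Given that this yields one id per unique combination, you can go about removing all the ones you don't want. The inner SELECT is wrapped in a derived table because MySQL won't let a DELETE read from its own target table in a direct subquery:

DELETE FROM hero_statuses WHERE id NOT IN (SELECT id FROM (SELECT id FROM hero_statuses GROUP BY user_id, recordable_type, hero_type, recordable_id) AS keepers)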

As with any operation that involves DELETE FROM, I hope you don't just fire this off on your production data without the usual precautions of backing things up.

As to how to prevent this in the future, if these are unique constraints, create a unique index on them:

add_index :hero_statuses, [ :user_id, :recordable_type, :hero_type, :recordable_id ], :unique => true

This will generate ActiveRecord exceptions when you attempt to introduce a duplicate record. One benefit of a unique index is that you can make use of the "INSERT IGNORE INTO..." or "INSERT ... ON DUPLICATE KEY ..." features to recover from potential duplications.

tadman
AFAIK, standard SQL doesn't allow selecting columns which you don't have in your GROUP clause. Where does this work except in MySQL?
Mladen Jablanović
Could I do something like this in the model for validation? validates_uniqueness_of :user_id, :scope => :recordable_type, :hero_type, :recordable_id. It would have to work so that multiple users can have the same hero_status, but an individual user cannot have more than one occurrence of the hero_status.
keruilin
It was a MySQL question so I wasn't too worried about the GROUP BY columns not being present in the query. Another RDBMS may not support that, but I believe this is being embraced by other platforms.
tadman
As far as validation scopes go, don't forget to enclose them in square brackets and it should work: :scope => [ :x, :y, :z ]
tadman
A: 

For an SQL-only approach, I would use this query (I'm assuming the ids are unique):

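DELETE FROM HeroStatus WHERE id IN
(SELECT id FROM
   -- extra derived table: MySQL won't delete from a table the subquery reads directly
   (SELECT DISTINCT hs.id FROM
      (SELECT user_id, recordable_type, hero_type, recordable_id, MAX(created_at) AS created_at
         FROM HeroStatus
        GROUP BY user_id, recordable_type, hero_type, recordable_id
       HAVING COUNT(id) > 1) AS del
      INNER JOIN HeroStatus AS hs ON
      hs.user_id = del.user_id AND hs.recordable_type = del.recordable_type
       -- <=> is MySQL's NULL-safe equality, needed because hero_type can be NULL
       AND hs.hero_type <=> del.hero_type AND hs.recordable_id = del.recordable_id
       AND hs.created_at = del.created_at) AS newest)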

A bit of a monster! The query finds all duplicates using the natural key (user_id, recordable_type, hero_type, recordable_id) and, for each group of duplicates, selects the largest created_at value (the most recently created row). It then finds the ids of those rows (by joining back to the main table) and deletes the rows with those ids.

(Please try this on a copy of the table first and verify you get the results you want! :-)
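
For the pure Ruby part of the question, the same logic can be sketched in memory. This is only an illustration under some assumptions: HeroStatusRow is a hypothetical Struct standing in for the real model, the ids 1 to 4 are made up for the sample rows from the question, and the last row's date is read as December. It groups rows by the natural key and picks the most recent row from every group that has more than one member:

```ruby
# Hypothetical in-memory stand-in for a hero_statuses row (not an ActiveRecord model).
HeroStatusRow = Struct.new(:id, :user_id, :recordable_type, :hero_type,
                           :recordable_id, :created_at)

# Returns the id of the most recent row in every group of duplicates.
# Rows are grouped by the natural key; nil hero_types compare equal here.
def newest_duplicate_ids(rows)
  rows
    .group_by { |r| [r.user_id, r.recordable_type, r.hero_type, r.recordable_id] }
    .values
    .select { |group| group.size > 1 }              # keep only real duplicates
    .map { |group| group.max_by(&:created_at).id }  # newest row of each group
end

# Sample data from the question; the ids are assumed for illustration.
rows = [
  HeroStatusRow.new(1, 18, "Evil", "Halitosis", 1, Time.utc(2010, 5, 3, 18, 30, 30)),
  HeroStatusRow.new(2, 18, "Evil", "Halitosis", 1, Time.utc(2009, 3, 3, 15, 30, 0)),
  HeroStatusRow.new(3, 18, "Good", "Hugs",      1, Time.utc(2009, 2, 3, 12, 30, 0)),
  HeroStatusRow.new(4, 18, "Good", nil,         2, Time.utc(2009, 12, 3, 8, 30, 0)),
]

doomed = newest_duplicate_ids(rows)
puts doomed.inspect
```

In a Rails app the rows would presumably come from something like HeroStatus.all, and the resulting ids could then be handed to HeroStatus.delete. Like the SQL above, one pass removes only the newest row of each group, so a key with three or more copies needs repeated passes.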

To prevent this happening in future, add a unique index or constraint over the columns user_id, recordable_type, hero_type, recordable_id. E.g.

ALTER TABLE HeroStatus 
ADD UNIQUE (user_id, recordable_type, hero_type, recordable_id)

EDIT:

You add (and remove) this index within a migration like this:

add_index(:HeroStatus, [:user_id, :recordable_type, :hero_type, :recordable_id], :unique => true)
remove_index(:HeroStatus, :column => [:user_id, :recordable_type, :hero_type, :recordable_id], :unique => true)

Or, if you want to explicitly name it:

add_index(:HeroStatus, [:user_id, :recordable_type, :hero_type, :recordable_id], :unique => true, :name => :my_unique_index)
remove_index(:HeroStatus, :name => :my_unique_index)
mdma
How do I give this a name so that I can remove it in the rollback migration?
keruilin
I'm sorry I can't answer that - I don't know Ruby. Perhaps someone else can step in?
mdma
I updated the answer.
Mladen Jablanović