views: 181

answers: 4
I periodically fetch the latest tweets with a certain hashtag and save them locally. To prevent saving duplicates, I use the method below. Unfortunately, it does not seem to be working... so what's wrong with this code:

    def remove_duplicates
      before = @tweets.size
      @tweets.delete_if { |tweet| !Tweet.all(:conditions => { :twitter_id => tweet.twitter_id }).empty? }
      duplicates = before - @tweets.size
      puts "#{duplicates} duplicates found"
    end

Where @tweets is an array of Tweet objects fetched from Twitter. I'd appreciate any solution that works, especially one that might be more elegant...

+2  A: 

You can use validates_uniqueness_of :twitter_id in the Tweet model (where this code should live). This will cause duplicates to fail validation and not be saved.

Ben Hughes
validates_uniqueness_of :twitter_id is not a good solution on its own. Between the time it checks for the existence of the record and the time it creates the new record, another process might create a duplicate. You should always use this method in conjunction with a unique database index.
Simone Carletti
@weppos: Since I have only one sequential job writing tweets, this is not a problem. This seems to be the most DRY solution. It worked well on SQLite3, but in production with MySQL it does not seem to notice duplicates... looking into it now.
effkay
For actual safety, you should put uniqueness constraints on the database and be ready to handle any exceptions that are thrown.
Ben Hughes
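
A minimal sketch of the two suggestions combined (the migration name is assumed; the exception classes are those of Rails 2.x-era ActiveRecord, matching the question's syntax, and may differ in newer versions):

    # app/models/tweet.rb
    class Tweet < ActiveRecord::Base
      # Application-level check; not race-safe on its own.
      validates_uniqueness_of :twitter_id
    end

    # A migration adding the backing database constraint.
    class AddUniqueIndexToTweets < ActiveRecord::Migration
      def self.up
        add_index :tweets, :twitter_id, :unique => true
      end

      def self.down
        remove_index :tweets, :twitter_id
      end
    end

    # Saving then becomes a matter of letting duplicates fail:
    @tweets.each do |tweet|
      begin
        tweet.save!  # raises RecordInvalid on a validation failure
      rescue ActiveRecord::RecordInvalid, ActiveRecord::StatementInvalid
        # StatementInvalid covers a race that slips past the validation
        # and hits the unique index; skip the duplicate either way.
      end
    end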
A: 

    array.uniq!

Removes duplicate elements from self. Returns nil if no changes are made (that is, no duplicates are found).

won't help for duplicates in the database.
Ben Hughes
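
Worth noting: a plain uniq! compares whole objects, so two distinct Tweet instances carrying the same twitter_id would not be removed. For the in-memory duplicates, a keyed variant is closer to what the question needs (a sketch, assuming Ruby 1.9+, where uniq! accepts a block):

    # Dedupe the in-memory array by twitter_id; does nothing about
    # duplicates already stored in the database.
    @tweets.uniq! { |tweet| tweet.twitter_id }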
+1  A: 

Since it sounds like you're using the Twitter search API, a better solution is to use the since_id parameter. Keep track of the last Twitter status id you got from your previous query and use it as the since_id parameter on your next query.

More information is available at Twitter Search API Method: search

Ryan McGeary
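
A sketch of since_id tracking against that API (the endpoint and JSON field names reflect the old, unauthenticated Twitter Search API and may differ in current versions; fetch_new_tweets and @last_seen_id are illustrative names):

    require 'net/http'
    require 'json'
    require 'cgi'

    def fetch_new_tweets(query, since_id = nil)
      url = "http://search.twitter.com/search.json?q=#{CGI.escape(query)}"
      url << "&since_id=#{since_id}" if since_id
      JSON.parse(Net::HTTP.get(URI.parse(url)))["results"]
    end

    results = fetch_new_tweets("#rails", @last_seen_id)
    # Remember the highest status id seen, so the next poll only
    # returns tweets newer than it.
    @last_seen_id = results.map { |r| r["id"] }.max unless results.empty?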
A: 

Ok, turns out the problem was of a different nature: when I looked closer, I found that multiple Tweets were saved with the twitter_id 2147483647... which is the upper limit of a signed 32-bit integer field :)

Changing the field to bigint solved the problem. It took me a long time to figure out because MySQL failed silently, clamping the value to the maximum for as long as it could (until I added the unique index). I quickly tried it with Postgres, which returned a nice "integer out of range" error that pointed me to the real cause of the problem.
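
A sketch of the fix as a migration (the migration name is assumed; on MySQL, Rails maps :limit => 8 on an integer column to BIGINT):

    class WidenTwitterIdOnTweets < ActiveRecord::Migration
      def self.up
        # :limit => 8 produces a BIGINT column on MySQL.
        change_column :tweets, :twitter_id, :integer, :limit => 8
      end

      def self.down
        change_column :tweets, :twitter_id, :integer
      end
    end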

Thanks Ben for the validation and indexing tips, as they led to much cleaner code!

effkay