Hi,
I'm looking for ideas on how to best match two hash tables containing string key/value pairs.
Here's the actual problem I'm facing: structured data comes in and is imported into the database. I need to UPDATE records that are already in the DB; however, ANY value in the source can change, so I don't have a reliable ID.
I'm thinking of fuzzy matching two rows, one from the source and one from the DB, and making an "educated" guess as to whether the row should be updated or inserted.
Any ideas would be greatly appreciated.
Solution
The solution is based on Ben Robinson's post. It works pretty well, tolerates small mismatches here and there, and supports custom key-based weights.
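require 'rubygems'
require 'amatch'

class Hash
  # Returns the weighted average of the per-key Levenshtein similarities
  # between self and hash. key_weights maps a key to its weight
  # (default 1); a weight of 0 skips that key.
  def fuzzy_match(hash, key_weights = {})
    sum_total = 0
    sum_weights = 0
    self.keys.each do |key|
      weight = key_weights[key] || 1
      next if weight == 0
      weight *= 10000 # constant factor; it cancels out in the division below
      match = self[key].to_s.levenshtein_similar(hash[key].to_s) * weight
      sum_total += match
      sum_weights += weight
    end
    return 0.0 if sum_weights == 0 # no weighted keys: avoid 0.0 / 0.0 (NaN)
    sum_total.to_f / sum_weights.to_f
  end
end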
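
To get from a score to the update-or-insert decision in the question, one option is to score the incoming row against each candidate DB row and update the best one only if it clears a threshold. This is a rough sketch rather than part of the solution above: `levenshtein`, `similarity`, `weighted_score`, `best_match`, the sample rows and the 0.8 threshold are all made up for illustration. A plain-Ruby Levenshtein stands in for amatch so the snippet runs without the gem, and its scores may not match `levenshtein_similar` exactly.

```ruby
# Plain-Ruby Levenshtein distance (illustrative stand-in for amatch).
def levenshtein(a, b)
  return b.length if a.empty?
  return a.length if b.empty?
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      # deletion, insertion, substitution
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Normalized similarity: 1.0 for identical strings, lower as they diverge.
def similarity(a, b)
  longest = [a.length, b.length].max
  return 1.0 if longest.zero?
  1.0 - levenshtein(a, b).to_f / longest
end

# Weighted mean of per-key similarities (same idea as fuzzy_match).
def weighted_score(source, candidate, key_weights = {})
  total = 0.0
  weights = 0.0
  source.each do |key, value|
    weight = key_weights.fetch(key, 1)
    next if weight.zero?
    total += similarity(value.to_s, candidate[key].to_s) * weight
    weights += weight
  end
  weights.zero? ? 0.0 : total / weights
end

# Returns [row, score] for the best candidate, or [nil, score] when no
# candidate reaches the (assumed) threshold, which would mean INSERT.
def best_match(source, db_rows, key_weights = {}, threshold = 0.8)
  scored = db_rows.map { |row| [row, weighted_score(source, row, key_weights)] }
  row, score = scored.max_by { |_, s| s }
  score && score >= threshold ? [row, score] : [nil, score]
end

# Hypothetical sample data.
db_rows = [
  { "name" => "John Smith",  "city" => "Boston" },
  { "name" => "Alice Jones", "city" => "Denver" }
]
incoming = { "name" => "Jon Smith", "city" => "Boston" }

row, score = best_match(incoming, db_rows, { "name" => 2 })
puts row ? "UPDATE #{row.inspect} (score #{score.round(3)})" : "INSERT"
```

The threshold is the part that needs tuning against real data: too low and distinct records get merged, too high and every small typo causes a duplicate insert.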