Hi there
I'm trying to script Dice's Coefficient, but I'm having a bit of a problem with the array intersection.
def bigram(string)
  string.downcase!
  bgarray=[]
  bgstring="%"+string+"#"
  bgslength = bgstring.length
  0.upto(bgslength-2) do |i|
    bgarray << bgstring[i,2]
   end
   return bgarray
 end
def approx_string_match(teststring, refstring)
  test_bigram = bigram(teststring) #.uniq
  ref_bigram = bigram(refstring)   #.uniq
  bigram_overlay = test_bigram & ref_bigram
  result = (2*bigram_overlay.length.to_f)/(test_bigram.length.to_f+ref_bigram.length.to_f)*100
  return result
end
The problem is, as & removes duplicates, I get stuff like this:
string1="Almirante Almeida Almada"
string2="Almirante Almeida Almada"
puts approx_string_match(string1, string2) => 76.0%
It should return 100.
The uniq method nails it, but there is information loss, which may bring unwanted matches in the particular dataset I'm working.
How can I get an intersection with all duplicates included?