Hello!
To compute the similarity between two documents, I create a feature vector containing the term frequencies. But then, for the next step, I can't decide between "Cosine similarity" and "Hamming distance".
My question: Do you have experience with these algorithms? Which one gives you better results?
In addition: could you tell me how to code cosine similarity in PHP? For Hamming distance, I've already got the code:
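function check($terms1, $terms2) {
    // Guard against empty input to avoid a division by zero below.
    if (count($terms1) === 0 || count($terms2) === 0) return 0;
    // Term frequencies of the first document.
    $counts1 = array_count_values($terms1);
    $totalScore = 0;
    // For each term occurrence in the second document, add how often that term appears in the first.
    foreach ($terms2 as $term) {
        if (isset($counts1[$term])) $totalScore += $counts1[$term];
    }
    // Normalise by the product of the document lengths; 500 is a constant scaling factor.
    return $totalScore * 500 / (count($terms1) * count($terms2));
}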
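For comparison, this is roughly how I picture cosine similarity on the same kind of input: the dot product of the two term-frequency vectors divided by the product of their lengths. It is only a sketch, assuming both documents are already split into arrays of string terms; the function name cosineSimilarity is a placeholder I made up, not something from my existing code.

```php
<?php
// Hypothetical helper: cosine similarity of two documents given as arrays of terms.
// Assumes the terms are strings (or integers), as required by array_count_values().
function cosineSimilarity(array $terms1, array $terms2): float {
    // Term-frequency vectors, keyed by term.
    $counts1 = array_count_values($terms1);
    $counts2 = array_count_values($terms2);

    // Dot product: only terms present in both documents contribute.
    $dot = 0;
    foreach ($counts1 as $term => $count) {
        if (isset($counts2[$term])) $dot += $count * $counts2[$term];
    }

    // Squared Euclidean lengths of both vectors.
    $norm1 = 0;
    foreach ($counts1 as $count) $norm1 += $count * $count;
    $norm2 = 0;
    foreach ($counts2 as $count) $norm2 += $count * $count;

    // An empty document has no direction; treat the similarity as 0.
    if ($norm1 == 0 || $norm2 == 0) return 0.0;

    return $dot / (sqrt($norm1) * sqrt($norm2));
}
```

With term frequencies the result would lie between 0 (no shared terms) and 1 (same term proportions). For example, cosineSimilarity(['a', 'b', 'a'], ['a', 'b', 'b']) should come out at about 0.8.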
I don't want to use any other algorithm. I would only like help deciding between these two.
And maybe someone can say something about how to improve the algorithms. Do you get better results if you filter out stop words or other common words?
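By filtering I mean something like the following, applied to both term arrays before they are compared. Again just a sketch: the function name removeStopWords and the tiny stop-word list in the example are made up for illustration, and a real list would be much longer.

```php
<?php
// Hypothetical helper: drop stop words from a list of terms (case-insensitive).
function removeStopWords(array $terms, array $stopWords): array {
    // Flip the list so each stop word becomes a key and can be looked up with isset().
    $lookup = array_flip(array_map('strtolower', $stopWords));

    $kept = array_filter($terms, function ($term) use ($lookup) {
        return !isset($lookup[strtolower($term)]);
    });

    // array_filter() preserves the original keys; reindex from 0.
    return array_values($kept);
}
```

For example, removeStopWords(['The', 'cat', 'and', 'the', 'dog'], ['the', 'and']) would leave ['cat', 'dog'].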
I hope you can help me. Thanks in advance!