views: 2408
answers: 5

Hello!

To compute the similarity between two documents, I create a feature vector containing the term frequencies. For the next step, however, I can't decide between cosine similarity and Hamming distance.

My question: Do you have experience with these algorithms? Which one gives you better results?

In addition to that: Could you tell me how to code cosine similarity in PHP? For Hamming distance, I've already got this code:

function check ($terms1, $terms2) {
    // Term frequencies of the first document
    $counts1 = array_count_values($terms1);
    $totalScore = 0;
    // For every term of document 2, add document 1's frequency of that term
    foreach ($terms2 as $term) {
        if (isset($counts1[$term])) $totalScore += $counts1[$term];
    }
    // Scale by 500 and normalize by the product of the document lengths
    return $totalScore * 500 / (count($terms1) * count($terms2));
}

I don't want to use any other algorithm; I'd just like help deciding between these two.

Also, maybe someone can say something about how to improve the algorithms. Will you get better results if you filter out stop words or other very common words?

I hope you can help me. Thanks in advance!

+5  A: 

I apologize for ignoring the fact that you said you didn't want to use any other algorithms, but seriously, Levenshtein distance and Damerau-Levenshtein distance are way more useful than Hamming distance. Here's a D-L distance implementation in PHP, and if you don't like PHP's native levenshtein() function, which I suspect you won't because it has a length limit, here's a version without that limit:

function levenshtein_distance($text1, $text2) {
    $len1 = strlen($text1);
    $len2 = strlen($text2);
    $distance = array();
    // Base cases: transforming to/from the empty string
    for ($i = 0; $i <= $len1; $i++)
        $distance[$i][0] = $i;
    for ($j = 0; $j <= $len2; $j++)
        $distance[0][$j] = $j;
    // Fill the matrix: cheapest of deletion, insertion, or substitution
    for ($i = 1; $i <= $len1; $i++) {
        for ($j = 1; $j <= $len2; $j++) {
            $distance[$i][$j] = min(
                $distance[$i - 1][$j] + 1,                                      // deletion
                $distance[$i][$j - 1] + 1,                                      // insertion
                $distance[$i - 1][$j - 1] + ($text1[$i - 1] != $text2[$j - 1])  // substitution
            );
        }
    }
    return $distance[$len1][$len2];
}
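As a quick sanity check, the matrix version above should agree with PHP's built-in levenshtein() on short inputs (the native function's limit only bites on long strings):

```php
<?php
// PHP's built-in levenshtein() agrees with the hand-rolled matrix version
// for short strings: kitten -> sitting takes 3 edits, flaw -> lawn takes 2.
echo levenshtein('kitten', 'sitting'), "\n"; // 3
echo levenshtein('flaw', 'lawn'), "\n";      // 2
```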
chaos
Thanks, but I think you misunderstood something. I DON'T want to use Hamming distance on the text itself; I want to apply it to the feature vector of the text. So I'd say it's more useful than Levenshtein here, isn't it? ;) But thank you for the code; I'm sure it will be useful to lots of users for other purposes.
Oops. I did fail to absorb the feature vector part. Never mind. :) Since you like the code, I'll leave the answer undeleted. I hope the downvoters will have mercy. :)
chaos
Yes, they have. There are more upvoters than downvoters. ;)
Levenshtein should be used to calculate the edit distance, so it depends on your needs. Take "ANNA FRED" and "FRED ANNA": Levenshtein will give a high distance, but cosine similarity (over the words) will say they are 100% similar. Similar or not? It depends on your needs.
Toto
+7  A: 

Unless I'm mistaken, I think your algorithm is halfway between the two. For Hamming distance, use:

function check ($terms1, $terms2) {
    $counts1 = array_count_values($terms1);
    $totalScore = 0;
    foreach ($terms2 as $term) {
        if (isset($counts1[$term])) $totalScore += 1;
    }
    return $totalScore * 500 / (count($terms1) * count($terms2));
}

(Note that you're only adding 1 for each matched element in the token vectors.)

And for cosine similarity, use:

function check ($terms1, $terms2) {
    $counts1 = array_count_values($terms1);
    $counts2 = array_count_values($terms2);
    $totalScore = 0;
    foreach ($terms2 as $term) {
        if (isset($counts1[$term])) $totalScore += $counts1[$term] * $counts2[$term];
    }
    return $totalScore / (count($terms1) * count($terms2));
}

(Note that you're adding the product of the token counts between the two documents.)

The main difference between the two is that cosine similarity yields a stronger signal when a word occurs multiple times in both documents, while Hamming distance doesn't care how often the individual tokens come up.
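To make that difference concrete, here is a minimal, unnormalized sketch of the two scoring rules (simplified from the functions above; the scaling and normalization are omitted, and the function names are mine):

```php
<?php
// Binary overlap (Hamming-style): count each shared term once.
function overlapScore(array $terms1, array $terms2): int {
    $counts1 = array_count_values($terms1);
    $score = 0;
    foreach (array_unique($terms2) as $term) {
        if (isset($counts1[$term])) $score += 1;   // presence only
    }
    return $score;
}

// Term-frequency product (cosine-style numerator): reward repetition.
function dotScore(array $terms1, array $terms2): int {
    $counts1 = array_count_values($terms1);
    $counts2 = array_count_values($terms2);
    $score = 0;
    foreach ($counts2 as $term => $n) {
        if (isset($counts1[$term])) $score += $counts1[$term] * $n;
    }
    return $score;
}

$doc1 = ['the', 'cat', 'the', 'mat'];
$doc2 = ['the', 'cat', 'sat'];
// Both documents share "the" and "cat", so the binary overlap is 2 ...
echo overlapScore($doc1, $doc2), "\n"; // 2
// ... but the dot product rewards the repeated "the": 2*1 + 1*1 = 3.
echo dotScore($doc1, $doc2), "\n";     // 3
```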

Edit: I just noticed your question about removing function words. I do advise this if you're going to use cosine similarity: since function words are quite frequent (in English, at least), you might skew a result by not filtering them out. With Hamming distance the effect won't be quite as great, but it can still be appreciable in some cases. Also, if you have access to a lemmatizer, it will reduce misses when one document contains "galaxies" and the other contains "galaxy", for instance.
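As a sketch, stop-word filtering can happen before building the term vectors; the word list below is a tiny illustrative sample, not a complete one:

```php
<?php
// Remove a small sample of English function/stop words before
// computing term frequencies. Real stop-word lists are much longer.
function removeStopWords(array $terms): array {
    $stopWords = ['the', 'a', 'an', 'of', 'and', 'is', 'to', 'in'];
    return array_values(array_diff($terms, $stopWords));
}

print_r(removeStopWords(['the', 'cat', 'is', 'on', 'the', 'mat']));
// Array ( [0] => cat [1] => on [2] => mat )
```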

Whichever way you go, good luck!

Mike
Thank you very much! So if I use a combination of both algorithms, does it also combine their advantages? Then it would be better than either algorithm alone, right? :) Or should I rather use one of your code examples? Your last sentence is quite interesting: so cosine similarity would be better for my purpose, right? Since it expresses a stronger relation between two texts when a word appears often, doesn't it?
@marco92w: I think cosine similarity is best in this case - also see my recent edit about function words. Your intuition was dead on there.
Mike
Thanks, the edit is informative, too. Last question :) What's the difference between cosine similarity and my algorithm (the code in the question)? Which one is better?
Your code, as it stands, favors document one - you are ignoring the token count from document two. Ideally, a similarity algorithm should be symmetric. If you apply the algorithm from A to B, you should get the same result as B to A. If you tried that with your code, you'd get different results. Now, if you have a "preferred" document, your code might just be what you're looking for!
Mike
My algorithm doesn't favor or prefer any text. See this example: http://paste.bradleygill.com/index.php?paste_id=9911 So there must be another difference!?
Try it with text1 = "the" and text2 = "the the". You'll get a factor of 2 difference.
Mike
Sorry, I don't know what you mean. Your example always gives back 500. So what's the problem?
Maybe I'm misunderstanding your code - is $counts1[$term] a hash that returns the token count for a given term?
Mike
$counts1 contains all the words of a text as keys and the number of their occurrences as values; this becomes clear if you look at what array_count_values() does. So maybe you're really misunderstanding my code!? So, what's better: my code (halfway between the two algorithms) or the cosine similarity?
No, that's exactly what I thought. But look at your code - what happens when "the" occurs once in text 1, but twice in text 2? The $totalScore variable would be different. Are you uniquing your $terms2 list? That's the only place left for confusion on my part. I was assuming the $terms2 list only had unique entries.
Mike
No, neither $terms1 nor $terms2 is unique. They contain duplicate values, so the new array (after array_count_values()) contains different numbers as values. For example: array("the" => 1) and array("the" => 2).
Oh! Then we're all set: just unique the $terms(1|2) arrays after you call array_count_values, and the code as I have it written will do the trick. For your purposes, I still think cosine similarity is the better measure if you're just doing text comparison. Do you mind if I ask what domain you're tackling?
Mike
Yes, then your code for cosine similarity works. Thx! :) But is it better than my "halfway between the two algorithms" code? What are the differences?
There is something strange in this cosine similarity function. Shouldn't the result be 1 in this case: echo check(array('a', 'b', 'c'), array('a', 'b', 'c')); ? Instead I get 0.333, which, by the way, is the same result as echo check(array('a', 'b', 'c'), array('a', 'b'));
Toto
Toto is correct. The vector norm calculations are incorrect for both distance functions.
outis
I wrote a little function to calculate the cosine similarity (Bottom of this page). Hope this helps. :)
Toto
A: 

Even though Hamming distance is easier to program, use Levenshtein distance instead...

It also allows you to compare strings of different lengths.

daniel
+4  A: 

Hi,

A Hamming distance should be computed between two strings of equal length, with the order of the elements taken into account.

As your documents are almost certainly of different lengths, and if word positions don't matter, cosine similarity is better (please note that, depending on your needs, even better solutions exist). :)
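For reference, the classic equal-length Hamming distance on strings can be sketched like this (my own minimal version, to make the length requirement concrete):

```php
<?php
// Classic Hamming distance: the number of positions at which two
// equal-length strings differ. Undefined for strings of unequal length.
function hammingDistance(string $a, string $b): int {
    if (strlen($a) !== strlen($b)) {
        throw new InvalidArgumentException('Strings must have equal length');
    }
    $dist = 0;
    for ($i = 0; $i < strlen($a); $i++) {
        if ($a[$i] !== $b[$i]) $dist++;
    }
    return $dist;
}

echo hammingDistance('karolin', 'kathrin'), "\n"; // 3
```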

Here is a cosine similarity function of 2 arrays of words:

function cosineSimilarity($tokensA, $tokensB)
{
    $a = $b = $c = 0;
    $uniqueTokensA = $uniqueTokensB = array();

    $uniqueMergedTokens = array_unique(array_merge($tokensA, $tokensB));

    foreach ($tokensA as $token) $uniqueTokensA[$token] = 0;
    foreach ($tokensB as $token) $uniqueTokensB[$token] = 0;

    foreach ($uniqueMergedTokens as $token) {
        $x = isset($uniqueTokensA[$token]) ? 1 : 0;
        $y = isset($uniqueTokensB[$token]) ? 1 : 0;
        $a += $x * $y;
        $b += $x;
        $c += $y;
    }
    return $b * $c != 0 ? $a / sqrt($b * $c) : 0;
}

It is fast (using isset() instead of in_array() makes a huge difference on large arrays).

As you can see, the result does not take into account the "magnitude" (frequency) of each word.

I use it to detect multi-posted messages and "almost" copy-pasted texts. It works well. :)
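A quick runnable sanity check of the values this function produces (the function is copied verbatim from above so the snippet is self-contained):

```php
<?php
// Binary cosine similarity over word sets (copy of the function above).
function cosineSimilarity($tokensA, $tokensB)
{
    $a = $b = $c = 0;
    $uniqueTokensA = $uniqueTokensB = array();
    $uniqueMergedTokens = array_unique(array_merge($tokensA, $tokensB));
    foreach ($tokensA as $token) $uniqueTokensA[$token] = 0;
    foreach ($tokensB as $token) $uniqueTokensB[$token] = 0;
    foreach ($uniqueMergedTokens as $token) {
        $x = isset($uniqueTokensA[$token]) ? 1 : 0;
        $y = isset($uniqueTokensB[$token]) ? 1 : 0;
        $a += $x * $y;
        $b += $x;
        $c += $y;
    }
    return $b * $c != 0 ? $a / sqrt($b * $c) : 0;
}

// Identical documents: cosine of 0 degrees, i.e. exactly 1.
echo cosineSimilarity(['a', 'b', 'c'], ['a', 'b', 'c']), "\n"; // 1
// Partial overlap: 2 / sqrt(3 * 2) ~ 0.816
echo round(cosineSimilarity(['a', 'b', 'c'], ['a', 'b']), 3), "\n"; // 0.816
```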

The best link about string similarity metrics: http://www.dcs.shef.ac.uk/~sam/stringmetrics.html

For further interesting readings:

http://www.miislita.com/information-retrieval-tutorial/cosine-similarity-tutorial.html

http://bioinformatics.oxfordjournals.org/cgi/content/full/22/18/2298

Toto
Thank you very much. :) But isn't Mike's solution (the selected answer) better? The code is shorter and seems to be as fast as yours. What are the differences?
Mike's function is not really exact. Try `echo check(array('a', 'b', 'c'), array('a', 'b', 'c'));` - it should return 1 (cos(0)), but his function returns 0.33. :(
Toto
Thanks a lot! :)
Is your function really correct? It gives 0.71 for [1, 1, 1] and [1, 1, 0], but http://www.miislita.com/searchito/binary-similarity-calculator.html gives 0.82?! Is it still necessary to divide the similarity value by the document length?
This tool is for *binary string* comparison. "My" function is for "documents of words". The result will not be the same. :)
Toto
OK, thanks. I was looking for some tool to compare against, since this time I want to be sure I have the correct function. ;) And I don't need to divide the value by the length of a document, since the length doesn't play a role in cosine similarity, right?
+1  A: 

Here is my corrected version of the cosine distance function posted by Toto:

function cosineSimilarity($tokensA, $tokensB)
{
    $a = $b = $c = 0;
    $uniqueTokensA = $uniqueTokensB = array();

    $uniqueMergedTokens = array_unique(array_merge($tokensA, $tokensB));

    foreach ($tokensA as $token) $uniqueTokensA[$token] = 0;
    foreach ($tokensB as $token) $uniqueTokensB[$token] = 0;

    foreach ($uniqueMergedTokens as $token) {
        $x = isset($uniqueTokensA[$token]) ? 1 : 0;
        $y = isset($uniqueTokensB[$token]) ? 1 : 0;
        $a += $x * $y;
        $b += pow($x,2);
        $c += pow($y,2);
    }
    return $b * $c != 0 ? $a / sqrt($b * $c) : 0;
}
LorenzoMarchi
$x (and $y) will always be 1 (the token exists) or 0 (it does not exist). In that case, pow($x, 2) will always return $x, so I removed it to save CPU. :)
Toto
So your versions are both correct, Lorenzo and Toto? They both work?