Hi

Currently I have two large vectors of 50+ strings each.

I want to compare these two Vectors and work out how similar they are. I think I need to use cosine similarity?

Does anyone know of any methods that take two Java Vectors and give a value between 0 and 1 indicating how similar they are?

Thanks, Phil

+2  A: 

Have a look at the similarity function in Lucene. From its documentation:

"The above formula is motivated by the cosine-distance or dot-product between document and query vector."

Here's a previous SO question on this topic.

Joel
+2  A: 

See the Apache Mahout library for implementations of Cosine Distance and related approaches. Also consider looking up Locality Sensitive Hashing for a much speedier alternative.
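To make the Locality Sensitive Hashing suggestion a bit more concrete, here is a minimal SimHash-style sketch, one LSH scheme aimed at cosine similarity. This is not Mahout's API: the class name, the method names and the choice of hash function are all made up for illustration, and a 64-bit signature only gives a rough estimate.

```java
import java.util.Collection;

// Hypothetical sketch of a SimHash-style signature; not taken from any library.
class SimHashSketch {
    static final int BITS = 64;

    // 64-bit FNV-1a over the characters, then a splitmix64-style mix to spread the bits.
    static long hash64(String word) {
        long h = 0xcbf29ce484222325L;
        for (int i = 0; i < word.length(); i++) {
            h ^= word.charAt(i);
            h *= 0x100000001b3L;
        }
        h ^= h >>> 30; h *= 0xbf58476d1ce4e5b9L;
        h ^= h >>> 27; h *= 0x94d049bb133111ebL;
        h ^= h >>> 31;
        return h;
    }

    // Each word votes +1 or -1 on every bit position; repeated words vote repeatedly,
    // so the word counts act as weights. The sign of each tally becomes one signature bit.
    static long signature(Collection<String> words) {
        int[] tally = new int[BITS];
        for (String word : words) {
            long h = hash64(word);
            for (int i = 0; i < BITS; i++) {
                tally[i] += ((h >>> i) & 1L) == 1L ? 1 : -1;
            }
        }
        long sig = 0L;
        for (int i = 0; i < BITS; i++) {
            if (tally[i] > 0) sig |= 1L << i;
        }
        return sig;
    }

    // The fraction of differing bits approximates angle / pi, so cos() maps it back
    // to an estimate of the cosine similarity.
    static double estimateCosine(long sigA, long sigB) {
        int differing = Long.bitCount(sigA ^ sigB);
        return Math.cos(Math.PI * differing / BITS);
    }
}
```

Comparing two signatures is one XOR plus a bit count, which is where the speed comes from. Note that the estimate can dip below 0 through sampling noise, even though the exact cosine of two word-count vectors never does.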

bmargulies
+1  A: 

Do the following:

package com.example;

import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

/**
 * Computes the cosine similarity between two bags of words:
 * dot(a, b) / (norm(a) * norm(b)) over the word-count vectors.
 * 1.0 is most similar, 0.0 is least similar.
 */
public class Cosine {

    public static double cosine(Collection<String> a, Collection<String> b) {
        Map<String, Integer> aa = asBag(a);
        Map<String, Integer> bb = asBag(b);
        // Dot product over the words the two bags share.
        double sum = 0;
        for (String word : aa.keySet()) {
            if (!bb.containsKey(word)) continue;
            sum += aa.get(word) * bb.get(word);
        }
        double norms = norm(aa) * norm(bb);
        // An empty bag has norm 0; return 0.0 instead of NaN from 0/0.
        return norms == 0 ? 0.0 : sum / norms;
    }

    /** Euclidean length of the word-count vector. */
    private static double norm(Map<String, Integer> bag) {
        double sum = 0;
        for (int each : bag.values()) sum += each * each;
        return Math.sqrt(sum);
    }

    /** Counts how often each word occurs. */
    private static Map<String, Integer> asBag(Collection<String> vector) {
        Map<String, Integer> bag = new HashMap<String, Integer>();
        for (String word : vector) {
            if (!bag.containsKey(word)) bag.put(word, 0);
            bag.put(word, bag.get(word) + 1);
        }
        return bag;
    }

}

Type inference, anyone?
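On the type-inference point: Java 10 and later accept `var` for local variables, and `Map.merge` and `Map.getOrDefault` (Java 8) shorten the counting. Below is a rough sketch of the same bag-of-words cosine under those assumptions. The class name `CosineSketch` is made up here, and returning 0.0 for an empty input is a choice of this sketch, not something the question specifies.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical class name; needs Java 10+ because of `var`.
class CosineSketch {

    static double cosine(Collection<String> a, Collection<String> b) {
        var aa = asBag(a);
        var bb = asBag(b);
        double dot = 0;
        for (var entry : aa.entrySet()) {
            // Words missing from the other bag count as 0 and contribute nothing.
            int countA = entry.getValue();
            int countB = bb.getOrDefault(entry.getKey(), 0);
            dot += (double) countA * countB;
        }
        double norms = norm(aa) * norm(bb);
        // Assumption: an empty input is treated as "not similar" instead of yielding NaN.
        return norms == 0 ? 0.0 : dot / norms;
    }

    // Euclidean length of the word-count vector.
    private static double norm(Map<String, Integer> bag) {
        double sum = 0;
        for (int count : bag.values()) sum += (double) count * count;
        return Math.sqrt(sum);
    }

    // Counts occurrences of each word.
    private static Map<String, Integer> asBag(Collection<String> words) {
        var bag = new HashMap<String, Integer>();
        for (String word : words) bag.merge(word, 1, Integer::sum);
        return bag;
    }

    public static void main(String[] args) {
        // Shares "a" only; the counts are {a=2, b=1} and {a=1, c=1}.
        System.out.println(cosine(List.of("a", "b", "a"), List.of("a", "c")));
        // No words in common.
        System.out.println(cosine(List.of("x"), List.of("y")));
    }
}
```

Because of floating-point rounding, two identical inputs can come out a hair under 1.0, so compare results with a small tolerance and not with `==`.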

Adrian
Excellent! Will give it a try now, thanks.
Phil
I just wrote it on the spot, so beware: it's untested.
Adrian