Hi, how do I find the cosine similarity between vectors? I need it to measure the relatedness between two lines of text. Can someone help me with the code: which Java classes and methods should I use? For example, I have two sentences, (1) "system for user interface" and (2) "user interface machine", and their respective vectors after tf-idf followed by normalisation using LSI are, for example, [1, 0.5] and [0.5, 1]. How do I measure the similarity between these vectors?

+2  A: 

Have a look at: http://en.wikipedia.org/wiki/Cosine_similarity.

If you have vectors A and B, the similarity is defined as:

cosine(theta) = (A . B) / (||A|| ||B||)

For a vector A = (a1, a2), ||A|| is defined as sqrt(a1^2 + a2^2)

For vectors A = (a1, a2) and B = (b1, b2), A . B is defined as a1 b1 + a2 b2.

So for vectors A = (a1, a2) and B = (b1, b2), the cosine similarity is given by:

(a1 b1 + a2 b2) / (sqrt(a1^2 + a2^2) sqrt(b1^2 + b2^2))

Example:

A = (1, 0.5), B = (0.5, 1)

cosine(theta) = (0.5 + 0.5) / (sqrt(5/4) sqrt(5/4)) = 1 / (5/4) = 4/5
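
For readers who want the formula above in Java without any library, here is a minimal sketch. The class name VectorSimilarity and the method name cosineSimilarity are made up for illustration and do not come from any library mentioned in this thread; the handling of zero vectors and mismatched lengths is an assumption about what is sensible, not part of the definition.

```java
/** Minimal, dependency-free cosine similarity for dense vectors (illustrative sketch). */
final class VectorSimilarity {

    private VectorSimilarity() {}

    /**
     * Returns (a . b) / (||a|| * ||b||).
     * Throws IllegalArgumentException if the lengths differ or either vector has zero norm.
     */
    static double cosineSimilarity(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double dot = 0.0;
        double normASquared = 0.0;
        double normBSquared = 0.0;
        // Accumulate the dot product and both squared norms in a single pass.
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normASquared += a[i] * a[i];
            normBSquared += b[i] * b[i];
        }
        if (normASquared == 0.0 || normBSquared == 0.0) {
            throw new IllegalArgumentException("Cosine similarity is undefined for a zero vector");
        }
        return dot / (Math.sqrt(normASquared) * Math.sqrt(normBSquared));
    }

    public static void main(String[] args) {
        double[] a = {1.0, 0.5};
        double[] b = {0.5, 1.0};
        // The worked example above: approximately 0.8 (that is, 4/5).
        System.out.println(cosineSimilarity(a, b));
    }
}
```

Note that both norms are computed per vector, so the denominator never mixes components of A with components of B.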
Gamecat
Hi, I've understood the process. I am looking for help with my Java code.
OK, I'll just leave it here as a reference for other readers.
Gamecat
+1  A: 

When I was working with text mining some time ago, I was using the SimMetrics library which provides an extensive range of different metrics in Java. If it happened that you need more, then there is always R and CRAN to look at.

But coding it from the description on Wikipedia is a rather trivial task, and can be a nice exercise.

Anonymous
+1  A: 

For matrix code in Java I'd recommend using the Colt library. If you have this, the code looks like (not tested or even compiled):

import cern.colt.matrix.DoubleMatrix1D;
import cern.colt.matrix.impl.DenseDoubleMatrix1D;

DoubleMatrix1D a = new DenseDoubleMatrix1D(new double[]{1, 0.5});
DoubleMatrix1D b = new DenseDoubleMatrix1D(new double[]{0.5, 1});
double cosineSimilarity = a.zDotProduct(b) / Math.sqrt(a.zDotProduct(a) * b.zDotProduct(b));

The code above could also be altered to use one of the Blas.dnrm2() methods or Algebra.DEFAULT.norm2() for the norm calculation. The result is exactly the same; which version is more readable depends on taste.

Nick Fortescue
+3  A: 
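import Jama.Matrix;

public class CosineSimilarity extends AbstractSimilarity {

  @Override
  protected double computeSimilarity(Matrix sourceDoc, Matrix targetDoc) {
    // Assuming each document is a single-column matrix with non-negative
    // entries (such as tf-idf weights), norm1() of the element-wise product
    // equals the dot product.
    double dotProduct = sourceDoc.arrayTimes(targetDoc).norm1();
    // Product of the two Euclidean (Frobenius) norms.
    double normProduct = sourceDoc.normF() * targetDoc.normF();
    return dotProduct / normProduct;
  }
}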

I did some tf-idf work recently for my Information Retrieval unit at university. I used this cosine similarity method, which uses Jama (the Java Matrix Package).

For the full source code, see "IR Math with Java : Similarity Measures", a really good resource that covers a good few different similarity measures.

Mark Davidson
Sounds perfect, thanks.
A: 

Hi, I am computing the cosine similarity between documents. I did it like this:

D1 = (8, 0, 0, 1), where 8, 0, 0, 1 are the tf-idf scores of the terms t1, t2, t3, t4

D2 = (7, 0, 0, 1)

cos(theta) = (56 + 0 + 0 + 1) / sqrt(64 + 49) sqrt(1 + 1)

which comes out to be

cos(theta) = 5

Now what do I conclude from this value? I don't understand what cos(theta) = 5 signifies about the similarity between them. Am I doing things right? Please reply.

jaskirat
This is not a place to ask your own questions; this is where you answer this question.
Sean Owen
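
As a check on the D1 and D2 example above: a cosine always lies between -1 and 1, so a result of 5 points to an arithmetic slip rather than to a property of the documents. Each norm is taken over the components of a single vector, giving sqrt(64 + 1) for D1 and sqrt(49 + 1) for D2, not sqrt(64 + 49) and sqrt(1 + 1). The sketch below recomputes the value under that reading; the class and method names are hypothetical and chosen only for this illustration.

```java
/** Recomputes the D1/D2 example with per-vector norms (illustrative sketch). */
final class DocumentCosineCheck {

    private DocumentCosineCheck() {}

    /** Dot product of two vectors of equal length. */
    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    /** Euclidean norm, computed from the components of one vector only. */
    static double norm(double[] v) {
        return Math.sqrt(dot(v, v));
    }

    /** Cosine of the angle between a and b. */
    static double cosine(double[] a, double[] b) {
        return dot(a, b) / (norm(a) * norm(b));
    }

    public static void main(String[] args) {
        double[] d1 = {8, 0, 0, 1}; // tf-idf scores of t1..t4 for D1
        double[] d2 = {7, 0, 0, 1}; // tf-idf scores of t1..t4 for D2
        // 57 / (sqrt(65) * sqrt(50)), which is approximately 0.9998
        System.out.println(cosine(d1, d2));
    }
}
```

A value this close to 1 means the two tf-idf vectors point in almost the same direction, that is, the documents are nearly identical in term weighting.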