views: 389
answers: 2
Hi again :) I developed a Java program to calculate cosine similarity on the basis of TF*IDF. It worked very well, but there is one problem... :(

For example: if I have the following two matrices and I want to calculate cosine similarity, it does not work because the rows are not the same length:

doc 1
1 2 3
4 5 6

doc 2
1 2 3 4 5 6
7 8 5 2 4 9

If the rows and columns are the same length my program works very well, but it does not when their lengths differ.

Any tips?

+3  A: 

I'm not sure of your implementation, but the cosine similarity of two vectors is equal to the normalized dot product of those vectors.

The dot product of two matrices can be expressed as a · b = aᵀb. As a result, if the matrices have different dimensions, you can't take the dot product to find the cosine.

Now, in a standard TF*IDF approach the matrix should be indexed by (term, document); as a result, any term not appearing in a given document should appear as a zero in your matrix.
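One way to fix the length mismatch from the question is to index both documents against a single shared term list, so every vector comes out the same length, with zeroes for absent terms. A minimal Java sketch (class and method names are my own, not from the asker's program):

```java
import java.util.*;

public class SharedVocabulary {
    // Collect every term that appears in any document into one sorted list.
    @SafeVarargs
    public static List<String> vocabularyOf(Map<String, Integer>... docs) {
        Set<String> terms = new TreeSet<>();
        for (Map<String, Integer> doc : docs) terms.addAll(doc.keySet());
        return new ArrayList<>(terms);
    }

    // Project one document's term counts onto the shared vocabulary;
    // terms the document lacks become 0, so all vectors have equal length.
    public static double[] toVector(Map<String, Integer> termCounts,
                                    List<String> vocabulary) {
        double[] v = new double[vocabulary.size()];
        for (int i = 0; i < vocabulary.size(); i++) {
            v[i] = termCounts.getOrDefault(vocabulary.get(i), 0);
        }
        return v;
    }

    public static void main(String[] args) {
        Map<String, Integer> doc1 = Map.of("cat", 1, "dog", 2);
        Map<String, Integer> doc2 = Map.of("dog", 1, "fish", 3);
        List<String> vocab = vocabularyOf(doc1, doc2);
        System.out.println(vocab);                          // [cat, dog, fish]
        System.out.println(Arrays.toString(toVector(doc1, vocab)));
    }
}
```

With both documents projected onto the same vocabulary, their vectors are always the same length and the dot product is well defined.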

Now the way you have it set up seems to suggest there are two different matrices for your two documents. I'm not sure if this is your intent, but it seems incorrect.

On the other hand if one of your matrices is supposed to be your query, then it should be a vector and not a matrix, so that the transpose produces the correct result.

A full explanation of TF*IDF follows:

Ok, in classic TF*IDF you construct a term-document matrix a. Each value in matrix a is written a(i,j), where i is the term and j is the document. This value is a combination of local, global, and normalization weights (although if you normalize your documents, the normalization weight should be 1). Thus a(i,j) = f(i,j) * D/d(i), where f(i,j) is the frequency of word i in doc j, D is the total number of documents, and d(i) is the number of documents containing term i.
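As a small arithmetic sketch of the weight formula above (using the D/d(i) form given here; note that many TF*IDF variants use log(D/d(i)) instead), with a hypothetical helper method:

```java
public class TfIdf {
    // a(i,j) = f(i,j) * D / d(i), following the formula in the answer:
    //   frequency    = f(i,j), frequency of term i in doc j
    //   totalDocs    = D, total number of documents
    //   docsWithTerm = d(i), number of documents containing term i
    public static double weight(int frequency, int totalDocs, int docsWithTerm) {
        if (docsWithTerm == 0) return 0.0;   // term occurs in no document
        return frequency * (double) totalDocs / docsWithTerm;
    }

    public static void main(String[] args) {
        // A term appearing 3 times in a doc, in 2 of 10 documents overall:
        System.out.println(weight(3, 10, 2)); // 15.0
    }
}
```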

Your query is a vector of terms designated b. Each entry b(i,q) refers to term i in query q: b(i,q) = f(i,q), where f(i,q) is the frequency of term i in query q. In this case each query is a vector, and multiple queries form a matrix.

We can then compute the unit vectors of each, so that taking the dot product produces the correct cosine. To obtain the unit vectors, divide both the matrix a and the query b by their Frobenius norms.

Finally, we compute the cosine by taking the transpose of the vector b for a given query, so there is one query (one vector) per calculation. This is written bᵀa. The result is a vector with a score for each document, where a higher score denotes a higher document rank.
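For a single pair of equal-length vectors, the normalized dot product described above reduces to the following Java sketch (the method name is illustrative, not from the asker's program):

```java
public class Cosine {
    // cos(a, b) = (a · b) / (||a|| * ||b||). The vectors must be the same
    // length, which indexing both documents by one shared term set guarantees.
    public static double similarity(double[] a, double[] b) {
        if (a.length != b.length)
            throw new IllegalArgumentException("vectors must have equal length");
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) return 0.0;  // avoid division by zero
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        System.out.println(similarity(new double[]{1, 0}, new double[]{1, 0})); // 1.0
        System.out.println(similarity(new double[]{1, 0}, new double[]{0, 1})); // 0.0
    }
}
```

Dividing by the norms inside the method is equivalent to taking the unit vectors first, so the vectors can be passed in unnormalized.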

tzenes
Thanks for your answer. Yes, my first matrix is the query. What is the difference between a query and a vector? Aren't they almost the same? I am using one document as the query and the second as the target; that's why I calculate TF*IDF for the target and for the query separately. I can't use the transpose because I don't know in advance what the row and column counts will be (the matrices in the question are just an example :) ). Can you please explain a little more how I can solve my problem? Should I create one TF*IDF matrix for both the query and the target? If yes, then how would I calculate the cosine?
I'll try to add to my answer
tzenes
@agazerboy was that sufficient or did you need more of an explanation?
tzenes
A: 

Hi, can you please send me the Java program for finding cosine similarity on a set of documents? My email address is [email protected]

arvindh