Hi All,
First of all,thanks for reading my question.
I used TF/IDF then on those values, I calculated cosine similarity to see how many documents are more similar. You can see the following matrix. Column names are like doc1, doc2, doc3 and rows names are same like doc1, doc2, doc3 etc. With the help of following matrix, I can see that doc1 and doc4 has 72% similarity (0.722711142). It is correct even if I see both documents they are similar. I have 1000 documents and I can see each document freq. in matrix to see how many of them are similar. I used different clustering like k-means and agnes ( hierarchy) to combine them. It made clusters. For example Cluster1 has (doc4, doc5, doc3) becoz they have values (0.722711142, 0.602301766, 0.69912109) more close respectively. But when I see manually if these 3 documents are realy same so they are NOT. :( What am I doing or should I use something else other than clustering??????
1 0.067305859 -0.027552299 0.602301766 0.722711142
0.067305859 1 0.048492904 0.029151952 -0.034714695
-0.027552299 0.748492904 1 0.610617214 0.010912109
0.602301766 0.029151952 -0.061617214 1 0.034410392
0.722711142 -0.034714695 0.69912109 0.034410392 1
P.S: The values can be wrong, it is just to give you an idea. If you have any question please do ask. Thanks