Hi all, I used tf-idf to calculate the cosine similarity between two documents. It has some limitations and does not perform very well.
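
For reference, the tf-idf cosine baseline described above can be sketched roughly as follows. This is only a minimal illustration using the Python standard library: the whitespace tokenizer, the smoothed idf formula, and the function names `tfidf_vectors` and `cosine` are assumptions made for the example, not the code actually used in the question.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document (raw tf * smoothed idf)."""
    # Assumption: naive lowercase whitespace tokenization, no stemming or stop words.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock prices fell sharply",
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # documents sharing several terms
print(cosine(vecs[0], vecs[2]))  # documents sharing no terms
```

Because the score depends only on shared terms, two documents about the same subject that use different words score zero, which is one of the limitations that motivates topic models such as LSA or LDA.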

I looked into LDA (latent Dirichlet allocation) as a way to calculate document similarity, but I don't know much about it, and I couldn't find much material on my problem either.

Can you please point me to a tutorial related to my problem? Or can you give me some advice on how I can achieve this task with LDA?

Thanks

P.S. Is there any source code available to perform such a task with LDA?

A: 

You might be thinking of LSA (Latent Semantic Analysis), which is a very common solution to this kind of problem.

Pace
Hi Pace, thanks for your reply. Yes, I know about LSA and I have also implemented it. I used the JAMA package for SVD, but I ran into a problem: if my matrix has fewer rows than columns, it doesn't work :(. Can you tell me of any other SMALL SVD package?
+1  A: 

Have you had a look at Lucene and Mahout?

This might be useful - Latent Dirichlet Allocation with Lucene and Mahout.

Binary Nerd
Thanks. Can you also please tell me whether it is possible to calculate the similarity between two documents with the help of LDA? Most people say it is used for unsupervised clustering :(
Sorry, I don't know enough about LDA to offer an expert's answer to that; it's not a part of Mahout that I've used. However, my understanding of clustering is that you're grouping objects based on some similarity measure, which in this case would come from LDA.
Binary Nerd
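
To make the "similarity measure from LDA" idea above concrete: a trained LDA model assigns each document a probability distribution over topics, and two documents can then be compared by comparing those distributions. The sketch below assumes the topic distributions have already been produced by some LDA implementation (for example the Mahout one linked above). The numbers are made up purely for illustration, and the Hellinger distance is just one common choice, with Jensen-Shannon divergence being another. The function names are hypothetical.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions.

    Returns 0.0 for identical distributions and 1.0 for distributions
    with no overlapping probability mass.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of topics")
    total = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(total / 2.0)


def topic_similarity(p, q):
    """Turn the distance into a similarity score in [0, 1]."""
    return 1.0 - hellinger(p, q)


# Hypothetical per-document topic mixtures over 3 topics (illustrative values,
# not the output of any real model). Each list sums to 1.
doc_a = [0.70, 0.20, 0.10]
doc_b = [0.65, 0.25, 0.10]
doc_c = [0.05, 0.15, 0.80]

print(topic_similarity(doc_a, doc_b))  # similar topic mixtures
print(topic_similarity(doc_a, doc_c))  # very different topic mixtures
```

Unlike the tf-idf cosine, this comparison does not require the two documents to share any exact words, only to be assigned similar topics by the model.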