Hi, sorry if my question sounds stupid :) Can you please recommend any pseudocode or a good algorithm for implementing LSI in Java? I am not a math expert. I tried to read some articles on Wikipedia and other websites about LSI (latent semantic indexing), but they were full of math. I know LSI is full of math, but I understand things more easily when I see some source code or an algorithm. That's why I asked here, because there are so many gurus here! Thanks in advance.

A: 

This may be a bit late, but I have always liked Sujit Pal's blog post, http://sujitpal.blogspot.com/2008/09/ir-math-with-java-tf-idf-and-lsi.html, and I have written a bit on my site if you are interested.

The process is way less complicated than it is often written up to be. Really, all you need is a library that can do singular value decomposition (SVD) of a matrix.

If you are interested, I can explain it in a few short takeaways:

1) You create a matrix (or dataset, etc.) of word counts for your documents: the documents are the columns and the distinct words are the rows.

2) Once you've created the matrix, you use a library like Jama (for Java) or SmartMathLibrary (for C#) and run the singular value decomposition. All this does is take your original matrix and break it up into three matrices that essentially represent your words, a kind of multiplier (sigma, the singular values), and your documents. The columns of the word and document matrices are called the singular vectors.

3) Once you have your word, document, and sigma matrices, you shrink them equally (down to some size k) by just copying the leading part of each one, and then multiply them back together. Shrinking them kind of normalizes your data, and this is LSI.
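To make the three steps concrete, here is a minimal sketch in plain Java. Everything in it is an assumption made for illustration: the class name `LsiSketch`, the method names `termDocumentMatrix` and `rankK`, and the toy documents are all made up. To keep it free of dependencies it does not call an SVD library; it swaps in simple power iteration to pull out the k strongest components one at a time. That gives roughly the same result as running a library SVD, shrinking the three matrices to k, and multiplying them back, and in a real project you would probably let the library do that part.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

class LsiSketch {

    /** Step 1: rows are distinct words, columns are documents, cells are raw counts. */
    public static double[][] termDocumentMatrix(List<String> docs, List<String> vocabOut) {
        Map<String, Integer> rowOf = new LinkedHashMap<>();
        List<String[]> tokens = new ArrayList<>();
        for (String doc : docs) {
            String[] words = doc.toLowerCase().split("\\W+");
            tokens.add(words);
            for (String w : words) {
                if (!w.isEmpty()) rowOf.putIfAbsent(w, rowOf.size());
            }
        }
        double[][] a = new double[rowOf.size()][docs.size()];
        for (int j = 0; j < tokens.size(); j++) {
            for (String w : tokens.get(j)) {
                if (!w.isEmpty()) a[rowOf.get(w)][j] += 1.0;
            }
        }
        vocabOut.addAll(rowOf.keySet());
        return a;
    }

    /** Steps 2 and 3: keep only the k strongest components and multiply them back together. */
    public static double[][] rankK(double[][] a, int k) {
        int m = a.length, n = a[0].length;
        double[][] residual = new double[m][];
        for (int i = 0; i < m; i++) residual[i] = a[i].clone();
        double[][] approx = new double[m][n];
        Random rnd = new Random(42); // fixed seed so runs are repeatable
        for (int c = 0; c < k; c++) {
            // Power iteration: v converges to the top "document" vector of the residual.
            double[] v = new double[n];
            for (int j = 0; j < n; j++) v[j] = rnd.nextDouble() + 0.01;
            scale(v, 1.0 / norm(v));
            for (int it = 0; it < 500; it++) {
                double[] w = timesTransposed(residual, times(residual, v)); // R^T (R v)
                double len = norm(w);
                if (len < 1e-12) break; // nothing left to extract
                scale(w, 1.0 / len);
                v = w;
            }
            double[] sigmaU = times(residual, v); // sigma * u, the scaled "word" vector
            if (norm(sigmaU) < 1e-12) break;      // matrix has rank below k
            for (int i = 0; i < m; i++) {
                for (int j = 0; j < n; j++) {
                    double piece = sigmaU[i] * v[j];
                    approx[i][j] += piece;   // add this component to the approximation
                    residual[i][j] -= piece; // and remove it before finding the next one
                }
            }
        }
        return approx;
    }

    static double[] times(double[][] a, double[] x) { // a * x
        double[] y = new double[a.length];
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < x.length; j++) y[i] += a[i][j] * x[j];
        return y;
    }

    static double[] timesTransposed(double[][] a, double[] x) { // a^T * x
        double[] y = new double[a[0].length];
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < y.length; j++) y[j] += a[i][j] * x[i];
        return y;
    }

    static double norm(double[] x) {
        double s = 0;
        for (double xi : x) s += xi * xi;
        return Math.sqrt(s);
    }

    static void scale(double[] x, double f) {
        for (int i = 0; i < x.length; i++) x[i] *= f;
    }

    public static void main(String[] args) {
        List<String> vocab = new ArrayList<>();
        double[][] counts = termDocumentMatrix(
                Arrays.asList("cat dog", "dog dog bird", "bird song"), vocab);
        double[][] lsi = rankK(counts, 2); // k = 2 is an arbitrary choice for the demo
        for (int i = 0; i < lsi.length; i++)
            System.out.println(vocab.get(i) + " " + Arrays.toString(lsi[i]));
    }
}
```

The matrix that comes back has the same shape as the word-count matrix, but each cell is now a smoothed value rather than a raw count. You would then compare documents by comparing their columns, for example with cosine similarity. How to pick k is not covered here; it is something you have to tune for your data.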

Here are some fairly clear resources:

http://puffinwarellc.com/index.php/news-and-articles/articles/30-singular-value-decomposition-tutorial.html

http://lsa.colorado.edu/papers/JASIS.lsi.90.pdf

http://www.soe.ucsc.edu/classes/cmps290c/Spring07/proj/Flynn_talk.pdf

Hope this helps you out a bit.

Eric

Eric Ness
Hi, I have learned a lot in the meantime, but your answer is still very helpful, +1. I saw Sujit Pal's blog too. It is good, but I don't agree with his results. I asked him why it reports two documents as 100% the same when there is no context similarity between them. He couldn't answer it. Now I am looking at how I can use LDA instead of LSI. Is it possible to use LDA for this purpose?