



It is said that the matrices LSI produces (U, A and V) bring together documents that contain synonyms. For example, if we search for "car", we also get documents that contain "automobile". But LSI is nothing but matrix manipulation. It only takes frequency into account, not semantics. So what is behind this magic that I am missing? Please explain.


According to the Wikipedia article, "LSI is based on the principle that words that are used in the same contexts tend to have similar meanings." That is, if two words seem to be used interchangeably, they might be synonyms.

It's not infallible.
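
To make the "same contexts" idea concrete, here is a minimal sketch using NumPy. The six tiny documents, the choice of k = 2, and all variable names are made up for illustration and are not from any particular LSI implementation. In this toy matrix, "car" and "automobile" never appear in the same document, but both appear alongside "engine" and "wheel". "banana" occurs exactly as often as "car" overall, just in different documents.

```python
import numpy as np

terms = ["car", "automobile", "engine", "wheel", "banana", "fruit"]

# Toy term-document matrix (assumed data): rows are terms, columns are documents.
A = np.array([
    [1, 0, 1, 0, 0, 0],  # car
    [0, 1, 0, 1, 0, 0],  # automobile
    [1, 1, 1, 0, 0, 0],  # engine
    [1, 1, 0, 1, 0, 0],  # wheel
    [0, 0, 0, 0, 1, 1],  # banana
    [0, 0, 0, 0, 1, 1],  # fruit
], dtype=float)


def cosine(x, y):
    """Cosine similarity between two vectors."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))


# Full SVD, then keep only the k strongest dimensions (the LSI truncation step).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2  # arbitrary choice for this toy example
term_vecs = U[:, :k] * s[:k]  # one k-dimensional row per term

car, auto, banana = (terms.index(t) for t in ("car", "automobile", "banana"))

raw_car_auto = cosine(A[car], A[auto])                     # raw rows share no document
lsi_car_auto = cosine(term_vecs[car], term_vecs[auto])     # compared in the latent space
lsi_car_banana = cosine(term_vecs[car], term_vecs[banana])

print("raw car/automobile:", raw_car_auto)
print("LSI car/automobile:", round(lsi_car_auto, 3))
print("LSI car/banana:    ", round(lsi_car_banana, 3))
```

On this data the raw similarity between "car" and "automobile" is zero, while in the truncated space they end up pointing in essentially the same direction, and "banana" stays unrelated despite having the same total count as "car". What matters is the pattern of occurrences across documents, not the overall frequency. Real corpora are far noisier, so the separation is rarely this clean.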

Jason Orendorff
I don't understand how it detects the similarity between "car" and "automobile". If it is based on frequency, then suppose some other word (with a completely different meaning) has the same frequency as "car" and "automobile"; it will falsely match that word to "car" as well.
It looks like LSI is going to flag two words as semantically related if their frequencies are correlated, across many documents. It seems like a pretty naive way of doing it, yes. There are people trying to do something similar but going down to the sentence level to see which words are related, which offhand sounds more promising.
Jason Orendorff
Incidentally, if someone is trying to sell you LSI, *don't buy it*.
Jason Orendorff
Actually I am studying LSI for my project on search in P2P networks.

LSI basically creates a frequency profile of each document and looks for documents with similar frequency profiles. If the rest of the frequency profile is similar enough, it will classify two documents as fairly similar, even if one systematically substitutes some words. Conversely, if the frequency profiles differ, it can classify documents as different even if they share frequent use of a few specific terms (e.g., "file" referring to a computer file in some cases and to a tool used to cut and smooth metal in others).

LSI is also typically used with relatively large groups of documents. The other documents can help in finding similarities as well -- even if document A and B look substantially different, if document C uses quite a few terms from both A and B, it can help in finding that A and B are really fairly similar.
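
As a rough illustration of the document A/B/C effect, here is a small NumPy sketch. The five documents, k = 2, and the query fold-in formula (projecting the query with the truncated U and dividing by the singular values) are assumptions chosen for the example, not a description of any specific LSI library. Document A uses only "car" and "engine", document B uses only "automobile" and "motor", and document C uses all four terms.

```python
import numpy as np

terms = ["car", "engine", "automobile", "motor", "banana", "fruit"]
docs = ["A", "B", "C", "D", "E"]

# Toy term-document matrix (assumed data): rows are terms, columns are documents.
X = np.array([
    [1, 0, 1, 0, 0],  # car
    [1, 0, 1, 0, 0],  # engine
    [0, 1, 1, 0, 0],  # automobile
    [0, 1, 1, 0, 0],  # motor
    [0, 0, 0, 1, 1],  # banana
    [0, 0, 0, 1, 1],  # fruit
], dtype=float)


def cosine(x, y):
    """Cosine similarity between two vectors."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))


U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2  # arbitrary choice for this toy example
U_k, s_k = U[:, :k], s[:k]
doc_vecs = Vt[:k].T  # one k-dimensional row per document

# Documents A and B share no terms, so their raw similarity is zero.
raw_a_b = cosine(X[:, 0], X[:, 1])
lsi_a_b = cosine(doc_vecs[0], doc_vecs[1])

# A query containing only the word "car", folded into the same latent space.
q = np.zeros(len(terms))
q[terms.index("car")] = 1.0
q_hat = (U_k.T @ q) / s_k

raw_scores = {d: cosine(q, X[:, j]) for j, d in enumerate(docs)}
lsi_scores = {d: cosine(q_hat, doc_vecs[j]) for j, d in enumerate(docs)}

print("A vs B, raw:", raw_a_b, " LSI:", round(lsi_a_b, 3))
for d in docs:
    print(d, "raw:", round(raw_scores[d], 3), " LSI:", round(lsi_scores[d], 3))
```

In this toy setup, document B never mentions "car", so a plain term match scores it at zero, yet in the truncated space it scores about as high as document A because document C ties the two vocabularies together. The fruit documents D and E stay near zero either way.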

Jerry Coffin