Detecting similar words among n text documents

+1 Q:

Hi,

I have n documents and want to find the words they have in common. For example, I want to be able to say that (n-3) of the documents include the word "web".

I can certainly do this with basic data structures, but there may be a more efficient algorithm, or a way to handle the same word with different suffixes. Is there an algorithm for this purpose?

I am unfamiliar with the data mining world. Is there a general term for the task of finding similarities between different documents? If there is, it will make my research easier.

Thanks.

+1 A:

I suppose you are talking about stemming. If you want to use the R language, have a look at the tm package.

If not, I can only suggest this list of text mining tools.

gd047 2010-03-18 12:26:31
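To give a rough idea of what stemming does, here is a minimal Python sketch. It is not the Porter algorithm or what the tm package implements; the suffix list and the function name `naive_stem` are made up for illustration, and a real stemmer applies many more rules.

```python
# Naive suffix stripping, for illustration only.
# The rule set below is an assumption, deliberately kept tiny.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word: str) -> str:
    """Lowercase a word and strip the first matching suffix, if any."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    for w in ["Web", "webs", "linked", "linking", "is"]:
        print(w, "->", naive_stem(w))
```

With this, "web" and "webs" map to the same stem, so they can be counted as the same word. For anything serious an established stemmer is a better choice than hand-written rules like these.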

You can do this by building a word list with counts for each document, sorting each list alphabetically, and then comparing the lists. The sorting step makes this O(n lg n) in the number of words.
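A variation on the word-list idea, sketched in Python under some assumptions: instead of sorting and comparing lists pairwise, it counts in how many documents each word appears, which answers the "(n-3) documents include the word" question directly. The function names and the simple tokenizing regex are illustrative choices, not part of the original suggestion.

```python
import re
from collections import Counter


def document_frequency(documents):
    """Map each word to the number of documents that contain it."""
    df = Counter()
    for text in documents:
        # A set makes each word count once per document.
        # The regex is a crude tokenizer: runs of ASCII letters and digits.
        words = set(re.findall(r"[a-z0-9]+", text.lower()))
        df.update(words)
    return df


def words_in_at_least(documents, k):
    """Return, in sorted order, the words found in at least k documents."""
    df = document_frequency(documents)
    return sorted(word for word, count in df.items() if count >= k)


if __name__ == "__main__":
    docs = [
        "The web is big",
        "Web pages link to web pages",
        "A spider spins a web",
        "No match here",
    ]
    print(words_in_at_least(docs, len(docs) - 1))
```

Each word is touched once, so the counting pass is roughly linear in the total number of words. A stemmer could be applied to every token before it goes into the set, so that different suffixes of one word are counted together.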

Another approach is to use the full text search as provided by your database of choice.

Will 2010-03-18 12:30:03
