I have a project where I need to compare multi-chapter documents against a second document to determine how similar they are. The problem is that I have no idea how to go about this, what approaches exist, or whether there are any libraries available.
My first question is: what counts as similar? The number of words that match, or the number of consecutive words that match?
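To make those two options concrete, here is a rough sketch of what I have in mind. It is only a guess at a starting point, not something I know to be the right approach. The function names (`words`, `shingles`, `jaccard`), the choice of Jaccard overlap as the score, and the run length of 3 words are all my own assumptions.

```python
import re

def words(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def shingles(tokens, n):
    """Return the set of all runs of n consecutive words."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b):
    """Shared items divided by total distinct items, from 0.0 to 1.0."""
    if not a and not b:
        return 1.0  # assumption: two empty documents count as identical
    return len(a & b) / len(a | b)

# Made-up sample text; real input would be a chapter and the second document.
doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown cat jumps over the lazy dog"

tokens_a, tokens_b = words(doc_a), words(doc_b)

# Option 1: how many distinct words match.
word_score = jaccard(set(tokens_a), set(tokens_b))

# Option 2: how many runs of 3 consecutive words match.
phrase_score = jaccard(shingles(tokens_a, 3), shingles(tokens_b, 3))

print(f"word overlap:   {word_score:.3f}")
print(f"phrase overlap: {phrase_score:.3f}")
```

With this sample the word-level score comes out much higher than the consecutive-word score, because one changed word breaks every 3-word run that contains it. That difference is exactly why I am unsure which definition of "similar" I should be using.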
I could see writing a parser that puts each document into an array of words with their locations, and then comparing the arrays.
I saw the earlier question at http://stackoverflow.com/questions/220187/algorithms-or-libraries-for-textual-analysis-specifically-dominant-words-phras
but it seems somewhat different from what I am attempting to do.
Any options or pointers people may have would be great!