I'm looking for an algorithm that finds whether two text documents are similar, where one document is included in the other document.

I thank you in advance.

A: 

You can always use diff with diffstat. The diff documentation isn't precise about the algorithm(s) it uses, but the original authors wrote a paper about it (Google for diff paper), and you can always read the source code.

For more precise answers you will need a more precise question. Are you only interested in whether one document is a fragment of the other? Or also in whether one document can be split into pieces that each occur in the other, in the same order? Or also in how much material is left unmatched when you match up the two documents with a fast algorithm? diff will tell you all those things. Or do you want the absolute best matching? diff doesn't always give you that; you'll need something like Levenshtein distance. If one of the documents is much shorter than the other, you can use fast string searching algorithms. And so on.
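
As a rough illustration of the first three questions, here is a minimal sketch using Python's standard difflib module. It is not the algorithm diff itself uses, only a similar line-matching heuristic; the function names is_fragment and containment are made up for this example.

```python
from difflib import SequenceMatcher


def is_fragment(short_doc: str, long_doc: str) -> bool:
    """True if short_doc occurs verbatim, as one piece, inside long_doc."""
    return short_doc in long_doc


def containment(short_doc: str, long_doc: str) -> float:
    """Fraction of short_doc's lines matched, in order, by lines of long_doc.

    1.0 means every line of short_doc was found in long_doc in the same
    order (possibly with other lines in between); lower values mean some
    lines were left unmatched. SequenceMatcher is a heuristic, so this is
    not guaranteed to be the best possible matching.
    """
    a = short_doc.splitlines()
    b = long_doc.splitlines()
    if not a:
        return 1.0  # an empty document is trivially contained
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(a)
```

For example, containment("b\nc", "a\nb\nc\nd") is 1.0, while a short document with one extra line that is absent from the long one scores below 1.0.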

reinierpost
Hi, I'm using an algorithm based on cosine similarity with TF-IDF to find out whether document X is similar to document Y. But if they are similar, I want to know more about X and Y: I want to know whether X contains Y, that is, whether the information in Y is included within X. The most important aspect for me is the semantics of the inclusion, not only the occurrence of the terms of document Y in document X. Many thanks.
hort
Apparently you are characterizing documents by the vectors of occurrences of keywords they contain, possibly after some normalization to weed out insignificant words, normalize word forms and spelling, etc. One idea is to just compare the sets of keywords; a fuzzy comparison that gives you a percentage of inclusion is probably better. My gut feeling is that attempts to introduce more knowledge of language, e.g. to recognize synonyms, are unlikely to help you more than they hinder you. There is no doubt a large literature on this in the IR field, but I don't know it.
reinierpost
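
The keyword-set comparison suggested in the last comment could look roughly like the sketch below. It assumes a tiny illustrative stopword list and a naive tokenizer; the names STOPWORDS, keywords and inclusion are hypothetical, and a real setup would reuse whatever normalization the TF-IDF pipeline already applies.

```python
import re

# Illustrative stopword list only; a real one would be much longer.
STOPWORDS = frozenset({"a", "an", "the", "of", "in", "on", "is", "and", "to"})


def keywords(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of text, minus stopwords."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower())
            if w not in STOPWORDS}


def inclusion(y: str, x: str) -> float:
    """Fraction of Y's keywords that also occur in X (1.0 = all of them)."""
    ky, kx = keywords(y), keywords(x)
    if not ky:
        return 1.0  # nothing to look for, so trivially included
    return len(ky & kx) / len(ky)
```

Note that the measure is asymmetric: inclusion(y, x) can be 1.0 while inclusion(x, y) is much lower, which is what distinguishes "X contains Y" from plain similarity. It still only compares terms, not meaning.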