I'm looking for an algorithm that finds whether two text documents are similar, where one document is included in the other document.
I thank you in advance.
You can always use diff together with diffstat. The diff documentation isn't precise about which algorithm(s) it uses, but the authors of the original Unix diff, Hunt and McIlroy, wrote a paper about it ("An Algorithm for Differential File Comparison"; search for "diff paper"), and you can always read the source code.
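If you would rather get diffstat-style numbers from a script than from the command line, here is a rough sketch using Python's standard difflib module. Note the assumptions: difflib uses its own matching heuristic, not the algorithm diff uses, so the counts can differ from what diff and diffstat report, and the function name line_change_counts is just something I made up for illustration.

```python
import difflib


def line_change_counts(a_text, b_text):
    """Return (added, removed) line counts going from a_text to b_text.

    Hypothetical helper: roughly what diffstat summarizes, but computed
    with difflib's matcher rather than with diff itself.
    """
    a_lines = a_text.splitlines()
    b_lines = b_text.splitlines()
    # autojunk=False disables difflib's "popular element" heuristic,
    # which can otherwise skip frequently repeated lines in long inputs.
    matcher = difflib.SequenceMatcher(None, a_lines, b_lines, autojunk=False)
    added = removed = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            removed += i2 - i1  # lines present only in a_text
        if tag in ("insert", "replace"):
            added += j2 - j1    # lines present only in b_text
    return added, removed
```

If removed comes back as 0, every line of the first document was matched, in order, somewhere in the second one, which is one reasonable reading of "included in".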
For a more precise answer you will need to ask a more precise question. Are you only interested in knowing whether one document is a fragment of the other? Or do you also want to know whether one can be split into pieces that each occur in the other document, in the same order? Or do you also want to know how much material is left unmatched when a fast algorithm lines up the two documents? diff will tell you all of those things. Or do you want the absolute best matching? diff doesn't always give you that; for that you'll need something like Levenshtein distance. If one of the documents is much shorter than the other, you can use fast string searching algorithms. And so on.
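To make the different readings of the question concrete, here is a minimal Python sketch of three of them: plain containment, pieces occurring in order, and Levenshtein distance. The function names are my own, and these are straightforward textbook versions under the assumption that both documents fit in memory as strings. The Levenshtein version takes time proportional to the product of the two lengths, so it is only practical for fairly short texts.

```python
def is_fragment(short, long):
    """True if `short` occurs verbatim somewhere inside `long`."""
    # Python's `in` operator on strings uses a fast substring search.
    return short in long


def pieces_occur_in_order(pieces, text):
    """True if every piece occurs in `text`, in order, without overlap."""
    pos = 0
    for piece in pieces:
        found = text.find(piece, pos)
        if found == -1:
            return False
        # Continue searching after the end of this match.
        pos = found + len(piece)
    return True


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep the stored row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]
```

For example, levenshtein("kitten", "sitting") gives 3. To apply the same idea to whole documents you could pass lists of lines instead of strings, since the function only indexes and compares elements.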