I have following situation:
String a = "A Web crawler is a computer program that browses the World Wide Web internet automatically"; String b = "Web Crawler computer program browses the World Wide Web";
Is there any idea or standard algorithm to calculate the percentage of similarity?
For instance, above case, the similarity estimated by manual looking should be 90%++.
My idea is to tokenize both Strings and compare the number of tokens matched. Something like (7 tokens /1 0 tokens) * 100. But, of course, it is not effective at all for this method. Compare number of characters matched also seem to be not effective....
Can anyone give some guidelines???
Above is part of my project, Plagiarism Analyzer.
Hence, the words matched will be exactly same without any synonyms.
The only matters in this case is that how to calculate a quite accurate percentage of similarity.
Thanks a lot for any helps.