I have the following situation:

    String a = "A Web crawler is a computer program that browses the World Wide Web internet automatically";
    String b = "Web Crawler computer program browses the World Wide Web";

Is there any standard algorithm or idea for calculating the percentage of similarity?

For instance, in the above case, the similarity estimated by eye should be 90% or above.

My idea is to tokenize both strings and compare the number of tokens matched, something like (7 tokens / 10 tokens) * 100. But, of course, this method is not effective at all. Comparing the number of characters matched also seems ineffective...

Can anyone give some guidelines?

Above is part of my project, Plagiarism Analyzer.

Hence, the matched words will be exactly the same, without any synonyms.

The only issue in this case is how to calculate a reasonably accurate percentage of similarity.

Thanks a lot for any help.

+3  A: 

That depends on your idea of similarity. Formally, you need to define a metric of what you consider “similar” strings in order to apply statistics to them. Usually, this is done via the hypothetical question: “how likely is it that the second string is a modified version of the first string into which errors (e.g. from typing it) were introduced?”

A very simple yet effective measure of such similarity (or rather, its inverse) is the edit distance between two strings, which can be computed using dynamic programming in O(*nm*) time in general, where n and m are the lengths of the strings.
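For illustration, a minimal Java sketch of that dynamic program (the method name and layout are mine; it returns the classic Levenshtein distance):

    // Classic O(n*m) dynamic-programming edit distance (Levenshtein).
    static int editDistance(String s, String t) {
        int n = s.length(), m = t.length();
        int[][] d = new int[n + 1][m + 1];
        for (int i = 0; i <= n; i++) d[i][0] = i; // delete all of s
        for (int j = 0; j <= m; j++) d[0][j] = j; // insert all of t
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int cost = (s.charAt(i - 1) == t.charAt(j - 1)) ? 0 : 1;
                d[i][j] = Math.min(Math.min(
                        d[i - 1][j] + 1,          // deletion
                        d[i][j - 1] + 1),         // insertion
                        d[i - 1][j - 1] + cost);  // substitution
            }
        }
        return d[n][m];
    }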

Depending on your usage, more elaborate measures (or completely unrelated ones, such as the Soundex metric) might be required.

In your case, if you straightforwardly apply a token match (i.e. a mere word count) you will never get > 90% similarity. Getting such a high similarity in a meaningful way would require advanced semantic analysis. If you get that done, please publish the paper, because this is as yet a largely unsolved problem.

Konrad Rudolph
Actually, the question I asked here is part of my Plagiarism Analyzer project. I've managed to perform similarity analysis where a sentence is sent for analysis. For instance, out of 10 words, 7 words are found to match. Hence, the final outcome should be the percentage similarity, which is the pain in the neck for me. Can you give me an example regarding the O(*nm*) computation?
Mr CooL
There will definitely be no spelling mistakes and so on in my case. Anyway, I will try to write something that calculates the percentage as accurately as possible. Thanks for the info.
Mr CooL
@Mr CooL: your use case probably precludes edit distance, since that is always based on character similarity. For a plagiarism analyzer, John’s answer is probably the best easy-to-implement solution. However, I predict a very high false-positive rate, since there are only so many ways to express a technical property succinctly. Therefore, I would consider taking word order into account when calculating similarity.
Konrad Rudolph
A: 

The longest common subsequence is a well-known string dissimilarity metric, and it can be computed with dynamic programming.
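A minimal Java sketch of the standard LCS-length table (names are illustrative):

    // Length of the longest common subsequence of s and t, O(n*m).
    static int lcsLength(String s, String t) {
        int n = s.length(), m = t.length();
        int[][] dp = new int[n + 1][m + 1];
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                if (s.charAt(i - 1) == t.charAt(j - 1))
                    dp[i][j] = dp[i - 1][j - 1] + 1;                 // extend a common subsequence
                else
                    dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1]); // drop one character
            }
        }
        return dp[n][m];
    }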

abdelatif
+1  A: 

The problem with this question is that the similarity may be either human-perceived similarity (as you say, "+- 90% similarity") or statistical similarity (Konrad Rudolph's answer).

Human-perceived similarity can never be easily calculated. For instance, take these two triples of words:

cellphone car message

mobile automobile post

The statistical similarity is very low, while the meaning is actually quite similar. Thus it will be hard to solve this problem, and the only thing I can point you to is Bayesian filtering or artificial intelligence with Bayesian networks.

Pindatjuh
+1  A: 

I second what Konrad Rudolph has already said.

Others may recommend different distance metrics. What I'm going to say accompanies those, but looks more at the problem of matching semantics.

Given what you seem to be looking for, I recommend that you apply some of the standard text-processing methods. All of these have potential downfalls, so I list them in order of both application and difficulty to do well (a sketch of steps 2 and 3 follows the list):

  1. Sentence splitting: figure out your units of comparison.
  2. Stop-word removal: take out a, an, the, of, etc.
  3. Bag-of-words percentage: what percentage of the overall words match, independent of ordering.
  4. (Much more aggressive) synonym expansion, which counts synonyms as matched words.
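A minimal Java sketch of steps 2 and 3 (the tiny stop-word list and all names are placeholders for this sketch):

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    public class BagOfWords {
        // Placeholder stop-word list; a real one would be much larger.
        static final Set<String> STOP_WORDS =
                new HashSet<String>(Arrays.asList("a", "an", "the", "of", "is", "that"));

        // Lowercase, split on non-letters, drop stop words.
        static Set<String> bag(String text) {
            Set<String> words = new HashSet<String>();
            for (String w : text.toLowerCase().split("[^a-z]+")) {
                if (!w.isEmpty() && !STOP_WORDS.contains(w)) words.add(w);
            }
            return words;
        }

        // Percentage of a's words that also occur in b, independent of order.
        static double bagOfWordsPercent(String a, String b) {
            Set<String> wa = bag(a), wb = bag(b);
            if (wa.isEmpty()) return 0.0;
            int matched = 0;
            for (String w : wa) if (wb.contains(w)) matched++;
            return 100.0 * matched / wa.size();
        }
    }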
John the Statistician
Thanks a lot, John. I've thought about those you have mentioned. Can I ask your opinion on this idea: 1) calculate the tokens (words) of each string and compare; 2) I realize that if the difference is between 1 and 10, the percentage is about 70 to 90. Shall I use a simple if-else to determine the percentage? I'm kind of running out of time for my project, since the question asked here is just one part of it.
Mr CooL
If you are truly running out of time, my advice is: 1. remove the stop words; 2. calculate either or both of the bag-of-words percentage and a word-order-dependent percentage via edit distance (do whichever seems easiest to implement first); 3. create percentage thresholds (as with a simple if-else; a rough sketch follows); 4. compare actual plagiarized texts versus non-plagiarized ones and fix the percentages manually (if you have a lot of samples, first do this tuning on a subset of the documents, and use the rest to see how well it works). My advice is to iterate to the end as fast as you can, and then figure out how to be more sophisticated as you have time.
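A rough sketch of the threshold step in Java (the cut-off values are placeholders to be tuned against real samples, as described above):

    // Map a similarity percentage to a verdict; thresholds must be tuned.
    static String classify(double similarityPercent) {
        if (similarityPercent >= 90) return "very likely plagiarized";
        if (similarityPercent >= 70) return "suspicious";
        return "probably original";
    }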
John the Statistician
Okay. Thanks for the advice and ideas. ;)
Mr CooL
+1  A: 

One common measure is the Levenshtein distance, a special case of string edit distance. It is also included in the Apache Commons Lang StringUtils class.
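For instance, assuming Apache Commons Lang 2.x is on the classpath:

    import org.apache.commons.lang.StringUtils;

    // One substitution ('c' vs 'C'), so the distance is 1.
    int d = StringUtils.getLevenshteinDistance("Web crawler", "Web Crawler");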

bertolami
+3  A: 

As Konrad pointed out, your question depends heavily on what you mean by "similar". In general, I would say the following guidelines should be of use:

  • normalize the input by reducing each word to its base form and lowercasing it
  • use a word frequency list (easily obtainable on the web) and make each word's "similarity relevance" inversely proportional to its frequency, so that very common words contribute little
  • calculate the total sentence similarity as the aggregated relevance of the words appearing in both sentences divided by the total similarity relevance of the sentences (a sketch follows below)
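A minimal Java sketch of that aggregation (the frequency map, its loading, and all names are assumptions of this sketch):

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    public class WeightedOverlap {
        // word -> occurrences per million, loaded from a frequency list elsewhere.
        static Map<String, Double> frequency = new HashMap<String, Double>();

        // Rare words weigh a lot, common words very little; unseen words count as rare.
        static double relevance(String word) {
            Double f = frequency.get(word);
            return (f == null || f <= 0) ? 1.0 : 1.0 / f;
        }

        // Aggregated relevance of shared words over total relevance, as a percentage.
        static double similarityPercent(Set<String> a, Set<String> b) {
            double shared = 0, total = 0;
            Set<String> union = new HashSet<String>(a);
            union.addAll(b);
            for (String w : union) {
                double rel = relevance(w);
                total += rel;
                if (a.contains(w) && b.contains(w)) shared += rel;
            }
            return total == 0 ? 0 : 100.0 * shared / total;
        }
    }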

You can refine the technique to include differences between word forms, sentence word order, synonym lists, etc. Although you'll never get perfect results, you have a lot of tweaking possibilities, and I believe that in general you can get quite valuable measures of similarity.

Tomislav Nakic-Alfirevic
Yes, including inverse weightings for frequency is a good way to go. Stop word removal is likely the first-order approximation to this.
John the Statistician
Thanks, Tomislav! I will take that advice.
Mr CooL