ansaurus

Question

Answer 1

+3 A:

That depends on your idea of similarity. Formally, you need to define a metric of what you consider “similar” strings to apply statistics to them. Usually, this is done via the hypothetical question: “how likely is it that the first string is a modified version of the first string where errors (e.g. by typing it) were introduced?”

A very simple yet effective measure for such similarity (or rather, the inverse) is the edit distance of two strings which can be computed using dynamic programming, which takes time O(*nm*) in general, where n and m are the lengths of the strings.

Depending on your usage, more elaborate measures (or completely unrelated, such as the soundex metric) measures might be required.

In your case, if you straightforwardly apply a token match (i.e. mere word count) you will never get a > 90% similarity. To get such a high similarity in a meaningful way would require advanced semantical analysis. If you get this done, please publish the paper because this is as yet a largely unsolved problem.

Konrad Rudolph 2010-03-06 16:01:46

Actually the question I asked now is part of my Plagiarism Analyzer project...I've managed to perform similarity analysis where a sentence is sent for analysis.....For instance, in 10 words, 7 words are found matched...Hence, the final outcome of this would the percentage similarity which is the pain of neck for me.Can you gimme example on the regarding the O(*nm*) ?

Mr CooL 2010-03-06 16:12:49

There will be definitely no mistake of spelling and so on in my case.Anyway, I would try to write one that calculate the percentage as accurate as possible...Thanks for the info..

Mr CooL 2010-03-06 16:14:14

@Mr CooL: your use case probably precludes edit distance since that is always based on character similarity. For a plagiarism analyzer, John’s answer is probably the best, easy to implement solution. However, I predict a very high false-positive rate since there are only so many ways to express a technical property succinctly. Therefore, I would consider taking word order into account when calculating similarity.

Konrad Rudolph 2010-03-06 16:17:27

Answer 2

A:

The Longest Common Sub-sequence is a well known as a string dis-similarity metric, which is implemented in Dynamic Programming

abdelatif 2010-03-06 16:07:56

Answer 3

+1 A:

The problem with this question is: the similarity may be either a humanized-similarity (as you say "+- 90% similarity") or a statistical-similarity (Kondrad Rudolph's answer).

The human-similarity can never be easily calculated: for instance these three words

cellphone car message

mobile automobile post

The statistical-similarity is very low, while actually it's quite similar. Thus: it'll be hard to solve this problem, and the only think I can point you to is a Bayesian filtering or Artificial Intelligence with Bayesian networks.

Pindatjuh 2010-03-06 16:10:12

Answer 4

+1 A:

I second what Konrad Rudolf has already said.

Others may recommend different distance metrics. What I'm going to say accompanies those, but looks more at the problem of matching semantics.

Given what you seem to be looking for, I recommend that you apply some of the standard text processing methods. All of these have potential downfalls, so I list them in order of both application and difficulty to do well

Sentence splitting. Figure out your units of comparison.
stop-word removal: take out a, an, the, of, etc.
bag of words percentage: what percentage of the overall words match, independent of ordering
(much more aggressive) you could try synonym expansion, which counts synonyms as matched words.

John the Statistician 2010-03-06 16:10:29

Thanks a lot John.I've thought about those you have mentioned.Can I ask your opinion that, I have the idea:1) Calculate the tokens (words) of each string and compare.2) I realize if the difference is in between 1 to 10, the chances of percentage is about 70 to 90.Shall I use simple if else to determine the percentage. I'm kinda running out of time for my project since the question asked here is just one part of my project.

Mr CooL 2010-03-06 16:29:55

If you are truly running out of time, my advice is 1. remove the stop-words 2. calculate either/both % for bag-of-words and word-order dependent via edit dist (do whichever seems easiest to implement first) 3. create percentage thresholds (as with a simple if-else) 4. compare actual plagarized texts versus non and fix the percentages manually (if you have a lot of samples, first do this tuning on a subset of the documents, and use the rest to see how well this works). My advice is to iterate to the end as fast as you can, and then figure out how to be more sophisticated as you have time

John the Statistician 2010-03-06 16:47:42

Okay. Thanks for the advices and ideas. ;)

Mr CooL 2010-03-06 17:14:57

Answer 5

+1 A:

One common measure is the Levenshtein distance, a special case of the string edit distance. It is also included in the apache string util library

bertolami 2010-03-06 16:12:08

Answer 6

+3 A:

As Konrad pointed out, your question depends heavily on what you mean by "similar". In general, I would say the following guidelines should be of use:

normalize the input by reducing a word to it's base form and lowercase it
use a word frequency list (obtainable easily on the web) and make the word's "similarity relevance" inversly proportional to it's position on the frequency list
calculate the total sentence similarity as an aggregated similarity of the words appearing in both sentences divided by the total similarity relevance of the sentences

You can refine the technique to include differences between word forms, sentence word order, synonim lists etc. Although you'll never get perfect results, you have a lot of tweaking possibilities and I believe that in general you might get quite valuable measures of similarity.

Tomislav Nakic-Alfirevic 2010-03-06 16:34:17

Yes, including inverse weightings for frequency is a good way to go. Stop word removal is likely the first-order approximation to this.

John the Statistician 2010-03-06 17:10:35

Thanks Tomislav~! I would take those advices....

Mr CooL 2010-03-06 17:15:26

ansaurus

tags:

views:

answers:

Percentage Similarity Analysis (Java)

related questions