Hi!

I have two texts (max 4000 characters each) of different lengths, and I need to compute a similarity rate based on (partial) paraphrasing. Please note that the same portion of text can appear at a different position in each text (so Levenshtein is not the solution).

The comparison process should also:

  • not grow exponentially with text size
  • be performance friendly. :)

It seems that the "adaptive local alignment of keywords" is a possible solution.

Do you have an example implementation? The preferred language is PHP, but I can translate. :)

Do you have any other solution/idea/experience on that topic?

Thanks for your great help.

+4  A: 

Take a look at the levenshtein and similar_text functions, which should make your life easier.

EDIT: @Toto has pointed out that those may not be suitable for this application, see his comments below.

karim79
Levenshtein is not the solution in this case. Take the two sequences 'ABC' and 'BCA', and replace each letter with a word, phrase, or paragraph (the same one in both sequences). The edit distance is high even though the only difference is the order. Levenshtein is also a performance killer.
Toto
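To make the 'ABC' versus 'BCA' point above concrete, here is a minimal Python sketch (easy to translate to PHP). It is a plain textbook Levenshtein over arbitrary sequences, not PHP's built-in levenshtein, and the three blocks are made-up placeholders. Treating each block as one token, the reordered text is two edits away out of three tokens, even though no content changed. The table it fills has len(a) * len(b) cells, which is where the cost on long texts comes from.

```python
def levenshtein(a, b):
    """Textbook dynamic-programming edit distance over two sequences."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


# Hypothetical blocks standing in for words, phrases, or paragraphs.
A, B, C = "first block", "second block", "third block"
original = [A, B, C]
reordered = [B, C, A]

# Block level: delete A at the front, insert A at the end.
block_distance = levenshtein(original, reordered)

# Character level: the same reordering costs many more edits.
char_distance = levenshtein(" ".join(original), " ".join(reordered))

print(block_distance, char_distance)
```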
similar_text also seems to be an edit distance based on the character level, so it is not the solution either.
Toto
Thanks for your answer. (My question was not really clear. Sorry for the confusion). :))
Toto
@Toto - what a fool I am, I didn't even realize that it's your question :)
karim79
No, no, you read it correctly the first time. I edited the text to remove the confusion after I saw that everybody was proposing the Levenshtein solution. :)
Toto
If I understand you correctly, you'd like to include rearrangement operations in the similarity matching?
nlucaroni
A: 

Needleman-Wunsch worked quite well for an application where I had to match names given to the same thing by different people.

jilles de wit
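For reference, a minimal Python sketch of Needleman-Wunsch global alignment scoring. The scores (match +1, mismatch -1, gap -1) and the helper name similarity are assumptions chosen for illustration, not the settings used in the application mentioned above. It works on any sequence, so it can be run on characters (for names) or on word tokens.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the best global alignment score of sequences a and b."""
    prev = [j * gap for j in range(len(b) + 1)]  # align empty a with b[:j]
    for i, x in enumerate(a, start=1):
        curr = [i * gap]                         # align a[:i] with empty b
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            curr.append(max(diag,                # pair x with y
                            prev[j] + gap,       # x against a gap
                            curr[j - 1] + gap))  # y against a gap
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Hypothetical normalisation: score divided by the longer length."""
    longest = max(len(a), len(b))
    return needleman_wunsch_score(a, b) / longest if longest else 1.0


# Two spellings of the same name, compared character by character.
print(similarity("Jon Smith", "John Smith"))
```

Because this is a global alignment, it rewards near-identical spellings but stays sensitive to order: with these scores, the token sequences A B C and B C A only reach a score of 0. It also fills a len(a) * len(b) table, like Levenshtein.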