Similarity of two texts (adaptive local alignment of keywords?)

+2 Q:

Hi!

I have two texts (max 4000 characters) of different lengths, and I need a similarity rate based on (partial) paraphrasing. Please note that the same portion of text can appear at a different position in each text (so Levenshtein is not the solution).

The comparison process should also:

not grow exponentially with text size
be performance-friendly. :)

It seems that the "adaptive local alignment of keywords" is a possible solution.

Do you have an example implementation? My preferred language is PHP, but I can translate. :)

Do you have any other solutions, ideas, or experience with this topic?

Thanks for your great help.

+4 A:

Take a look at the levenshtein and similar_text functions, which should make your life easier.

EDIT: @Toto has pointed out that those may not be suitable for this application, see his comments below.

karim79 2009-08-19 12:11:01

Levenshtein is not the solution in this case. Take the two sequences 'ABC' and 'BCA'. Replace each letter with a word, phrase, or paragraph (the same letter always standing for the same text). The edit distance is high even though the only difference is the order. Levenshtein is also a killer on the performance side.

Toto 2009-08-19 12:24:53
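To make the 'ABC' versus 'BCA' point above concrete, here is a minimal Python sketch (Python rather than PHP, purely for illustration). It compares a plain edit distance with an order-insensitive overlap of word windows ("shingles"). The shingle approach is a swapped-in technique, not the "adaptive local alignment of keywords" method mentioned in the question, and the helper names, the window size k=3, and the sample sentences are all assumptions.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance over two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def shingles(text, k=3):
    """Set of k-word windows (hypothetical helper; k=3 is an arbitrary choice)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def shingle_similarity(t1, t2, k=3):
    """Jaccard overlap of the two shingle sets, a value in [0, 1]."""
    s1, s2 = shingles(t1, k), shingles(t2, k)
    if not s1 and not s2:
        return 1.0
    return len(s1 & s2) / len(s1 | s2)


# Three blocks of text, then the same blocks in a different order.
A = "the quick brown fox"
B = "jumps over the lazy dog"
C = "while the cat sleeps soundly"
text1 = " ".join([A, B, C])
text2 = " ".join([B, C, A])

print(levenshtein("ABC", "BCA"))          # edit distance on the toy sequences
print(shingle_similarity(text1, text2))   # overlap is largely unaffected by order
```

The edit distance costs time proportional to the product of the two lengths, while building and intersecting the shingle sets is roughly linear in the number of words, assuming typical hash-set behaviour. Whether shingle overlap is a good enough proxy for paraphrasing depends on the texts, so treat this as a starting point only.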

similar_text also seems to be an edit distance working at the character level, so it is not the solution either...

Toto 2009-08-19 12:36:11

Thanks for your answer. (My question was not really clear. Sorry for the confusion). :))

Toto 2009-08-19 12:40:45

@Toto - what a fool I am, I didn't even realize that it's your question :)

karim79 2009-08-19 12:42:13

No, no, you read it correctly the first time. I edited the text to remove the confusion after I saw that everybody was proposing the Levenshtein solution. :)

Toto 2009-08-19 12:57:53

If I understand you, you'd like to include rearrangement operations into the similarity matches?

nlucaroni 2009-08-19 13:54:54

Needleman-Wunsch worked quite well for an application where I had to match names given to the same thing by different people.

jilles de wit 2009-08-19 12:31:15
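For readers who have not met Needleman-Wunsch, the following is a minimal Python sketch of its global alignment score. The scoring values (match=1, mismatch=-1, gap=-1), the function name, and the sample names are assumptions chosen for illustration, not the settings used in the application mentioned above.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of sequences a and b (score only, no traceback)."""
    # Row 0: aligning an empty prefix of a against b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            up = prev[j] + gap        # gap inserted in b
            left = curr[j - 1] + gap  # gap inserted in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


# Two spellings of the same name: nine matching characters and one gap.
print(needleman_wunsch_score("Jon Smith", "John Smith"))
# The reordered toy sequences still score poorly, since the alignment is global.
print(needleman_wunsch_score("ABC", "BCA"))
```

Because the alignment is global and order-preserving, it suits short strings with small local differences, such as name variants. It does not by itself handle blocks of text that have been moved around, and its running time is proportional to the product of the two lengths.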
