Text similarity algorithm

+6 Q:

I have two subtitle files. I need a function that tells whether they represent the same text or similar text.

Sometimes there are comments like "The wind is blowing... the music is playing" in one file only, but 80% of the contents will be the same. In that case the function must return TRUE (the files represent the same text). Sometimes there are also misspellings, like 1 (the digit one) instead of l (lowercase L), as here: She 1eft the baggage. Of course, the function must return TRUE here as well.

My comments:
The function should return the percentage similarity of the texts - AGREE

"all the people were happy" and "all the people were not happy" - here the difference would be treated like a misspelling, so they'd be considered the same text. To be exact, the percentage the function returns will be lower, but still high enough to say the phrases are similar.

Do consider whether you want to apply Levenshtein to a whole file or just to a search string - I'm not sure about Levenshtein, but the algorithm must be applied to the file as a whole. It'll be a very long string, though.

+1 A:

Have a look at approximate grep. It might give you pointers, though it's almost certain to perform abysmally on large chunks of text like you're talking about.

EDIT: The original version of agrep isn't open source, but you can find links to OSS versions at http://en.wikipedia.org/wiki/Agrep

Chinmay Kanchi 2010-02-24 11:36:59

+2 A:

You're expecting too much here; it looks like you will have to write a function for your specific needs. I would recommend starting with an existing file comparison application (maybe diff already has everything you need) and improving it to give good results for your input.

soulmerge 2010-02-24 11:37:59
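As a rough illustration of the diff-style approach suggested above, here is a minimal Python sketch built on the standard library's difflib. The function name line_similarity and the choice to compare stripped, non-empty lines are assumptions made for this example, not something from the answer. It scores whole lines only, so a line with a single misspelled character counts as a mismatch.

```python
from difflib import SequenceMatcher


def line_similarity(text_a: str, text_b: str) -> float:
    """Return a ratio in [0, 1] describing how many lines the two texts share.

    Hypothetical helper: lines are stripped and blank lines dropped before
    comparison, which is an assumption about how subtitle text is laid out.
    """
    lines_a = [ln.strip() for ln in text_a.splitlines() if ln.strip()]
    lines_b = [ln.strip() for ln in text_b.splitlines() if ln.strip()]
    # autojunk=False disables the heuristic that ignores very frequent items.
    matcher = SequenceMatcher(None, lines_a, lines_b, autojunk=False)
    # ratio() is 2 * (matching lines) / (total lines in both inputs).
    return matcher.ratio()


if __name__ == "__main__":
    a = "Hello\nShe left the baggage\nGoodbye"
    b = "Hello\n[music playing]\nShe left the baggage\nGoodbye"
    print(line_similarity(a, b))
```

One extra comment line in a four-line file still leaves a high ratio, so a threshold such as the 80% mentioned in the question could be applied to the returned value. Tolerating misspellings inside a line would need a character-level measure on top of this.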

Or render the text with a known font size (and face), and then compare pixels. That way, symbols with similar shapes will look similar, which makes them easier to detect.

Chii 2010-02-24 11:42:02

@Chii But a larger symbol would shift the rest of the page and throw everything off.

Jens Schauder 2010-02-24 11:45:28

I don't think the question has anything to do with OCR; it's just plain text.

stillstanding 2010-02-24 12:16:02

@rockjock You're right, plain text only.

EugeneP 2010-02-24 12:44:38

+6 A:

Levenshtein algorithm: http://en.wikipedia.org/wiki/Levenshtein_distance

Anything other than a result of zero means the texts are not "identical". "Similar" is a measure of how far apart or how near they are. The result is an integer.

stillstanding 2010-02-24 11:42:51

+1: The integer result would need to be normalised to determine the similarity of the whole file, e.g. Similarity = 1 - (Levenshtein distance / number of characters). I would also suggest preprocessing the file to correct spelling mistakes before applying this algorithm.

Adamski 2010-02-24 11:48:09
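To make the normalisation idea concrete, here is a minimal Python sketch. The function names levenshtein and similarity are hypothetical, and dividing by the length of the longer string (then subtracting from 1 so that identical texts score 1.0) is one possible choice, stated here as an assumption. The distance is computed with the standard two-row dynamic programming scheme.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalise the distance to [0, 1]; 1.0 means identical (assumed convention)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty texts are treated as identical
    return 1.0 - levenshtein(a, b) / longest


if __name__ == "__main__":
    print(similarity("She 1eft the baggage", "She left the baggage"))
    print(similarity("all the people were happy", "all the people were not happy"))
```

With this convention the misspelling example from the question scores 0.95 and the "not happy" pair about 0.86, both above an 80% threshold. The running time grows with the product of the two lengths, so whole subtitle files may be slow to compare this way.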

There is an implementation of the Levenshtein distance in Apache Commons `StringUtils`: http://commons.apache.org/lang/api-2.4/org/apache/commons/lang/StringUtils.html#getLevenshteinDistance(java.lang.String, java.lang.String)

Fabian Steeg 2010-02-24 11:56:54

@Fabian: It is a builtin function in PHP: http://php.net/manual/en/function.levenshtein.php

soulmerge 2010-02-24 13:16:27
