views: 59
answers: 2
I am trying to search for long, approximate substrings in a large database. For example, a query could be a 1000-character string whose match differs from it by a Levenshtein distance of several hundred edits. I have heard that indexed q-grams can do this, but I don't know the implementation details. I have also heard that Lucene can do it, but is Lucene's Levenshtein matching fast enough for hundreds of edits? Perhaps something from the world of plagiarism detection? Any advice is appreciated.
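
As a point of reference for the q-gram question, the indexed-q-gram approach is usually a count filter over an inverted index, followed by exact verification. A minimal Python sketch, assuming whole-document granularity (all names are mine, not from any particular library; a real substring index would store q-gram positions and count hits per window of the database text):

    from collections import defaultdict

    def qgrams(s, q):
        # Overlapping substrings of length q.
        return [s[i:i + q] for i in range(len(s) - q + 1)]

    class QGramIndex:
        """Toy inverted index from q-grams to document ids."""

        def __init__(self, q=3):
            self.q = q
            self.docs = []
            self.postings = defaultdict(set)  # q-gram -> ids of docs containing it

        def add(self, doc):
            doc_id = len(self.docs)
            self.docs.append(doc)
            for gram in qgrams(doc, self.q):
                self.postings[gram].add(doc_id)
            return doc_id

        def candidates(self, query, k):
            """Doc ids that could be within Levenshtein distance k of query.

            q-gram lemma: one edit destroys at most q of the query's
            len(query) - q + 1 q-grams, so a true match must still share
            at least (len(query) - q + 1) - k * q of them. Counting one
            hit per query-gram occurrence over-counts slightly, which
            only makes the filter looser, never lossy.
            """
            threshold = (len(query) - self.q + 1) - k * self.q
            if threshold <= 0:
                return list(range(len(self.docs)))  # edit budget too big to prune
            counts = defaultdict(int)
            for gram in qgrams(query, self.q):
                for doc_id in self.postings.get(gram, ()):
                    counts[doc_id] += 1
            # Survivors still need exact verification, e.g. a Levenshtein check.
            return [d for d, c in counts.items() if c >= threshold]

For the numbers in the question (a 1000-character query, q = 3, k = 300 edits) the threshold is 998 - 900 = 98 shared trigrams, so the filter still prunes; as k grows relative to the query length it degenerates, and a smaller q or a different filter is needed.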

+1  A: 

Q-grams could be one approach, but there are others, such as BLAST and BLASTP, which are used for protein and nucleotide sequence matching.
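
Not BLAST itself, but the seed-and-extend idea behind it can be sketched briefly. A toy Python version under strong simplifying assumptions (exact seeds, ungapped rightward extension, +1/-1 scoring instead of a substitution matrix; all names and parameters are mine):

    def seed_and_extend(query, text, q=11, xdrop=10):
        # Index every q-length seed of the query by its offset.
        seeds = {}
        for i in range(len(query) - q + 1):
            seeds.setdefault(query[i:i + q], []).append(i)

        hits = []
        for j in range(len(text) - q + 1):
            for i in seeds.get(text[j:j + q], ()):
                # Extend right: +1 per match, -1 per mismatch; stop when
                # the score falls more than xdrop below the best so far.
                score = best = q
                length = best_length = q
                while i + length < len(query) and j + length < len(text):
                    score += 1 if query[i + length] == text[j + length] else -1
                    length += 1
                    if score > best:
                        best, best_length = score, length
                    elif best - score > xdrop:
                        break
                hits.append((i, j, best, best_length))
        return hits

Real BLAST adds gapped extension, two-hit seeding, and statistical significance of scores, but the pruning structure is the same: cheap exact seeds first, expensive alignment only around them.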

The SimMetrics library is a comprehensive collection of string-distance approaches.

Mikos
You should also look at cosine similarity.
Mikos
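
Cosine similarity over q-gram frequency profiles is cheap and easy to try. A small self-contained sketch (the function names are mine):

    import math
    from collections import Counter

    def qgram_profile(s, q=3):
        # Frequency vector of the string's overlapping q-grams.
        return Counter(s[i:i + q] for i in range(len(s) - q + 1))

    def cosine_similarity(a, b, q=3):
        # cos(theta) = (A . B) / (|A| * |B|) over the two profiles.
        pa, pb = qgram_profile(a, q), qgram_profile(b, q)
        dot = sum(pa[g] * pb[g] for g in pa.keys() & pb.keys())
        norm = (math.sqrt(sum(v * v for v in pa.values()))
                * math.sqrt(sum(v * v for v in pb.values())))
        return dot / norm if norm else 0.0

Note that this measure is order-insensitive (it only sees q-gram composition), so it works well as a cheap prefilter but is not a substitute for edit distance.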
+1  A: 

Lucene does not seem to be the right tool here. In addition to Mikos' fine suggestions, I have heard about AGREP, FASTA, and locality-sensitive hashing (LSH). I believe that an efficient method should first prune the search space heavily, and only then do more sophisticated scoring on the remaining candidates.

Yuval F
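
To make the prune-then-score pipeline concrete, here is a minimal MinHash-style LSH sketch in Python (a toy, not AGREP or FASTA; all names are mine): documents are reduced to MinHash signatures over their q-gram sets, signatures are split into bands that act as hash buckets, and only documents colliding with the query in at least one band go on to expensive scoring such as Levenshtein distance.

    from collections import defaultdict

    def qgram_set(s, q=4):
        return {s[i:i + q] for i in range(len(s) - q + 1)}

    def minhash_signature(grams, num_hashes=32):
        # One minimum per differently-seeded hash; similar sets tend to
        # agree in many positions. hash() is randomized per process,
        # which is fine for an in-memory index built and queried together.
        return [min(hash((seed, g)) for g in grams) for seed in range(num_hashes)]

    class LshIndex:
        """Banded MinHash LSH: similar documents collide in some band."""

        def __init__(self, q=4, num_hashes=32, bands=8):
            assert num_hashes % bands == 0
            self.q, self.num_hashes, self.bands = q, num_hashes, bands
            self.rows = num_hashes // bands
            self.docs = []
            self.buckets = defaultdict(set)  # (band, row values) -> doc ids

        def _band_keys(self, s):
            sig = minhash_signature(qgram_set(s, self.q), self.num_hashes)
            for b in range(self.bands):
                yield (b, tuple(sig[b * self.rows:(b + 1) * self.rows]))

        def add(self, doc):
            doc_id = len(self.docs)
            self.docs.append(doc)
            for key in self._band_keys(doc):
                self.buckets[key].add(doc_id)
            return doc_id

        def candidates(self, query):
            # Cheap pruning: anything sharing a band bucket with the query.
            # Score the survivors with something expensive afterwards.
            out = set()
            for key in self._band_keys(query):
                out |= self.buckets.get(key, set())
            return [self.docs[i] for i in out]

With 32 hashes in 8 bands of 4 rows, two documents whose q-gram sets have Jaccard similarity s collide in at least one band with probability 1 - (1 - s^4)^8, so near-duplicates are almost always kept while dissimilar documents are mostly pruned.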