Given a long string L and a shorter string S (the constraint is that L.length must be >= S.length), I want to find the minimum Hamming distance between S and any substring of L with length equal to S.length. Let's call the function for this minHamming(). For example,

minHamming(ABCDEFGHIJ, CDEFGG) == 1.

minHamming(ABCDEFGHIJ, BCDGHI) == 3.

Doing this the obvious way (enumerating every substring of L) requires O(S.length * L.length) time. Is there any clever way to do this in sublinear time? I search the same L with several different S strings, so doing some complicated preprocessing to L once is acceptable.
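For reference, the "obvious way" might look like this minimal Python sketch (min_hamming_naive is just an illustrative name, not part of the question):

def min_hamming_naive(L, S):
    # minimum Hamming distance between S and any len(S)-length substring of L
    best = len(S)
    for i in range(len(L) - len(S) + 1):
        dist = sum(1 for a, b in zip(L[i:i + len(S)], S) if a != b)
        best = min(best, dist)
    return best

# min_hamming_naive("ABCDEFGHIJ", "CDEFGG") == 1
# min_hamming_naive("ABCDEFGHIJ", "BCDGHI") == 3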

Edit: The modified Boyer-Moore would be a good idea, except that my alphabet is only 4 letters (DNA).

A: 

You're stuck as far as big-O is concerned. At a fundamental level, you're going to need to test whether every letter in the target matches each eligible letter in the substring.

Luckily, this is easily parallelized.

One optimization you can apply is to keep a running count of mismatches for the current position. If it reaches the lowest Hamming distance found so far, you can obviously skip to the next possibility.
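A minimal sketch of that early-exit idea (illustrative Python, not from the original answer):

def min_hamming_early_exit(L, S):
    best = len(S)
    for i in range(len(L) - len(S) + 1):
        dist = 0
        for j in range(len(S)):
            if L[i + j] != S[j]:
                dist += 1
                if dist >= best:   # already no better than the best so far, skip ahead
                    break
        else:
            best = dist            # window finished with fewer mismatches
            if best == 0:
                return 0
    return best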

rlbond
-1 sorry. In fact faster algorithms do exist for this problem (see my post) though they aren't necessarily obvious.
j_random_hacker
+1  A: 

Modified Boyer-Moore

I've just dug up some old Python implementation of Boyer-Moore I had lying around and modified the matching loop (where the text is compared to the pattern). Instead of breaking out as soon as the first mismatch is found between the two strings, simply count up the number of mismatches, but remember the first mismatch:

current_dist = 0
first_mismatch = -1    # position of the first mismatch in this alignment
while pattern_pos >= 0:
    if pattern[pattern_pos] != text[text_pos]:
        if first_mismatch == -1:
            first_mismatch = pattern_pos
            tp = text_pos
        current_dist += 1
        if current_dist == smallest_dist:
            break

    pattern_pos -= 1
    text_pos -= 1

smallest_dist = min(current_dist, smallest_dist)
# if the distance is 0, we've had a match and can quit
if current_dist == 0:
    return 0
else:  # shift as in standard Boyer-Moore, starting from the first mismatch
    pattern_pos = first_mismatch
    text_pos = tp
    ...

If the pattern did not match completely at this position, go back to the point of the first mismatch by restoring the saved values before computing the shift. This makes sure that the smallest distance is actually found.

The whole implementation is rather long (~150 LOC), but I can post it on request. The core idea is outlined above; everything else is standard Boyer-Moore.

Preprocessing on the Text

Another way to speed things up is preprocessing the text to build an index of character positions. You only want to start comparing at positions where at least one character of the two strings matches; otherwise the Hamming distance is trivially |S|.

import sys
from collections import defaultdict
import bisect

def char_positions(t):
    pos = defaultdict(list)
    for idx, c in enumerate(t):
        pos[c].append(idx)
    return dict(pos)

This method simply creates a dictionary which maps each character in the text to the sorted list of its occurrences.
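For example (a small DNA-style text; dictionary key order may differ):

>>> char_positions("ACGTACGA")
{'A': [0, 4, 7], 'C': [1, 5], 'G': [2, 6], 'T': [3]}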

The comparison loop is more or less unchanged from the naive O(mn) approach, except that we do not advance the starting position by 1 each time, but jump ahead based on the character positions:

def min_hamming(text, pattern):
    best = len(pattern)
    pos = char_positions(text)

    i = find_next_pos(pattern, pos, 0)

    while i <= len(text) - len(pattern):
        dist = 0
        for c in range(len(pattern)):
            if text[i + c] != pattern[c]:
                dist += 1
                if dist == best:
                    break       # cannot beat the best distance found so far
        else:
            if dist == 0:       # perfect match, no need to look further
                return 0
        best = min(dist, best)
        i = find_next_pos(pattern, pos, i + 1)

    return best

The actual improvement is in find_next_pos:

def find_next_pos(pattern, pos, i):
    # smallest start position >= i at which at least one pattern character matches
    smallest = sys.maxsize
    for idx, c in enumerate(pattern):
        if c in pos:
            x = bisect.bisect_left(pos[c], i + idx)
            if x < len(pos[c]):
                smallest = min(smallest, pos[c][x] - idx)
    return smallest

For each new starting position, we find the lowest position at which at least one character of S lines up with the same character in L. If there is no such position any more, find_next_pos returns sys.maxsize and the algorithm terminates.
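With these two functions, the examples from the question work out as expected:

>>> min_hamming("ABCDEFGHIJ", "CDEFGG")
1
>>> min_hamming("ABCDEFGHIJ", "BCDGHI")
3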

find_next_pos is certainly complex (it scans the entire pattern on every call), and one could try to improve it by only using the first several characters of the pattern S, or by using a set to make sure characters from the pattern are not checked twice.

Discussion

Which method is faster largely depends on your dataset. The more diverse your alphabet is, the larger the jumps will be. If you have a very long L, the second method with preprocessing might be faster. For very, very short strings (like in your question), the naive approach will certainly be the fastest.

DNA

If you have a very small alphabet, you could try to get the character positions for character bigrams (or larger) rather than unigrams.
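A minimal sketch of such an n-gram index (ngram_positions is a hypothetical helper; find_next_pos would need an analogous change to look up pattern n-grams instead of single characters):

def ngram_positions(t, n=2):
    # map every overlapping n-gram of the text to its sorted start positions
    pos = defaultdict(list)
    for idx in range(len(t) - n + 1):
        pos[t[idx:idx + n]].append(idx)
    return dict(pos)

# ngram_positions("ACGTACG") == {'AC': [0, 4], 'CG': [1, 5], 'GT': [2], 'TA': [3]}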

Torsten Marek
Oh right, and I'm sorry all the code examples are in Python, but that's just the language I'm most comfortable with. Please feel free to ask for clarifications!
Torsten Marek
Personally, I think people should always post code in whatever language they are most comfortable with. If nothing else, it teaches others a bit about that language.
Daniel
+1 for clean code samples, and I'd +1 again (if allowed) for the bigram-or-larger idea for small alphabets. 4-grams of DNA would be similar to 8-bit bytes, after all. You'd just need to think a bit about the transformation from your n-grams back to the real sequence.
RBerteig
+5  A: 

Perhaps surprisingly, this exact problem can be solved in just O(|A| n log n) time using Fast Fourier Transforms (FFTs), where n is the length of the larger sequence L and |A| is the size of the alphabet.

There is a freely available PDF of a paper by Donald Benson describing how it works.

Summary: Convert each of your strings S and L into several indicator vectors (one per character, so 4 in the case of DNA), and then convolve corresponding vectors to determine match counts for each possible alignment. The trick is that convolution in the "time" domain, which ordinarily requires O(n^2) time, can be implemented using multiplication in the "frequency" domain, which requires just O(n) time, plus the time required to convert between domains and back again. Using the FFT, each conversion takes just O(n log n) time, so the overall time complexity is O(|A| n log n). For greatest speed, finite field FFTs are used, which require only integer arithmetic.
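For illustration, here is a rough sketch of the indicator-vector idea using NumPy's floating-point FFT rather than the finite field FFTs the paper uses (min_hamming_fft is just an illustrative name, and numpy is assumed to be available):

import numpy as np

def min_hamming_fft(L, S, alphabet=None):
    # Minimum Hamming distance between S and any len(S)-length window of L,
    # computed with one FFT-based cross-correlation per alphabet character.
    if alphabet is None:
        alphabet = set(S)        # only characters of S can ever produce a match
    n, m = len(L), len(S)
    size = 1
    while size < n + m:          # FFT length: at least the length of the linear convolution
        size *= 2
    matches = np.zeros(n - m + 1)
    for a in alphabet:
        x = np.array([1.0 if c == a else 0.0 for c in L])
        y = np.array([1.0 if c == a else 0.0 for c in S])
        # Convolving x with the reversed indicator of S gives, at index i + m - 1,
        # the number of positions where character a matches in the window L[i:i+m].
        conv = np.fft.irfft(np.fft.rfft(x, size) * np.fft.rfft(y[::-1], size), size)
        matches += np.round(conv[m - 1:n])
    return int(m - matches.max())

# min_hamming_fft("ABCDEFGHIJ", "CDEFGG") == 1
# min_hamming_fft("ABCDEFGHIJ", "BCDGHI") == 3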

Note: For arbitrary S and L this algorithm is clearly a huge performance win over the straightforward O(mn) algorithm as |S| and |L| become large, but OTOH if S is typically shorter than log|L| (e.g. when querying a large DB with a small sequence), then obviously this approach provides no speedup.

UPDATE 21/7/2009: Updated to mention that the time complexity also depends linearly on the size of the alphabet, since a separate pair of indicator vectors must be used for each character in the alphabet.

j_random_hacker
FFT? That's a cool answer!
Daniel
Not practical for my situation b/c S is very small compared to L, but an interesting enough answer to deserve an upvote anyhow.
dsimcha
Updated to mention that the time complexity also depends linearly on the size of the alphabet.
j_random_hacker
Thanks Daniel! Yes, FFTs are one of those things that seem kind of magical to me -- as though you shouldn't be able to calculate something that fast :) Other things in that should-be-impossible category for me are the fact that suffix trees/arrays can be built and queried in linear time :)
j_random_hacker
Neat! @dsimcha: If S is very small — as short as log|L| letters — then why bother? Note that you have an Ω(|L|) lower bound (you need to look at all letters in L)... so are you in a state where |L| is small enough but |L|*log|L| is too large?
ShreevatsaR