Given a long string L and a shorter string S (the constraint is that L.length must be >= S.length), I want to find the minimum Hamming distance between S and any substring of L with length equal to S.length. Let's call the function for this minHamming(). For example,

minHamming(ABCDEFGHIJ, CDEFGG) == 1.

minHamming(ABCDEFGHIJ, BCDGHI) == 3.

Doing this the obvious way (enumerating every substring of L) requires O(S.length * L.length) time. Is there any clever way to do this in sublinear time? I search the same L with several different S strings, so doing some complicated preprocessing to L once is acceptable.
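For reference, the "obvious way" might look like this minimal Python sketch (min_hamming_naive is just an illustrative name, not part of the question):

def min_hamming_naive(L, S):
    # minimum Hamming distance between S and any len(S)-length substring of L
    best = len(S)
    for i in range(len(L) - len(S) + 1):
        dist = sum(1 for a, b in zip(L[i:i + len(S)], S) if a != b)
        best = min(best, dist)
    return best

# min_hamming_naive("ABCDEFGHIJ", "CDEFGG") == 1
# min_hamming_naive("ABCDEFGHIJ", "BCDGHI") == 3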

Edit: The modified Boyer-Moore would be a good idea, except that my alphabet is only 4 letters (DNA).

A: 

You're stuck as far as big-O is concerned. At a fundamental level, you're going to need to test whether every letter in the target matches each eligible letter in the substring.

Luckily, this is easily parallelized.

One optimization you can apply is to keep a running count of mismatches for the current position. If it reaches the lowest Hamming distance found so far, you can obviously skip to the next possibility.
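A minimal sketch of that early-exit idea (illustrative Python, not from the original answer):

def min_hamming_early_exit(L, S):
    best = len(S)
    for i in range(len(L) - len(S) + 1):
        dist = 0
        for j in range(len(S)):
            if L[i + j] != S[j]:
                dist += 1
                if dist >= best:   # already no better than the best so far, skip ahead
                    break
        else:
            best = dist            # window finished with fewer mismatches
            if best == 0:
                return 0
    return best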

rlbond
-1 sorry. In fact faster algorithms do exist for this problem (see my post) though they aren't necessarily obvious.
j_random_hacker
+1  A: 

Modified Boyer-Moore

I've just dug up some old Python implementation of Boyer-Moore I had lying around and modified the matching loop (where the text is compared to the pattern). Instead of breaking out as soon as the first mismatch is found between the two strings, simply count up the number of mismatches, but remember the first mismatch:

current_dist = 0
first_mismatch = -1    # position of the first mismatch in this alignment
while pattern_pos >= 0:
    if pattern[pattern_pos] != text[text_pos]:
        if first_mismatch == -1:
            first_mismatch = pattern_pos
            tp = text_pos
        current_dist += 1
        if current_dist == smallest_dist:
            break

    pattern_pos -= 1
    text_pos -= 1

smallest_dist = min(current_dist, smallest_dist)
# if the distance is 0, we've had a match and can quit
if current_dist == 0:
    return 0
else:  # shift as in standard Boyer-Moore, starting from the first mismatch
    pattern_pos = first_mismatch
    text_pos = tp
    ...

If the pattern did not match completely at this position, go back to the point of the first mismatch by restoring the saved values before computing the shift. This makes sure that the smallest distance is actually found.

The whole implementation is rather long (~150 LOC), but I can post it on request. The core idea is outlined above; everything else is standard Boyer-Moore.

Preprocessing on the Text

Another way to speed things up is preprocessing the text to build an index of character positions. You only want to start comparing at positions where at least one character of the two strings matches; otherwise the Hamming distance is trivially |S|.

import sys
from collections import defaultdict
import bisect

def char_positions(t):
    pos = defaultdict(list)
    for idx, c in enumerate(t):
        pos[c].append(idx)
    return dict(pos)

This method simply creates a dictionary which maps each character in the text to the sorted list of its occurrences.
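For example (a small DNA-style text; dictionary key order may differ):

>>> char_positions("ACGTACGA")
{'A': [0, 4, 7], 'C': [1, 5], 'G': [2, 6], 'T': [3]}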

The comparison loop is more or less unchanged from the naive O(mn) approach, except that we do not advance the starting position by 1 each time, but jump ahead based on the character positions:

def min_hamming(text, pattern):
    best = len(pattern)
    pos = char_positions(text)

    i = find_next_pos(pattern, pos, 0)

    while i <= len(text) - len(pattern):
        dist = 0
        for c in range(len(pattern)):
            if text[i + c] != pattern[c]:
                dist += 1
                if dist == best:
                    break       # cannot beat the best distance found so far
        else:
            if dist == 0:       # perfect match, no need to look further
                return 0
        best = min(dist, best)
        i = find_next_pos(pattern, pos, i + 1)

    return best

The actual improvement is in find_next_pos:

def find_next_pos(pattern, pos, i):
    # smallest start position >= i at which at least one pattern character matches
    smallest = sys.maxsize
    for idx, c in enumerate(pattern):
        if c in pos:
            x = bisect.bisect_left(pos[c], i + idx)
            if x < len(pos[c]):
                smallest = min(smallest, pos[c][x] - idx)
    return smallest

For each new starting position, we find the lowest position at which at least one character of S lines up with the same character in L. If there is no such position any more, find_next_pos returns sys.maxsize and the algorithm terminates.
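With these two functions, the examples from the question work out as expected:

>>> min_hamming("ABCDEFGHIJ", "CDEFGG")
1
>>> min_hamming("ABCDEFGHIJ", "BCDGHI")
3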

find_next_pos is certainly complex (it scans the entire pattern on every call), and one could try to improve it by only using the first several characters of the pattern S, or by using a set to make sure characters from the pattern are not checked twice.

Discussion

Which method is faster largely depends on your dataset. The more diverse your alphabet is, the larger the jumps will be. If you have a very long L, the second method with preprocessing might be faster. For very, very short strings (like in your question), the naive approach will certainly be the fastest.

DNA

If you have a very small alphabet, you could try to get the character positions for character bigrams (or larger) rather than unigrams.
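A minimal sketch of such an n-gram index (ngram_positions is a hypothetical helper; find_next_pos would need an analogous change to look up pattern n-grams instead of single characters):

def ngram_positions(t, n=2):
    # map every overlapping n-gram of the text to its sorted start positions
    pos = defaultdict(list)
    for idx in range(len(t) - n + 1):
        pos[t[idx:idx + n]].append(idx)
    return dict(pos)

# ngram_positions("ACGTACG") == {'AC': [0, 4], 'CG': [1, 5], 'GT': [2], 'TA': [3]}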

Torsten Marek
Oh right, and I'm sorry all the code examples are in Python, but that's just the language I'm most comfortable with. Please feel free to ask for clarifications!
Torsten Marek
Personally, I think people should always post code in whatever language they are most comfortable with. If nothing else, it teaches others a bit about that language.
Daniel
+1 for clean code samples, and I'd +1 again (if allowed) for the bigram-or-larger idea for small alphabets. 4-grams of DNA would be similar to 8-bit bytes, after all. You'd just need to think a bit about the transformation from your n-grams back to the real sequence.
RBerteig
+5  A: 

Perhaps surprisingly, this exact problem can be solved in just O(|A| n log n) time using Fast Fourier Transforms (FFTs), where n is the length of the larger sequence L and |A| is the size of the alphabet.

There is a freely available PDF of a paper by Donald Benson describing how it works.

Summary: Convert each of your strings S and L into several indicator vectors (one per character, so 4 in the case of DNA), and then convolve corresponding vectors to determine match counts for each possible alignment. The trick is that convolution in the "time" domain, which ordinarily requires O(n^2) time, can be implemented using multiplication in the "frequency" domain, which requires just O(n) time, plus the time required to convert between domains and back again. Using the FFT, each conversion takes just O(n log n) time, so the overall time complexity is O(|A| n log n). For greatest speed, finite field FFTs are used, which require only integer arithmetic.
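For illustration, here is a rough sketch of the indicator-vector idea using NumPy's floating-point FFT rather than the finite field FFTs the paper uses (min_hamming_fft is just an illustrative name, and numpy is assumed to be available):

import numpy as np

def min_hamming_fft(L, S, alphabet=None):
    # Minimum Hamming distance between S and any len(S)-length window of L,
    # computed with one FFT-based cross-correlation per alphabet character.
    if alphabet is None:
        alphabet = set(S)        # only characters of S can ever produce a match
    n, m = len(L), len(S)
    size = 1
    while size < n + m:          # FFT length: at least the length of the linear convolution
        size *= 2
    matches = np.zeros(n - m + 1)
    for a in alphabet:
        x = np.array([1.0 if c == a else 0.0 for c in L])
        y = np.array([1.0 if c == a else 0.0 for c in S])
        # Convolving x with the reversed indicator of S gives, at index i + m - 1,
        # the number of positions where character a matches in the window L[i:i+m].
        conv = np.fft.irfft(np.fft.rfft(x, size) * np.fft.rfft(y[::-1], size), size)
        matches += np.round(conv[m - 1:n])
    return int(m - matches.max())

# min_hamming_fft("ABCDEFGHIJ", "CDEFGG") == 1
# min_hamming_fft("ABCDEFGHIJ", "BCDGHI") == 3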

Note: For arbitrary S and L this algorithm is clearly a huge performance win over the straightforward O(mn) algorithm as |S| and |L| become large, but OTOH if S is typically shorter than log|L| (e.g. when querying a large DB with a small sequence), then obviously this approach provides no speedup.

UPDATE 21/7/2009: Updated to mention that the time complexity also depends linearly on the size of the alphabet, since a separate pair of indicator vectors must be used for each character in the alphabet.

j_random_hacker
FFT? That's a cool answer!
Daniel
Not practical for my situation b/c S is very small compared to L, but an interesting enough answer to deserve an upvote anyhow.
dsimcha
Updated to mention that the time complexity also depends linearly on the size of the alphabet.
j_random_hacker
Thanks Daniel! Yes, FFTs are one of those things that seem kind of magical to me -- as though you shouldn't be able to calculate something that fast :) Other things in that should-be-impossible category for me are the fact that suffix trees/arrays can be built and queried in linear time :)
j_random_hacker
Neat! @dsimcha: If S is very small — as short as log|L| letters — then why bother? Note that you have an Ω(|L|) lower bound (you need to look at all letters in L)... so are you in a state where |L| is small enough but |L|*log|L| is too large?
ShreevatsaR