views: 477

answers: 7
Given a query string Q of length N, and a list L of M sequences of length exactly N, what is the most efficient algorithm to find the string in L with the fewest mismatch positions to Q? For example:

Q = "ABCDEFG";
L = ["ABCCEFG", "AAAAAAA", "TTAGGGT", "ZYXWVUT"];
answer = L.query(Q);  # Returns "ABCCEFG"
answer2 = L.query("AAAATAA");  # Returns "AAAAAAA"

The obvious way is to scan every sequence in L, making the search take O(M * N). Is there any way to do this in sublinear time? I don't care if there's a large upfront cost to organizing L into some data structure because it will be queried a lot of times. Also, handling tied scores arbitrarily is fine.

Edit: To clarify, I am looking for the Hamming distance.
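For reference, the obvious O(M * N) scan might look like this (a Python sketch; `hamming` and `query` are hypothetical helpers, not part of any library):

```python
def hamming(a, b):
    # number of positions where two equal-length strings differ
    return sum(x != y for x, y in zip(a, b))

def query(L, Q):
    # linear scan over all M sequences: O(M * N); min() breaks ties arbitrarily
    return min(L, key=lambda s: hamming(s, Q))

L = ["ABCCEFG", "AAAAAAA", "TTAGGGT", "ZYXWVUT"]
print(query(L, "ABCDEFG"))  # ABCCEFG
print(query(L, "AAAATAA"))  # AAAAAAA
```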

+1  A: 

I think you are looking for the Levenshtein edit distance.

There are a few questions here on SO about this already, I suppose you can find some good answers.

Tomalak
The google link is missing some spaces
Jens Schauder
Not really. He's looking for the fastest way to find the string with the lowest edit distance out of a list of them.
chaos
@Chaos: The fastest way is to look at the edit distance (Levenshtein or some other algorithm, doesn't matter here) for every string in the list and then take the first one with the lowest distance. How else would it be done?
Tomalak
Of course you can short-cut on a perfect match, but that's about it.
Tomalak
There are definitely faster ways.
chaos
+1  A: 

You could treat each sequence as an N-dimensional coordinate, chunk the resulting space into blocks that know which sequences occur in them, then on a lookup first search the query sequence's block and all adjacent blocks, expanding outward as necessary. (Maintaining several granularities of chunking is probably preferable to searching really large groups of blocks.)
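One very rough reading of this idea, keying blocks on a couple of fixed coordinates and expanding outward by block distance (a Python sketch under those assumptions; `positions` is an illustrative parameter):

```python
from collections import defaultdict

def build_blocks(L, positions=(0, 1)):
    # Key each sequence by its characters at a few fixed positions;
    # each block remembers which sequences fall into it.
    blocks = defaultdict(list)
    for s in L:
        blocks[tuple(s[p] for p in positions)].append(s)
    return blocks

def block_query(blocks, Q, positions=(0, 1)):
    # Search the query's own block first, then expand outward to blocks
    # whose keys differ from the query's key in more and more positions.
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))
    qkey = tuple(Q[p] for p in positions)
    best, best_d = None, None
    for radius in range(len(positions) + 1):
        for key, members in blocks.items():
            if hamming(key, qkey) != radius:
                continue
            for s in members:
                d = hamming(s, Q)
                if best is None or d < best_d:
                    best, best_d = s, d
        # A sequence in a block at key-distance r differs from Q in at
        # least r positions, so we can stop once best_d <= radius.
        if best is not None and best_d <= radius:
            return best
    return best
```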

chaos
A: 

Some variety of best-first search on the target sequences will do much better than O(M * N). The basic idea of this is that you'd compare the first character in your candidate sequence with the first character of the target sequences, then in your second iteration only do the next-character comparison with the sequences that have the least number of mismatches, and so on. In your first example, you'd wind up comparing against ABCCEFG and AAAAAAA the second time, ABCCEFG only the third and fourth times, all the sequences the fifth time, and only ABCCEFG thereafter. When you get to the end of your candidate sequence, the set of target sequences with the lowest mismatch count is your match set.

(Note: at each step you're comparing against the next character for that branch of the search. None of the progressive comparisons skip characters.)
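A minimal sketch of this best-first search, using a priority queue keyed on mismatches-so-far (Python, for illustration; not the poster's own code):

```python
import heapq

def best_first_match(L, Q):
    # Each heap entry: (mismatches so far, next position to compare, index into L).
    # Always advance the candidate with the fewest mismatches so far; nothing is
    # discarded, only pushed down the priority queue.
    heap = [(0, 0, i) for i in range(len(L))]
    heapq.heapify(heap)
    while True:
        mismatches, pos, i = heapq.heappop(heap)
        if pos == len(Q):
            # fully compared with the lowest mismatch count seen anywhere
            return L[i]
        mismatches += (L[i][pos] != Q[pos])
        heapq.heappush(heap, (mismatches, pos + 1, i))
```

Since an entry's mismatch count only grows as it advances, the first fully-compared sequence popped is a minimum-Hamming match, including in the baaa/abbb/aaaa case discussed below.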

chaos
won't work if you have baaa and abbb as options and look for aaaa. It will throw out the correct answer in the first iteration.
Jens Schauder
Incorrect. Something like depth-first search would do that; best-first search won't. It won't look at the correct answer on the second iteration, but it will look at it on the third and fourth, and identify it correctly.
chaos
Where you're going wrong is that you're thinking it's throwing things out. It isn't; it's moving them down a priority queue.
chaos
+1  A: 

Are you looking for the Hamming distance between the strings (i.e. the number of different characters at equivalent locations)?

Or does the distance "between" characters (e.g. difference between ASCII values of English letters) matter to you as well?
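The distinction in code (a small Python illustration):

```python
def hamming(a, b):
    # counts positions that differ, regardless of how different they are
    return sum(x != y for x, y in zip(a, b))

def ascii_distance(a, b):
    # also weighs how far apart the characters themselves are
    return sum(abs(ord(x) - ord(y)) for x, y in zip(a, b))

print(hamming("ABZ", "ABA"))         # 1
print(ascii_distance("ABZ", "ABA"))  # 25
```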

Assaf Lavie
+1 Well, upon reading the question again it's more likely to be Hamming than Levenshtein.
Tomalak
+2  A: 

Locality sensitive hashing underlies what seems to be the asymptotically best method known, as I understand it from this review article in CACM. Said article is pretty hairy and I didn't read it all. See also nearest neighbor search.

To relate these references to your problem: they all deal with a set of points in a metric space, such as an n-dimensional vector space. In your problem, n is the length of each string, and the values on each coordinate are the characters that can appear at each position in a string.
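For Hamming distance specifically, the classic LSH family hashes a string to its characters at a few random positions, so that nearby strings collide in at least one table with good probability. A sketch under that assumption (Python; `num_tables` and `k` are illustrative parameters, and a real implementation would fall back to a full scan when no bucket hits):

```python
import random
from collections import defaultdict

def build_lsh(L, num_tables=8, k=3, seed=0):
    # Each table projects every string onto k randomly chosen positions.
    rng = random.Random(seed)
    n = len(L[0])
    tables = []
    for _ in range(num_tables):
        positions = rng.sample(range(n), k)
        buckets = defaultdict(list)
        for s in L:
            buckets[tuple(s[p] for p in positions)].append(s)
        tables.append((positions, buckets))
    return tables

def lsh_query(tables, Q):
    # Gather every string colliding with Q in any table, then rank candidates.
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))
    candidates = set()
    for positions, buckets in tables:
        candidates.update(buckets.get(tuple(Q[p] for p in positions), ()))
    if not candidates:
        return None  # in practice: fall back to a linear scan
    return min(candidates, key=lambda s: hamming(s, Q))
```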

Darius Bacon
A: 

If up-front costs don't matter, you could calculate the best match for every possible input and put the results in a hash map.

Of course this won't work unless N is extremely small.
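A sketch of that precomputation for a toy alphabet (Python; only feasible while |alphabet| ** N stays tiny):

```python
from itertools import product

def precompute_all(L, alphabet):
    # Exhaustively map every possible length-N query to its best match.
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))
    n = len(L[0])
    return {
        "".join(q): min(L, key=lambda s: hamming(s, q))
        for q in product(alphabet, repeat=n)
    }

# e.g. binary alphabet, N = 3: 2**3 = 8 precomputed entries
table = precompute_all(["000", "111"], "01")
print(table["001"])  # 000
```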

Jens Schauder
A: 

I can't think of a general, exact algorithm that will be less than O(N * M), but if you have a small enough M and N you can make an algorithm that performs in O(N + M) using bit-parallel operations.

For example, if N and M are both less than 16, you could use a lookup table of 64-bit ints indexed by position and character (16 * log2(16) = 64), and perform all operations in one pass through the string, where each group of 4 bits in the counter counts 0-15 for one of the strings being matched. In general you need M * log2(N+1) bits to store the counters, so you might need to update multiple words for each character, but a single-pass lookup is often faster than other approaches. So it's really O(N * M * log(N)), just with a lower constant factor: packing into 64-bit ints introduces a 1/64, so it should win while log2(N) < 64. If M * log2(N+1) < 64, it works out to O(N + M) operations. But that's still linear, rather than sub-linear.

#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
#include <inttypes.h>

size_t match ( const char* string, uint64_t table[][128] ) ;

int main ()
{
    const char* data[] = { "ABCCEFG", "AAAAAAA", "TTAGGGT", "ZYXWVUT" };
    const size_t N = 7;
    const size_t M = 4;

    // prepare a table
    uint64_t table[7][128] = { 0 };

    for ( size_t i = 0; i < M; ++i )
        for ( size_t j = 0; j < N; ++j )
            table[j][ (size_t)data[i][j] ] |= 1ULL << (i * 4);

    const char* examples[] = { "ABCDEFG", "AAAATAA", "TTAGQQT", "ZAAGVUT" };

    for ( size_t i = 0; i < 4; ++i ) {
        const char* q = examples[i];
        size_t result = match ( q, table );

        printf("Q(%s) -> %zu %s\n", q, result, data[result]);
    }
}

size_t match ( const char* string, uint64_t table[][128] )
{
    uint64_t count = 0;

    // scan through string once, updating all counters at once
    for ( size_t i = 0; string[i]; ++i )
        count += table[i][ (size_t) string[i] ];

    // find greatest sub-count (most matching positions = fewest mismatches)
    size_t best = 0;
    size_t best_sub_count = count & 0xf;

    for ( size_t i = 1; i < 4; ++i ) {
        size_t sub_count = ( count >>= 4 ) & 0xf;

        if ( sub_count > best_sub_count ) {
            best_sub_count = sub_count;
            best = i;
        }
    }

    return best;
}
Pete Kirkham