views:

92

answers:

1

I read about kd-trees but they are inefficient when the dimensionality of the space is high. I have a database of value and I want to find the values that are within a certain hamming distance of the query. For instance, the database is a list of 32-bit numbers and I want to find all numbers that differ from the query value by less than 3 bits.

I heard somewhere about MultiVariate Partition trees but couldn't find a good reference. I know that min-Hash gives a good approximation, better as the but I'd like an exact answer.

A: 

The hamming distance is closely related to levenshtein distance, and is similar to algorithms used for spelling correction.

A method that works is branch-and-bound search in a trie. It takes time that is exponential in the distance, for near distance, up to being linear in the dictionary size.

If the dictionary is of binary words stored in a binary trie, with strict hamming distance, here is a simple pseudo-code:

walk(trie, word, i, hit, budget){
  if (budget < 0 || i > word.length) return;
  if (trie==NULL){
    if (i==word.length) print hit;
    return;
  }
  hit[i] = 0;
  walk(trie.subtrie[0], word, i+1, hit, (word[i]==0 ? budget : budget-1));
  hit[i] = 1;
  walk(trie.subtrie[1], word, i+1, hit, (word[i]==1 ? budget : budget-1));
}

main(){
  for (int budget = 0; ; budget++){
    walk(trie, word, 0, hit, budget);
    /* quit if enough hits have been printed */
  }
}

The idea is you walk the entire trie, keeping track of the distance between the current trie node and the original word. You prune the search by having a budget of how much distance you will tolerate. This works because the distance can never decrease as you go deeper into the trie.

Then you do this repeatedly with budgets starting at zero and increasing in steps until you print out the hits you want. Since each walk covers so many fewer nodes than the subsequent walk, it doesn't hurt that you're doing multiple walks. If k is fixed, you can simply start out with that as your budget.

Mike Dunlavey