I have some data, somewhere between a million and a billion records, each of which is represented by a bitfield of about 64 bits per key. The bits are independent; you can think of them basically as random bits.

If I have a test key and I want to find all values in my data with the same key, a hash table will spit those out very easily, in O(1).

What algorithm/data structure would efficiently find all records most similar to the query key? Here, similar means that most bits are identical but a minimal number are allowed to be wrong. This is traditionally measured by Hamming distance, which just counts the number of mismatched bits.

There are two ways this query might be made. One is by specifying a mismatch threshold, like "give me a list of all existing keys which have fewer than 6 bits that differ from my query"; the other is by asking for the best matches, like "give me a list of the 10,000 keys which have the lowest number of differing bits from my query."

You might be tempted to reach for k-nearest-neighbor algorithms, but here we're talking about independent bits, so it doesn't seem likely that structures like quadtrees are useful.

The problem can be solved by simple brute force: test a hash table for low numbers of differing bits. If we want to find all keys that differ by one bit from our query, for example, we can enumerate the 64 possible single-bit flips and test them all. But this explodes quickly; if we wanted to allow two bits of difference, we'd have to probe another 64*63/2 = 2016 keys, and it gets combinatorially worse for higher numbers of bits.
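
For concreteness, a minimal sketch of that brute-force probing in Python, assuming the records live in an ordinary dict keyed by the 64-bit value (the function name is just for illustration):

    from itertools import combinations

    def probe_within_distance(table, query, max_dist, bits=64):
        # Enumerate every key within max_dist bit flips of the query and
        # look each candidate up in the hash table.  The number of probes
        # is the sum of C(bits, d) for d = 0..max_dist, which grows fast.
        results = []
        for d in range(max_dist + 1):
            for positions in combinations(range(bits), d):
                candidate = query
                for p in positions:
                    candidate ^= 1 << p  # flip bit p
                if candidate in table:
                    results.append(table[candidate])
        return results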

So is there another data structure or strategy that makes this kind of query more efficient? The database/structure can be preprocessed as much as you like, it's the query speed that should be optimized.

A: 

Well, you could insert all of the neighbor keys along with the original key. That would mean that you store (64 choose k) times as much data for k differing bits, and it requires that you decide k beforehand. Though you could always extend k by brute-force querying neighbors, and this will automatically query the neighbors of your neighbors that you inserted. This also gives you a time-space tradeoff: for example, if you accept a 64x data blowup and 64-times-slower queries, you can get two bits of distance.
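
A rough sketch of that insert-the-neighbors scheme, assuming keys fit in Python ints (the function name and the dict-of-sets layout are mine):

    from itertools import combinations

    def build_neighbor_index(keys, k=1, bits=64):
        # Store each key under itself and under every key at most k bit
        # flips away, costing roughly C(bits, k) extra entries per key.
        index = {}
        for key in keys:
            for d in range(k + 1):
                for positions in combinations(range(bits), d):
                    neighbor = key
                    for p in positions:
                        neighbor ^= 1 << p
                    index.setdefault(neighbor, set()).add(key)
        return index

    # A query within distance k is then a single lookup:
    # index = build_neighbor_index(keys)
    # matches = index.get(query, set())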

Martin Hock
+1  A: 

I'd go with an inverted index, like a search engine. You've basically got a fixed vocabulary of 64 words. Similarity is then measured by Hamming distance instead of the cosine similarity a search engine would use. Constructing the index will be slow, but you ought to be able to query it at normal search-engine speeds.

The book Introduction to Information Retrieval covers the efficient construction, storage, compression and querying of inverted indexes.
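
A toy sketch of what the 64-"word" index could look like (function names are mine). Note that the naive scorer below still walks entire posting lists, so the real speed has to come from the compression and query-evaluation techniques the book covers:

    from collections import Counter

    def build_inverted_index(keys, bits=64):
        # postings[i][b] lists the ids of the keys whose bit i equals b.
        postings = [([], []) for _ in range(bits)]
        for idx, key in enumerate(keys):
            for i in range(bits):
                postings[i][key >> i & 1].append(idx)
        return postings

    def top_matches(postings, keys, query, n=10, bits=64):
        # Score each key by the number of bit positions that agree with
        # the query; Hamming distance is then bits minus that score.
        agree = Counter()
        for i in range(bits):
            for idx in postings[i][query >> i & 1]:
                agree[idx] += 1
        best = sorted(agree, key=agree.get, reverse=True)[:n]
        return [(keys[idx], bits - agree[idx]) for idx in best]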

Jay Kominek
You make a good point about the solution I posted...FAIL
PeterAllenWebb
Unless my new solution is wrong, though, a lot of the suggested approaches will be overly complex.
PeterAllenWebb
+1  A: 

"Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions", from 2008, seems to be the best result as of then. I won't try to summarize since I read it over a year ago and it's hairy. That's from a page on locality-sensitive hashing, along with an implementation of an earlier version of the scheme. For more general pointers, read up on nearest neighbor search.

This kind of question has been asked before: Fastest way to find most similar string to an input?

Darius Bacon
That is about real numbers, not bits.
bayer
See the parts on the Hamming distance or L1 norm. Maybe I will go to the trouble of rereading this to summarize it, but I can't today. You're right that section 4, with its new result, works on Euclidean distances; I should've remembered that. Most of the paper, though, is a review covering metric spaces in general.
Darius Bacon
Also, the library linked to is said to support Hamming distance.
Darius Bacon
However, MinHash is much better and far easier to implement. That near optimal hashing paper uses a lot of geometry (like hyperspheres and stuff) which are just not necessary for bit strings.
bayer
+3  A: 

This sounds like a good fit for an S-Tree, which is like a hierarchical inverted file. Good resources on this topic include the following papers:

Hierarchical Bitmap Index: An Efficient and Scalable Indexing Technique for Set-Valued Attributes.

Improved Methods for Signature-Tree Construction (2000)

Quote from the first one:

The hierarchical bitmap index efficiently supports different classes of queries, including subset, superset and similarity queries. Our experiments show that the hierarchical bitmap index outperforms other set indexing techniques significantly.

These papers include references to other research that you might find useful, such as M-Trees.

Apocalisp
+3  A: 

Create a binary tree (specifically a trie) representing each key in your start set in the following way: The root node is the empty word, moving down the tree to the left appends a 0 and moving down the right appends a 1. The tree will only have as many leaves as your start set has elements, so the size should stay manageable.

Now you can do a recursive traversal of this tree, allowing at most n "deviations" from the query key in each recursive line of execution, until you have found all of the nodes in the start set which are within that number of deviations.
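
A rough sketch of the trie build and the bounded-deviation search, using plain nested dicts (the structure and names are mine, not from the answer itself):

    def build_trie(keys, bits=64):
        # Nested dicts; each level consumes one bit, most significant first.
        root = {}
        for key in keys:
            node = root
            for i in range(bits - 1, -1, -1):
                node = node.setdefault(key >> i & 1, {})
            node['key'] = key  # mark the leaf with the full key
        return root

    def search(node, query, budget, bits=64, depth=0, out=None):
        # Collect every stored key within `budget` bit flips of `query`,
        # pruning a branch as soon as its error budget would go negative.
        if out is None:
            out = []
        if depth == bits:
            out.append(node['key'])
            return out
        qbit = query >> (bits - 1 - depth) & 1
        for bit in (0, 1):
            if bit in node:
                cost = 0 if bit == qbit else 1
                if budget - cost >= 0:
                    search(node[bit], query, budget - cost, bits, depth + 1, out)
        return out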

PeterAllenWebb
This also supports changes in O(log(bitlength)) time
John Pirie
So you'd have a stack of recursive problems to solve. At the root, let's say your key has a "1" for the first bit. You'd push onto the stack the problem of finding all matches with up to k errors in the "1" subtree, and also the problem of finding all matches with up to k-1 errors in the "0" subtree. Repeat. Seems reasonable. (Even parallelizable.)
SPWorley
Could this be done more compactly by storing the keys in a simple sorted list? Then each level of recursion is just a range to search. It would be slower, because you have to do a binary search at each step to find the dividing point of the current range, but that may be a small cost. The big win: no pointer overhead, easy insertion and deletion, and all the data is local.
SPWorley
This idea obviously handles the "all keys with up to k errors" query. How about the similar problem of the 1000 nearest keys? Could that be solved by turning the stack of work into a heap, always working on the subjob with the most allowable errors left?
SPWorley
There are a couple of ways you could extend this to return a specific number of keys. The most natural way would be to iteratively increase the number of errors that you would allow until either the total set of words you've found is adequately large, or you run out of keys.
PeterAllenWebb
If we're dealing with 64 bit strings, and the tree only increases by length 1 for every level of depth added, you're going to have, oh, hm, about 2^64 nodes in your tree. While search time will be fast, the storage space is going to be, ah, excessive.
Jay Kominek
Updated answer to clarify why excessive space should not be needed.
PeterAllenWebb
So what does the tree look like if my only values are 100 and 001? Your description doesn't convey how to construct it with fewer than 15 nodes.
Jay Kominek
Er, sorry, 7 nodes.
Jay Kominek
It would have 2 leaves, but 7 nodes, just as you say. It's easy to see that the number of nodes couldn't possibly be more than 64 times the number of elements and will in fact be much smaller. Since the number of elements in the current scenario is "millions or billions" my instincts tell me it should be fine.
PeterAllenWebb
I'm not sure I understand; how would you compute Hamming distance between two nodes using this scheme?
wkf
A: 

If the data weren't so sparse, a graph with keys as the vertices and edges linking 'adjacent' (Hamming distance = 1) nodes would probably be very efficient time-wise. The space would be very large though, so in your case, I don't think it would be a worthwhile tradeoff.

Harper Shelby
A: 

I haven't completely thought this through, but I have an idea of where I'd start.

You could divide the search space up into a number of buckets where each bucket has a bucket key and the keys in the bucket are the keys that are more similar to this bucket key than any other bucket key. To create the bucket keys, you could randomly generate 64 bit keys and discard any that are too close to any previously created bucket key, or you could work out some algorithm that generates keys that are all dissimilar enough. To find the closest key to a test key, first find the bucket key that is closest, and then test each key in the bucket. (Actually, it's possible, but not likely, for the closest key to be in another bucket - do you need to find the closest key, or would a very close key be good enough?)
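
One possible sketch of the bucket-key idea; the parameters (1024 buckets, a 20-bit minimum separation between bucket keys) are arbitrary choices of mine:

    import random

    def hamming(a, b):
        return bin(a ^ b).count('1')

    def make_buckets(keys, num_buckets=1024, min_sep=20, bits=64, seed=0):
        # Pick random bucket keys that are pairwise at least min_sep bits
        # apart, then assign every data key to its closest bucket key.
        rng = random.Random(seed)
        centers = []
        while len(centers) < num_buckets:
            c = rng.getrandbits(bits)
            if all(hamming(c, other) >= min_sep for other in centers):
                centers.append(c)
        buckets = {c: [] for c in centers}
        for key in keys:
            buckets[min(centers, key=lambda c: hamming(c, key))].append(key)
        return centers, buckets

    def closest_in_bucket(centers, buckets, test):
        # Scan only the bucket whose key is closest to the test key; as
        # noted above, the true closest key can occasionally live elsewhere.
        center = min(centers, key=lambda c: hamming(c, test))
        return min(buckets[center], key=lambda k: hamming(k, test), default=None)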

David Johnstone
+5  A: 

What you want is a BK-Tree. It's a tree that's ideally suited to indexing metric spaces (your problem is one), and supports both nearest-neighbour and distance queries. I wrote an article about it a while ago.

BK-Trees are generally described with reference to text, using Levenshtein distance to build the tree, but it's straightforward to write one in terms of binary strings and Hamming distance.
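
A compact sketch of a BK-tree specialized to Hamming distance (the class layout and names are mine, not taken from the linked article):

    def hamming(a, b):
        return bin(a ^ b).count('1')

    class BKTree:
        # Each child hangs off its parent under the exact Hamming
        # distance between the two keys.
        def __init__(self, key):
            self.key = key
            self.children = {}  # distance -> subtree

        def add(self, key):
            d = hamming(key, self.key)
            if d in self.children:
                self.children[d].add(key)
            else:
                self.children[d] = BKTree(key)

        def within(self, query, radius, out=None):
            # The triangle inequality lets us skip any child whose edge
            # distance falls outside [d - radius, d + radius].
            if out is None:
                out = []
            d = hamming(query, self.key)
            if d <= radius:
                out.append(self.key)
            for dist, child in self.children.items():
                if d - radius <= dist <= d + radius:
                    child.within(query, radius, out)
            return out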

Nick Johnson
An interesting read (well 'reads' technically, since I read some of the papers as well). Especially nice because it's so easy to implement. Thanks!
wkf
Wow, the BK tree is clever and fascinating! It would work in this application, BUT it's not efficient at all.. the BK tree allows generalized edit distances and therefore can't make even partitions at each node branch. +1 for a great reference, though I think the simpler binary trees will work best for the bit-wise hamming distance.
SPWorley
I'm not quite sure I see the problem. Are you suggesting it's inefficient for nearest-neighbour, or inefficient in general? I freely admit I haven't looked in detail at how I'd do nearest neighbour in a BK-Tree, but I was under the impression it'd be fairly straightforward.
Nick Johnson
It's inefficient in that it's not partitioning in very uniform clumps. A BK tree is very general, but it means that every node has 31 children, each child with twice the number of children as its last sibling. So the tree becomes very very very deep compared to a binary tree (with maximum depth 32). However the generality of BK trees is great not just for this problem but for the edit distance problem.
SPWorley
I don't see where you get 31 children, or each level having twice as many children as the last, from. Also, a tree that broad would be shallow, not deep. I get the impression we're talking at cross-purposes somehow.
Nick Johnson
A: 

If you are okay with a randomized algorithm (Monte Carlo, in this case), you can use MinHash: http://blogs.msdn.com/spt/archive/2008/06/10/set-similarity-and-min-hash.aspx
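
A toy sketch of MinHash over these bitfields, treating each key as the set of its 1-bit positions. Note that MinHash estimates Jaccard similarity, which is related to but not the same as Hamming distance, so candidates it surfaces should still be re-checked exactly (the parameters and names here are mine):

    import random

    def minhash_signature(key, num_hashes=16, bits=64, seed=0):
        # For each random permutation of the bit positions, record the
        # first position (under that ordering) where the key has a 1.
        # Two keys agree in a slot with probability equal to the Jaccard
        # similarity of their sets of 1-bit positions.
        rng = random.Random(seed)  # fixed seed: same permutations for every key
        sig = []
        for _ in range(num_hashes):
            perm = list(range(bits))
            rng.shuffle(perm)
            sig.append(next((p for p in perm if key >> p & 1), -1))
        return tuple(sig)

    def estimated_similarity(sig_a, sig_b):
        return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)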

bayer
+1  A: 

The database/structure can be preprocessed as much as you like

Well... if that is true, then all you need is a similarity matrix of your Hamming distances. Make the matrix sparse by pruning out large distances. It doesn't get any faster, and it's not that much of a memory hog.
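
A sketch of what that precomputed sparse matrix could look like, assuming the query key is itself one of the stored keys (the threshold value is mine):

    from collections import defaultdict

    def hamming(a, b):
        return bin(a ^ b).count('1')

    def sparse_similarity_matrix(keys, max_dist=6):
        # Precompute all pairwise Hamming distances and keep only the
        # small ones; the O(n^2) build cost buys constant-time lookups
        # later, but only for query keys that are already in the data.
        near = defaultdict(list)
        for i, a in enumerate(keys):
            for b in keys[i + 1:]:
                d = hamming(a, b)
                if d <= max_dist:
                    near[a].append((b, d))
                    near[b].append((a, d))
        return near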

Ray
A: 

Assuming you have to visit each row to test its value (or, if you index on the bitfield, each index entry), you can write the actual test quite efficiently using

A xor B

to find the differing bits, then bit-count the result using a technique like:

http://www.crsr.net/files/population_count.pdf

This effectively gives you the Hamming distance.

Since this can compile down to tens of instructions per test, this can run pretty fast.
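
In Python, the same XOR-plus-popcount test looks like the sketch below (int.bit_count() needs Python 3.10+; older versions can use bin(a ^ b).count('1') instead):

    def hamming_distance(a, b):
        # XOR exposes the differing bits; counting them gives the distance.
        return (a ^ b).bit_count()

    def within_cutoff(keys, query, cutoff):
        # Linear scan: a handful of instructions per key, which is often
        # fast enough in a compiled language even for millions of rows.
        return [k for k in keys if hamming_distance(k, query) <= cutoff]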

-Alex

Alex Brown
A: 

If you're OK with doing it probabilistically, I think there's a good way to solve question 2. I assume you have 2^30 data points and a cutoff, and you want to find all points within the cutoff distance of test.

One_Try()
    1. Generate randomly a 20-bit subset S of 64 bits
    2. Ask for a list of elements that agree with test on S (about 2^10 elements)
    3. Sort that list by Hamming distance from test 
    4. Discard the part of list after cutoff

You repeat One_Try as many times as you need, merging the lists as you go. The more tries you make, the more points you find. For example, if x is within 5 bits, you'll find it in one try with about (2/3)^5 = 13% probability, so if you make 100 tries you find all but roughly 10^-6 of such x. Total time: 100*(1000*log 1000).

The main advantage of this is that you're able to output answers to question 2 as you proceed, since after the first few tries you'll certainly find everything within distance not more than 3 bits, etc.

If you have many computers, you give each of them several tries, since they are perfectly parallelizable: each computer saves some hash tables in advance.
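
A sketch of the precomputed tables plus the One_Try loop above (function names and the dict layout are mine):

    import random

    def hamming(a, b):
        return bin(a ^ b).count('1')

    def precompute_tables(keys, num_tries=100, subset_size=20, bits=64, seed=0):
        # For each try, fix a random 20-bit subset and hash every key by
        # its projection onto that subset ("each computer saves some hash
        # tables in advance").
        rng = random.Random(seed)
        tables = []
        for _ in range(num_tries):
            subset = rng.sample(range(bits), subset_size)
            table = {}
            for key in keys:
                proj = tuple(key >> p & 1 for p in subset)
                table.setdefault(proj, []).append(key)
            tables.append((subset, table))
        return tables

    def one_try(subset, table, query, cutoff):
        # Steps 2-4 of One_Try: fetch the keys that agree with the query
        # on the subset, then keep those within the cutoff.
        proj = tuple(query >> p & 1 for p in subset)
        return [k for k in table.get(proj, []) if hamming(k, query) <= cutoff]

    def query_all(tables, query, cutoff):
        found = set()
        for subset, table in tables:
            found.update(one_try(subset, table, query, cutoff))
        return found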

ilya n.