ansaurus

Question

Package for fast determination of similarity between two bit sequences

Answer 1

+2 A:

If the database is rather static, you may want to build a tree data structure on it.

Search the tree recursively or in multiple threads and per search keep an actual difference variable. If the actual difference becomes greater than what you would consider 'similar', abort the search.

E.g. Suppose we have the following tree:

      root
   0       1
 0   1   0   1
0 1 0 1 0 1 0 1

If you want to look for patterns similar to 011, and only want to allow 1 different bit at most, search like this (recursively or multi-threaded):

Start at the root
Take the left branch (0), this is similar, so difference is still 0
- Take the left branch (0), this is different, so difference becomes 1, which is still acceptable
  - take the left branch (0), this is different, so difference becomes 2, which is too high. Abort looking in this branch.
  - take the right branch (1), this is equal, so difference remains 1, continue to search in this branch (not shown here)
- Take the right branch (1), this is equal, so difference remains 0, go on
  - take the left branch (0), this is different, so difference becomes 1, which is still acceptable, go on.

This goes on until you have found your bit patterns.

If your bit patterns are more dynamic and being updated in your application, you will have to update the tree.

If memory is a problem, consider going to 64-bit.

Patrick 2010-02-14 11:17:43

I like your approach. However, the problem with both your answers is that we they need an a priori definition of similarity, whereas we just want to retrieve the top 50 hits, even if the similarity score for those 50 best hits is low. Perhaps I should have included this in my question...

Simmer 2010-02-15 13:50:16

Answer 2

+1 A:

I came up with a second alternative.

For every bit pattern of the million ones count the number of bits and store the bit patterns in an STL multi_map (if you're writing in C++).

Then count the number of bits in your pattern. Suppose you have N bits set in your bit pattern.

If you now want to allow at most D differences, look up all the bit patterns in the multi_map having N-D, N-D+1, ..., N-1, N, N+1, ... N+D-1, N+D bits.

Unfortunately, the division of bit patterns in the multi_map will follow a Gaussian pattern, which means that in practice you will still have to compare quite some bit patterns.

(Originally I thought this could be solved by counting even 0's and uneven 1's but this isn't true.)

Assuming that you want to allow 1 difference, you have to look up 3 slots in the multi_map out of the 100 possible slots, leaving you with 3% of the actual bit patterns to do a full compare.

Patrick 2010-02-14 11:43:21

See my comment on your other answer

Simmer 2010-02-15 13:50:51

Answer 3

+2 A:

If you want to look up the, let's say 50, most matching patterns, and we can assume that the input data set is rather static (or can be dynamically updated), you can repeat the initial phase of the previous comment, so:

For every bit pattern, count the bits.
Store the bit patterns in a multi_map (if you use STL, Java probably has something similar)

Then, use the following algorithm:

Make 2 collections: one for storing the found patterns, one for storing possibly good patterns (this second collection should probably be map, mapping 'distances' to patterns)
Take your own pattern and count the bits, assume this is N
Look in the multimap at index N, all these patterns will have the same sum, but not necessarily be completely identical
Compare all the patterns at index N. If they are equal store the result in the first collection. If they are not equal, store the result in the second collection/map, using the difference as key.
Look in the multimap at index N-1, all these patterns will have a distance of 1 or more
Compare all the patterns at index N-1. If they have a distance of 1, store them in the first collection. If they have a larger distance, store the result in the second collection/map, using the difference as key.
Repeat for index N+1
Now look in the second collection/map and see if there is something stored with distance 1. If it is, remove them from the second collection/map and store them in the first collection.

Repeat this for distance 2, distance 3, ... until you have enough patterns.

If the number of required patterns is not too big, and the average distance is also not too big, then the number of real compares between patterns is probably only a few %.

Unfortunately, since the patterns will be distributed using a Gaussian curve, there will still be quite some patterns to check. I didn't do a mathematical check on it, but in practice, if you don't want too many patterns out of the millions, and the average distance is not too far, you should be able to find the set of most-close patterns by checking only a few percent of the total bit patterns.

Please keep me updated of your results.

Patrick 2010-02-15 17:40:33

I will have to see how this works out for my data. I will check how the number of bits per pattern is distributed. It is very nice that you have taken an interest in my problem! I will keep you updated.

Simmer 2010-02-19 12:59:50

ansaurus

tags:

views:

answers:

Package for fast determination of similarity between two bit sequences

related questions