If you want to look up the, let's say 50, most matching patterns, and we can assume that the input data set is rather static (or can be dynamically updated), you can repeat the initial phase of the previous comment, so:
- For every bit pattern, count the bits.
- Store the bit patterns in a multi_map (if you use STL, Java probably has something similar)
Then, use the following algorithm:
- Make 2 collections: one for storing the found patterns, one for storing possibly good patterns (this second collection should probably be map, mapping 'distances' to patterns)
- Take your own pattern and count the bits, assume this is N
- Look in the multimap at index N, all these patterns will have the same sum, but not necessarily be completely identical
- Compare all the patterns at index N. If they are equal store the result in the first collection. If they are not equal, store the result in the second collection/map, using the difference as key.
- Look in the multimap at index N-1, all these patterns will have a distance of 1 or more
- Compare all the patterns at index N-1. If they have a distance of 1, store them in the first collection. If they have a larger distance, store the result in the second collection/map, using the difference as key.
- Repeat for index N+1
- Now look in the second collection/map and see if there is something stored with distance 1. If it is, remove them from the second collection/map and store them in the first collection.
Repeat this for distance 2, distance 3, ... until you have enough patterns.
If the number of required patterns is not too big, and the average distance is also not too big, then the number of real compares between patterns is probably only a few %.
Unfortunately, since the patterns will be distributed using a Gaussian curve, there will still be quite some patterns to check. I didn't do a mathematical check on it, but in practice, if you don't want too many patterns out of the millions, and the average distance is not too far, you should be able to find the set of most-close patterns by checking only a few percent of the total bit patterns.
Please keep me updated of your results.