views:

61

answers:

3

Like the title says, I'm trying to find elements of M that exist in the large constant array N. Most of the time, no element of M will exist in N, so the vast majority of searches done on M are a waste of time.

I'm looking for some way to create an index to check before doing a full-scale search of M. A project similar to mine creates a bit array from the first few bytes of every element of M, and from what I understand, leverages bit-level parallelism to search it quickly. I don't entirely understand how this works.

So what tricks can I use to cut down the chance of searching M unnecessarily?

This is a mostly language-independent question, but just to be as complete as possible, I'm using C++.

+2  A: 

You could build a hashtable with the values of N as keys.

Then you try to access hash[M[i]]; if it returns a value, the element exists. That lookup is O(1) (disregarding collisions).
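A minimal sketch of that idea in C++, using std::unordered_set (the element type is just a placeholder for whatever M and N actually hold):

    #include <string>
    #include <unordered_set>
    #include <vector>

    // Placeholder element type; substitute whatever M and N actually contain.
    using Element = std::string;

    // Build the lookup structure once from the constant array N.
    std::unordered_set<Element> make_index(const std::vector<Element>& N) {
        return std::unordered_set<Element>(N.begin(), N.end());
    }

    // Expected O(1) membership test for each element of M.
    bool in_N(const std::unordered_set<Element>& index, const Element& m) {
        return index.count(m) != 0;
    }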

Vinko Vrsalovic
+1  A: 

Since N is static, you might consider creating a perfect hash function for N. That makes the lookup guaranteed O(1).

The CLR book on algorithms has a chapter on this, and the wiki page above has links you might find useful. It might be too complicated, though, and you might be hard-pressed to find a useful implementation. Look at Gperf for one.

You could always use a commonly available hash table with expected O(1) lookups, though.

I suppose you are storing some extra information that you want to retrieve once you know the element is there? How are you storing that?

You might find a B-tree useful in that case (industry-standard databases usually use some variant of those), and it could even serve as the index itself! You search, and if you find the key, you have the data/pointer to it right there. You will find many implementations on the web.
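A rough sketch of the "index that also holds the data" idea, using std::map (an ordered tree, typically a red-black tree rather than a true B-tree; the key and payload types here are made up):

    #include <map>
    #include <optional>
    #include <string>

    // Hypothetical key and payload types, purely for illustration.
    using Key = std::string;
    struct Payload { int value; };

    // One ordered structure serves as both the index and the storage:
    // the lookup is O(log n), and a hit hands back the stored data directly.
    std::optional<Payload> find_in_index(const std::map<Key, Payload>& index,
                                         const Key& k) {
        auto it = index.find(k);
        if (it == index.end()) return std::nullopt;
        return it->second;
    }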

Moron
Good point; with static targets, a perfect hash can eliminate most of a hash table's issues by getting rid of collisions.
Javier
@Javier: I guess they have a reason for calling it "perfect" :-)
Moron
+4  A: 

You might be thinking of Bloom filters, which are used for exactly this case. They can give you false positives, in which case you still have to search the real table, but in most cases they will tell you up front that the item isn't stored.

Hash tables are usually the best option for storage; but if your key space is vastly larger than the number of targets, you'll have a sizable number of hash collisions where you have to check whether the target stored there is really the key you're looking for. If key comparison is expensive, that can quickly become a significant cost.
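A minimal Bloom filter sketch (the bitmap size, hash count, and the way the second hash is derived are all illustrative, not a production recipe):

    #include <bitset>
    #include <functional>
    #include <string>

    class BloomFilter {
        static constexpr std::size_t kBits = 1 << 20;  // bitmap size; tune to your data
        static constexpr int kHashes = 4;              // probes per key
        std::bitset<kBits> bits_;

        // Derive several probe positions from two base hashes.
        static std::size_t probe(const std::string& key, int i) {
            std::size_t h1 = std::hash<std::string>{}(key);
            std::size_t h2 = std::hash<std::string>{}(key + "#salt");  // crude second hash
            return (h1 + i * h2) % kBits;
        }

    public:
        void add(const std::string& key) {
            for (int i = 0; i < kHashes; ++i) bits_.set(probe(key, i));
        }

        // true  -> "maybe present": fall through to the real search
        // false -> "definitely not present": skip the search entirely
        bool maybe_contains(const std::string& key) const {
            for (int i = 0; i < kHashes; ++i)
                if (!bits_.test(probe(key, i))) return false;
            return true;
        }
    };

You would populate the filter once from the set you search in, consult it before each real search, and only fall through to the expensive search on a hit.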

Javier
Cool, I hadn't heard of those
Vinko Vrsalovic
Ah, perfect. I'll see what other people can come up with before accepting this, but this seems to be the answer I was looking for. Thanks!
GenTiradentes
Yep, Bloom filters. And then back that up by keeping N or M sorted. With one or both sorted, you can reduce the cost of your collision checks by jumping into the middle (binary search) and short-circuiting out before the end.
Jason
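A sketch of the sorted-fallback idea from the comment above (assuming the searched set can be kept as a sorted array):

    #include <algorithm>
    #include <string>
    #include <vector>

    // Fallback check after a Bloom-filter hit: binary search on a sorted copy
    // of the searched array, O(log n) per probe instead of a linear scan.
    bool confirmed_present(const std::vector<std::string>& sorted_set,
                           const std::string& candidate) {
        return std::binary_search(sorted_set.begin(), sorted_set.end(), candidate);
    }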
Bloom filters are useful if N is very large and space is at a premium, i.e. when storing all of N outright would be prohibitive. Perfect hash functions achieve perfect hashing, i.e. one probe, with O(N) space usage. In terms of execution time, perfect hashes will beat Bloom filters.
Moron
@Moron: right, with a constant set of keys perfect hashes are far better, since a Bloom filter needs multiple hashes. In this case, Bloom filters would be appropriate only if the full table won't fit in memory but the Bloom bitmap would (it typically needs around 4-10 bits per element).
Javier
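For a ballpark on the "4-10 bits per element" figure mentioned above, the standard Bloom filter sizing formulas can be evaluated directly (a small illustrative calculation with made-up input values):

    #include <cmath>
    #include <cstdio>

    int main() {
        double n = 1e6;   // number of stored elements (example value)
        double p = 0.01;  // target false-positive rate (example value)
        // Optimal bitmap size m and hash count k for a classic Bloom filter:
        double m = -n * std::log(p) / (std::log(2.0) * std::log(2.0));
        double k = (m / n) * std::log(2.0);
        std::printf("bits per element: %.1f, hash functions: %.1f\n", m / n, k);
        // For p = 1% this prints roughly 9.6 bits per element and ~6.6 hashes.
    }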