Contains Test for a Constant Set

The problem statement:

Given a set of integers that is known in advance, generate code to test if a single integer is in the set. The domain of the testing function is the integers in some consecutive range.

Nothing in particular is known now about the range or the set to be tested. The range could be small or huge (but a solution can reject problems that are to big but higher limits are better). It could be that very few of the values in the allowed range are in the set or most of them are or anything in between. The set may be uniformly distributed or clustered. There may be large sections of only contained/not-contained values or there may be at least a few of each type of value in most swaths. (sort of like the assumption made about items to be sorted when analyzing sorting algorithms)

The objective is a procedure for generating effective code for running the test.

Partial solutions that come to mind include

perfect hash function (costly for large sets)
range tests: foreach(b in ranges) if(b.l <= v && v <= b.h) return true;
trees/indexes (more costly than others in some cases)
table lookup (costly for large sets)
the inverse of any of these these (kodos to Jason S)

It seems that an ideal solution would be able to pick what option is best or if none work well, use a tree to break down the full range into sections and then switch to other options for subsection that are better suited to them.

Topic(s) that might be useful include:

Huffman coding

Note: this is not homework. if it was issued as homework below the doctoral level the prof should be shot with a Nerf gun (if you don't get that then re-read the problem, it is very much non trivial)

Note: This is a problem that occurred to me a few days a go and I've been puzzling over it off and on. I have no direct use for this but thought it would be a cool problem to attack. The reason that I wan to generate the code is because generated code will be no slower than general code (it can be the same thing if needed) and might be faster in some/many cases.

I'm posting this question as much to clarify my thoughts as anything. If I can come up with any reasonable or cool solutions I plan on implementing them as a template meta program (the other reason for generated code)

some people have noted that the problem is very general. That is the point. I'm hoping to generate a system that would work an a very general domain: sets of integers in some range.

Bloom filters are an extension to hashing. If I'm going that way, then I might as well do a minimum perfect hash or the like. It is a good thought and might be good for related questions.

BCS 2009-01-02 20:19:51

added some pragmatic notes to address your points. (even qsort will fail if the list so large as to run out of stack space)

BCS 2009-01-02 20:18:56

For unit test 1 it would be more efficient to handle the complement of the base set.

Jason S 2009-01-02 20:48:43

@[Jason S]: true, but given the problem description ("Nothing in particular is known now about the range or the set to be tested") you would have no way of knowing that!

Steven A. Lowe 2009-01-02 21:11:37

Jason S 2009-01-02 21:24:43

so your solution is a set of heuristics guessing at which special-case optimizations to apply; can implement as the strategy pattern for code generation, that would work - but since the problem is open ended, you would have an unbounded set of possible heuristics

Steven A. Lowe 2009-01-03 06:00:28

ansaurus

tags:

views:

answers:

Contains Test for a Constant Set

related questions