views: 110

answers: 3

The problem statement:

Given a set of integers that is known in advance, generate code to test if a single integer is in the set. The domain of the testing function is the integers in some consecutive range.


Nothing in particular is known in advance about the range or the set to be tested. The range could be small or huge (a solution may reject problems that are too big, but higher limits are better). Very few of the values in the allowed range might be in the set, or most of them, or anything in between. The set may be uniformly distributed or clustered. There may be large sections of only contained/not-contained values, or there may be at least a few of each type of value in most swaths. (This is rather like the assumptions made about the items to be sorted when analyzing sorting algorithms.)

The objective is a procedure for generating effective code for running the test.

Partial solutions that come to mind include

  • perfect hash function (costly for large sets)
  • range tests: foreach(b in ranges) if(b.l <= v && v <= b.h) return true;
  • trees/indexes (more costly than others in some cases)
  • table lookup (costly for large sets)
  • the inverse of any of these (kudos to Jason S)

It seems that an ideal solution would be able to pick whichever option is best or, if none works well, use a tree to break the full range into sections and then switch to other options for the subsections that are better suited to them.

Topic(s) that might be useful include:


Note: this is not homework. If it were issued as homework below the doctoral level, the prof should be shot with a Nerf gun (if you don't get that, re-read the problem; it is very much non-trivial).

Note: This is a problem that occurred to me a few days ago and I've been puzzling over it off and on. I have no direct use for this, but I thought it would be a cool problem to attack. The reason I want to generate the code is that generated code will be no slower than general code (it can be the same thing if needed) and might be faster in some/many cases.

I'm posting this question as much to clarify my thoughts as anything. If I can come up with any reasonable or cool solutions, I plan on implementing them as a template metaprogram (the other reason for generated code).

Some people have noted that the problem is very general. That is the point. I'm hoping to generate a system that would work on a very general domain: sets of integers in some range.

+1  A: 

a previous question on dictionary/spellchecking had a number of responses that mentioned Bloom filters; maybe that would help.

I would think that testing for large sets is going to be expensive no matter what.

Jason S
Bloom filters are an extension of hashing. If I'm going that way, then I might as well do a minimal perfect hash or the like. It is a good thought, though, and might be good for related questions.
BCS
+1  A: 

let's pretend, for a moment, that this is a real question:

  • there are no limits on the size of the base set or the input set

this makes the "problem" unrealistic, underspecified, and unsolvable in any practical sense

if someone wants to posit a solution, here are some unit test cases:

  • unit test 1:

    • the base set is all integers between -1,000,000,000,000 and +1,000,000,000,000 except for 100,000,000,000 randomly-removed values
    • the input set is 100,000,000,000 randomly-generated integers in the same range
  • unit test 2:

    • the base set is the Fibonacci series
    • the input set is 1T randomly-generated integers in the range 0..infinity
Steven A. Lowe
added some pragmatic notes to address your points. (even qsort will fail if the list is so large as to run out of stack space)
BCS
For unit test 1 it would be more efficient to handle the complement of the base set.
Jason S
@[Jason S]: true, but given the problem description ("Nothing in particular is known now about the range or the set to be tested") you would have no way of knowing that!
Steven A. Lowe
so your solution is a set of heuristics guessing at which special-case optimizations to apply; you could implement that as the strategy pattern for code generation, and that would work - but since the problem is open-ended, you would have an unbounded set of possible heuristics
Steven A. Lowe
+1  A: 

there's also boost::dynamic_bitset; I'm not sure how it scales in time, or in space with respect to the distribution of the original numbers. (e.g. if the bits are stored in chunks of 8/16/32/64, then sparse bitsets are inefficient)

or perhaps this (compressed bit set) or this (bit vector) webpage (I googled for "large sparse bit sets" and "compressed bit sets")

Jason S
BCS