views:

114

answers:

3

I have a possible solution to a problem I'm trying to address, but I wanted to run it by here just to be on the safe side. The challenge is to ensure that a user that has gone through some test questions in an exam application doesn't encounter them again during a subsequent test.

I'm not using a SQL database, which would allow me to use left joins, subqueries, temporary tables and so on. I'm using Google App Engine's datastore and hope to get the information I need within a single HTTP request and a single thread, preferably in under a second.

Suppose there is a pool of 100,000 vocabulary questions of a certain type (e.g., synonyms). The application would select 30 questions from the pool for a given section of an exam. Here's what I was thinking of doing:

  • When a question is fist created, assign it a random integer position within the pool.
  • At the time of the individual's first exam, choose a random number, then select the first 100 questions, ordered by position, whose positions are larger than that number. Keep track of the number as the lower bound of a window of questions and also as the starting position for the pool as a whole. Keep track of the (maximum) position of the last question in the result set as the upper bound of the new window.
  • Select 30 questions at random from the window and then give them as the section. Store the remaining 70 questions for use later on in the exam and possibly in a subsequent exam.
  • As the user goes through additional sections (for practice, say), and the list of remaining questions in the current window is depleted, select the next 100 questions from the pool whose positions are greater than the upper bound that was stored earlier. Make the old upper bound the new lower bound and find the upper bound of the new window.
  • When a query returns less than 100 questions, wrap around to a position of 0 and continue until the original starting point is encountered (it's unlikely that anyone would go through the entire pool, but it's important to be sure).

The main reason the positions are randomly assigned is to cancel out the effects of changes in the style of how questions are written, e.g., questions that were written earlier on when there was less experience versus later ones.

The application would assign a position to a question without checking to see that the position is unique. If there are a large enough number of questions, the birthday paradox suggests that duplicate positions will become more and more common. My thought was that it won't hurt to have occasional duplicates, and that this would make things simpler than ensuring that a given position is unique, which might entail a retry and the associated network costs that go with it. It's more important that there be no repeat questions than to ensure that every question within a range of questions is shown to a user.

Is there a better way to do this? Is it ok not to worry about duplicate positions?

+1  A: 

Use a floating point number between 0 and 1 instead of an integer. It has a nice domain, which doesn't change with the number of entities you have, and doubles have a 52 bit mantissa, which gives us approximately 2^26 objects before we can expect a collision; substantially more than what you're dealing with.

Nick Johnson
+1  A: 

I don't think you need to "choose 30 at random" out of your 100 question pool. The 100 were selected randomly to begin with - if you take the first 30, those will already be randomized. Your code will be simpler, with no less randomness.

Peter Recore
Very good point.The idea there was that, for smaller pools, 300 problems, say, I didn't want people to have the sense that they were going through approximately the same problem sets once they wrap around. So it was an extraneous detail in this case, and it (kind of) goes against the question being asked, i.e., what is an efficient way to prevent people from seeing the same problems over again? But I didn't want to make the original question too complex by throwing in too many considerations.
Eric W.
ahhh. Ultimately it depends on the type of randomness you want the users to experience. With the 'pick 30 out of subpools of 100' method, each time they wrap around the main pool, they will likely see question X 4 or 5 times before they see question Y, and there will be some questions they never see at all. If you go through the entire 300 in a random order, they will see every question 1 time before seeing any duplicates. To keep it interesting for users, you might create multiple whole-pool orderings, so that there might be 5 ways to go through the 300 items.
Peter Recore
+1  A: 

Put the items in some order and give each one a number. Maintain a counter for the user which starts at zero. Generate a random(ish) key for each user. Choose a function which maps the combination of an integer in the range 0...N and a key to another integer in the range 0...N, such that for any value off the key, the mapping between integers is bijective. When you need an item, take the value of the counter, put it through the function along with the key, and use that number to index into the list of items. Increment the counter.

All you need to store for each user is the key and the counter, and you have a guarantee that no item will ever be repeated. It's basically a way of constructing a huge permutation on the fly.

The problem is of course choosing the function!

A simple one would be f(counter, key) = (counter + key) mod N, and whilst this would work, it doesn't randomise the items at all, so everyone will get the items in the same order, just starting at different places. If N + 1 is prime, you can use F(counter, key) = (((counter + 1) * (key + 1)) mod (N + 1)) - 1, which would work pretty well. If the range was 0...2^64, you could use any block cipher which has a 64-bit block, which would give excellent randomisation. But it's not clear that you could adapt this to smaller sizes.

I'll have a bit of a dig to see if i can come up with a generically applicable function. This is a problem i've come across myself, and it would be great to finally have a good answer. I'll edit this answer if i find anything.

EDIT: Okay, here we go! I got the key ideas from a thread i started on sci.crypt, and in particular from one Paul Rubin, who is an all-round usenet hero.

Slight change of strategy. Put your items in a list, so they can be accessed by an index. Pick a number B, such that 2^B >= N - any value will do, but you really want the smallest one (ie the ceiling of the base-2 logarithm of N-1). We then refer to 2^B as M. Set up a counter which will run from 0 to M-1, and a key for each user, which can be an arbitrary byte string - a random integer is probably simplest. Set up the Magic Function, which is a bijection, aka a permutation, on the set of integers 0 ... M-1 (see below!). When you want an item, take the value of the counter, increment the counter, then put the original value through the Magic Function to get an index; if the index is greater than N-1, then throw it away, and repeat the process. Repeat until you get an index that is less than N. Sometimes, you'll need to go through one or more useless indices before you get a good one, but on average, it will take M/N tries, which is not too bad (that's less than 2 if you chose the smallest possible B).

Incidentally, calculating the ceiling of the base 2 logarithm of a number is easy:

int lb(int x) {
    int lb = 0;
    while (x > 0) {
        ++lb;
        x >>= 1;
    }
    return lb;
}

Okay, so the Magic Function, which maps a number from 0 ... M-1 to another such number. I mentioned block ciphers above, and that's what we're going to use. Except that our block is B bits long, which is variable and smaller than the normal 64 or 128 bits. So we need a cipher that works on small, variable-sized blocks. So we're going to write our own - a variable-sized mildly-unbalanced Feistel cipher. Easy!

To do a Feistel cipher, you need:

  • Numbers B_L and B_R such that B_L + B_R = B. In a balanced Feistel cipher, B_L = B_R, so B has to be even; we'll use B_L = B/2, and B_R = B - B_L, ie similar to a balanced network, but with a slightly bigger B_R when B is odd.
  • A number of rounds, called C; we'll count rounds with i, which will go from 0 to C-1. I've been told that 4 and 10 are absolute minimums, so let's go with 12 to be on the safe side.
  • A key schedule, which is just a key for each round, each called K_i, all derived from the main key i mentioned above. You could just use the same key for each round; proper ciphers have some way of generating subkeys from the master key. I'd suggest just concatenating the round number to the master key.
  • A round function, called F, which takes a B_R-bit sub-block and a round key, and produces a B_L-bit sub-block; within those constraints, it can be absolutely anything. This is the heart of a Feistel cipher. For us, the best option is just to use an already-existing cryptographic hash function, as this is easy and reliable. SHA-1 is the current favourite. We'll feed it the round key, followed by the input sub-block, compute the hash, then take B_L bits from one end (doesn't matter which) as our output.

A Feistel cipher works like this:

  1. Take a B-bit input word
  2. For i from 0 to C-1:
    1. Split the word into left and right sub-blocks L and R, with B_L and B_R bits respectively
    2. Put R and K_i through the round function F to produce a ciphering value X
    3. Add X to L, discarding any carry bit which overflows from the top bit
    4. Reassemble L and R the wrong way round, with R on the left and L on the right, to form a new value for B

The final value for B is the encrypted value, and is the output from the cipher. Decrypting it is pretty simple - do the above in reverse - but since we don't need to decrypt, don't worry about that.

So there you go. Maintain a counter (and a key, and the value of M), encrypt its value with a tiny cipher, and use the result as an index. Given that the cipher is a permutation, it's easy to show that this will never repeat, which should keep your clients happy. Even better, given the cryptographic properties of the cipher, you can also claim that the students will be unable to predict which question will come next (which is probably not important, but sounds cool).

All this is a bit more complicated than just incrementing a counter and giving them the items in order, but it's not that hard. You could do it in a hundred lines of java. Well, i can do it in a hundred lines of java - i don't know about you! :)

Incidentally, this will work with a growing pool of items, provided that you always add items at the end, although it will never pick items numbered higher than M. In many cases, though, that will give you some room for growth.

Tom Anderson
Very interesting approach. I'm definitely interested in seeing what you come up with. In my case I'll need to be able to assure educational organizations that, when an exam is administrated, the participant will not be going through problems he or she has encountered before, so this is a pretty important point. I suppose I can divide the pool into two pools, practice questions and test questions, but I'd prefer to draw from the same pool.
Eric W.
One question I had is whether it is possible to avoid assuming a fixed pool size for this approach. Can it be made to work when items are inserted in the middle of a pool or added to the end?
Eric W.
Okay, custom small-block Feistel cipher with a hash as the round function, plus a modified Hasty Pudding trick. Can be made to work for a growing pool provided the items are only added at the end. Will explain in detail tomorrow!
Tom Anderson