I'm studying for upcoming interviews and have encountered this question several times (written verbatim):

Find or determine non existence of a number in a sorted list of N numbers where the numbers range over M, M >> N and N large enough to span multiple disks. Algorithm to beat O(log n); bonus points for constant time algorithm.

First of all, I'm not sure if this is a question with a real solution. My colleagues and I have mused over this problem for weeks and it seems ill-formed (of course, just because we can't think of a solution doesn't mean there isn't one). A few questions I would have asked the interviewer:

  • Are there repeats in the sorted list?
  • What's the relationship between the number of disks and N?

One approach I considered was to binary search the min/max of each disk to determine which disk should hold the number, if it exists, and then binary search on that disk itself. Of course, this is only an order-of-magnitude speedup if the number of disks is large and you also have a sorted list of the disks' ranges. I think this would yield some sort of O(log log n) time.
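A minimal Python sketch of that two-level search, with an in-memory catalog of (min, max, numbers) tuples standing in for real disks and disk reads; everything here is illustrative, not from the question:

```python
import bisect

def find(disks, target):
    # disks: sorted list of (min_value, max_value, sorted_numbers), one per disk.
    # Step 1: binary search the per-disk minimums to find the only disk
    # whose range could contain the target: O(log k) for k disks.
    mins = [d[0] for d in disks]
    i = bisect.bisect_right(mins, target) - 1
    if i < 0 or target > disks[i][1]:
        return False  # target falls between disk ranges or outside all of them
    # Step 2: binary search within that disk's sorted numbers: O(log(N/k)).
    numbers = disks[i][2]
    j = bisect.bisect_left(numbers, target)
    return j < len(numbers) and numbers[j] == target

disks = [(1, 9, [1, 4, 9]), (10, 20, [10, 15, 20]), (30, 50, [30, 42, 50])]
print(find(disks, 15))  # True
print(find(disks, 25))  # False (falls in the gap between disk ranges)
```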

As for the M >> N hint: perhaps if you know how many numbers are on a disk and what their range is, you could use the pigeonhole principle to rule out some cases some of the time, but I can't see how to turn that into an order-of-magnitude improvement.

Also, "bonus points for constant time algorithm" makes me a bit suspicious.

Any thoughts, solutions, or relevant history of this problem?

+6  A: 

Strangely enough, the question is to determine the NON-EXISTENCE of a value, not the existence.

This may imply they are referring to a Bloom filter (http://en.wikipedia.org/wiki/Bloom_filter); a toy sketch follows the list. A Bloom filter can tell you whether an element:

  • doesn't exist
  • or possibly exists
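For illustration, a toy Bloom filter in Python; the sizes and the SHA-256 double-hashing scheme are arbitrary choices of mine, not anything the question specifies:

```python
import hashlib

class BloomFilter:
    # Toy Bloom filter: m bits, with k probe positions per element
    # derived by double hashing from a single SHA-256 digest.

    def __init__(self, m_bits=1 << 20, k_hashes=7):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item):
        digest = hashlib.sha256(str(item).encode()).digest()
        h1 = int.from_bytes(digest[:16], "big")
        h2 = int.from_bytes(digest[16:], "big") | 1  # odd, so probes spread
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        # False means "definitely not present"; True means "possibly present".
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

bf = BloomFilter()
for n in (3, 7, 42):
    bf.add(n)
print(42 in bf)  # True: possibly present (and in fact it is)
print(99 in bf)  # almost certainly False: definitely absent
```

Lookups cost a constant number of hash probes, which is what makes the "bonus points for constant time" plausible; the price is the false-positive rate discussed in the comments below.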
Patrick
Bloom filters are probabilistic. Also, the question says "find or determine non-existence", so it is not just non-existence. Even if it was just non-existence, there could be false positives (i.e. it does not exist, but we say it does).
Moron
@Patrick Isn't construction of a Bloom filter O(n)?
academicRobot
@academicRobot: Yes, but I presume we can do that once and amortize it off :)
Moron
I think if this is given as an answer then it should be stated that it is a probabilistic algorithm that helps 'only' in, say, 99% of cases (well, it should be stated :). Also, a strong argument that this is the intended answer lies in the example (http://en.wikipedia.org/wiki/Bloom_filter#Example): it seems Google's Big Table uses this, so I would not be surprised if they or their competitors/peers (everyone else?) would put this in an interview question (so I'm basically giving up my own answer :) )
Unreason
A: 

Your suggested solution is absolutely correct. A constant time algorithm is probably O(k), where k is the number of disks.

psihodelia
Wouldn't I still have to search the disk, making it non-constant?
Rich
O(k) is the same as O(1).
David Thornley
1) disks have limited space 2) N has no specified limit 3) that means k ~ N (or k grows in some other, still correlated way) 4) that means `O(k)` is equivalent to `O(N)` 5) Rich suggested binary searching the disks, then binary searching the list on the chosen disk, which is `O(log(k) + log(N/k))`, i.e. still O(log N).
viraptor
A: 

I think you can get some faster lookup times if you allow yourself the use of some meta-data.

Set up a number of indirect blocks or lists whose elements point to further indirect blocks/lists. Repeat until you get down to a final level of direct blocks/lists. The idea is similar to how some file systems access their file data (direct, indirect, doubly indirect, and triply indirect blocks). For the number ranges the question implies, you will quite likely need more than triple indirection.

Each part of the number you are looking up can refer to a separate index in the indirect/direct tables. Eventually, you will have broken the search down far enough that you can read the final section that may or may not contain the number. Then you can search this final section with an algorithm of your choice.

Hope this helps and makes sense.

Disclaimer: I'm going to lunch in a minute, so I have not completely thought this through; it may be impractical.
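For concreteness, a rough sketch of the scheme in Python, using each byte of the key as the index into one level of tables; the depth and block width (256 slots) are arbitrary assumptions of mine:

```python
import bisect

DEPTH = 3  # indirect levels; leaves hold short sorted runs of full numbers

def insert(root, number):
    node = root
    for level in range(DEPTH):
        byte = (number >> (8 * (DEPTH - level))) & 0xFF
        node = node.setdefault(byte, {} if level < DEPTH - 1 else [])
    bisect.insort(node, number)

def lookup(root, number):
    node = root
    for level in range(DEPTH):
        byte = (number >> (8 * (DEPTH - level))) & 0xFF
        if byte not in node:
            return False  # an empty indirect slot proves non-existence early
        node = node[byte]
    i = bisect.bisect_left(node, number)
    return i < len(node) and node[i] == number

root = {}
for n in (0x01020304, 0x01020399, 0xCAFEBABE):
    insert(root, n)
print(lookup(root, 0x01020304))  # True
print(lookup(root, 0x12345678))  # False
```

Lookup cost here is a fixed number of block reads (one per level) plus a small search in the leaf, and an empty slot at any level proves non-existence immediately.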

Sparky
This is still O(log n); the question asks for an algorithm that beats that.
Unreason
+3  A: 

If we only use comparisons, there is an Omega(log N) worst-case lower bound (i.e. O(1) is not possible).

Say you decide to look at some position in the array; your adversary can then answer so that the element you seek lies in the larger of the two remaining parts.

Thus after each comparison you still have at least half of the remaining elements to consider, and so you need Omega(log N) comparisons in the worst case.

So you would probably need to get away from using just comparisons in order to do better than O(log N) in the worst case.
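Making the counting explicit, as a recurrence on the worst-case number of comparisons T(N) (standard bookkeeping for this adversary argument, not from the original thread):

```latex
% After one comparison the adversary keeps the target in the larger part,
% which contains at least \lceil (N-1)/2 \rceil candidates:
T(N) \;\ge\; 1 + T\!\left(\left\lceil \tfrac{N-1}{2} \right\rceil\right),
\qquad T(1) \ge 1
\;\Longrightarrow\; T(N) = \Omega(\log N)
```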

As some other answer mentioned, you could do a probabilistic constant time search which gives you the right answer with reasonable probability, for e.g. using Bloom Filters.

Moron
+1 Nice proof that the worst case must be Omega(log N)
Unreason
+2  A: 

First look

I don't think M >> N is a hint toward a solution; it simply discourages creating a bitmap that would directly tell you in O(1) time whether a number exists.

I think the sane assumption, given that N spans multiple hard disks, is that you do not have orders of magnitude more disks at your disposal. A direct bitmap needs space proportional to M for O(1) lookups, and if N already spans multiple disks, then M >> N means such a bitmap would span far more disks than you have available.
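To put a rough number on it: if the values are, say, 64-bit (an assumption; the question does not say), a direct bitmap needs M = 2^64 bits = 2^61 bytes, i.e. about 2 EiB, no matter how small N is.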

Also, it tells you that the approach of storing the missing numbers instead would not be efficient, since then you would have to store X numbers where

X = M - N => X ~ M (since M >> N)

which is even worse.

So at first look it seems you can argue that there is no better deterministic answer.

EDIT: I still stand by the above reasoning, which Moron's answer proves more rigorously. However, after looking at the Bloom filter in Patrick's answer, I believe the interviewer might have been looking for that and other probabilistic algorithms (which should have been noted in the interview question).

Unreason
+4  A: 

By the letter of the question, they are probably looking for interpolation search, which is O(log log n) in the average case. Yes, it is O(n) in the worst case, but that can be improved with knowledge of the distribution or by using an interpolation-binary search hybrid.

This plays into the M >> N hint. The average-case analysis of interpolation search is pretty complex, so I won't even attempt to redo it under M >> N. But conceptually, under M >> N and assuming a uniform distribution, you can expect the value to lie within +/-1 of the initial probe position, yielding O(1).

A practical implementation could do the initial interpolation once and, if the search value is not within that bound, fall back to a binary search.
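A minimal sketch of that variant, assuming uniformly distributed values; the function names and the +/-1 window are illustrative, not from the question:

```python
import bisect

def interpolation_probe(a, x):
    # Guess x's index assuming values are spread roughly uniformly
    # over [a[0], a[-1]].
    lo, hi = 0, len(a) - 1
    if a[hi] == a[lo]:
        return lo
    pos = lo + (x - a[lo]) * (hi - lo) // (a[hi] - a[lo])
    return max(lo, min(hi, pos))

def find(a, x):
    if not a or x < a[0] or x > a[-1]:
        return False  # outside the stored range: provably absent
    p = interpolation_probe(a, x)
    # Under M >> N and a uniform distribution, x should sit within
    # +/-1 of the initial probe.
    for i in (p - 1, p, p + 1):
        if 0 <= i < len(a) and a[i] == x:
            return True
    # Probe missed: fall back to an ordinary binary search.
    i = bisect.bisect_left(a, x)
    return i < len(a) and a[i] == x

print(find([2, 19, 33, 47, 61, 80, 95], 47))  # True
print(find([2, 19, 33, 47, 61, 80, 95], 50))  # False
```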

Not sure how multiple disks could be used to advantage in this approach, though...

academicRobot
Perhaps one step of an initial interpolation gets you to the disk with the correct bounds, then you lop off a huge portion of your array?
Rich
+1, some notes: if the distribution is not uniform, interpolation/extrapolation search can still be used as long as you know the distribution (with a uniform distribution the interpolation is linear; with other distributions it is non-linear but still determined by the distribution). Building a perfect representation of the distribution could require a lot of space, so some approximation/loss of precision would have to be used; it is unclear whether that would cancel the potential benefit.
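To illustrate that note: if the distribution's CDF F is known or approximated, the initial probe generalizes from the linear guess to index ~ n * F(x). A hypothetical sketch (the exponential CDF here is an arbitrary example, not from the thread):

```python
import math

def cdf_probe(a, x, cdf):
    # Initial probe index for x in sorted list a, given an assumed CDF.
    n = len(a)
    return max(0, min(n - 1, int(cdf(x) * n)))

# Illustrative assumption: values drawn from Exponential(rate=1),
# whose CDF is F(x) = 1 - exp(-x).
print(cdf_probe([0.1, 0.4, 0.7, 1.2, 2.5], 0.7, lambda x: 1 - math.exp(-x)))  # 2
```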
Unreason