There was a post on here recently which posed the following question:

You have a two-dimensional plane of (X, Y) coordinates. A bunch of random points are chosen. You need to select the largest possible set of chosen points, such that no two points share an X coordinate and no two points share a Y coordinate.

This is all the information that was provided.

There were two possible solutions presented.

One suggested using a maximum flow algorithm, such that each chosen point (x, y) maps to a path source → x → y → sink. This runs in O(V³) time, where V is the number of points selected.
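
For concreteness, here is a rough sketch of that formulation (my own illustration, not part of the original post); it assumes Python with networkx, and the function name and unit-capacity construction are assumptions on my part:

    import networkx as nx

    def largest_point_set_flow(points):
        """Size of the largest set of chosen points with all-distinct X and Y,
        via the source -> X -> Y -> sink flow construction (unit capacities)."""
        G = nx.DiGraph()
        for x, y in points:
            G.add_edge("source", ("x", x), capacity=1)  # each X column used at most once
            G.add_edge(("x", x), ("y", y), capacity=1)  # one edge per chosen point
            G.add_edge(("y", y), "sink", capacity=1)    # each Y row used at most once
        flow_value, _ = nx.maximum_flow(G, "source", "sink")
        return flow_value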

Another (mine) suggested using the Hungarian algorithm. Create an n×n matrix of 1s, then set every chosen (x, y) coordinate to 0. The Hungarian algorithm will give you the lowest-cost assignment for this matrix, and the answer is the number of assigned coordinates that equal 0. This runs in O(n³) time, where n is the greater of the number of rows and the number of columns.
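
A minimal sketch of that setup, assuming Python with SciPy (whose linear_sum_assignment solves the same assignment problem as the Hungarian algorithm; the function name and the 0-indexed coordinates are my assumptions):

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def largest_point_set_assignment(points, n):
        """Same answer posed as an n x n assignment problem: cost 1 everywhere,
        cost 0 at each chosen (x, y); count the zero-cost assignments used."""
        cost = np.ones((n, n))
        for x, y in points:                        # coordinates assumed 0-indexed
            cost[x, y] = 0.0
        rows, cols = linear_sum_assignment(cost)   # minimum-cost assignment
        return int((cost[rows, cols] == 0).sum())  # chosen points the assignment uses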

My reasoning is that, in the vast majority of cases, the Hungarian algorithm is going to be faster: V equals n when there is exactly one chosen point per row or column, and is substantially greater whenever there are more than that. For example, given a 50×50 matrix with half the coordinates chosen, V is 1,250 and n is 50.

The counterargument is that there are some cases, like a 10⁹×10⁹ matrix with only two points selected, where V is 2 and n is 1,000,000,000. For that case, the Hungarian algorithm takes a ridiculously long time to run, while the maximum flow algorithm is blindingly fast.

Here is the question: given that the problem doesn't provide any information regarding the size of the matrix or the probability that a given point is chosen (so you can't know for sure), how do you decide which algorithm is, in general, the better choice for the problem?

+1  A: 

Given that you don't know what each pill does, do you take the red pill or the blue pill?

If there really is not enough information to decide, there is not enough information to decide. Any guess is as good as any other.

Maybe, in some cases, it is possible to divine extra information to base the decision on. I haven't studied your example in detail, but it seems like the Hungarian algorithm might have higher memory requirements. This might be a reason to go with the maximum flow algorithm.

Thomas
+1  A: 

You don't. I think you illustrated that clearly enough. I think the proper practical solution is to spawn off both implementations in different threads and take whichever response comes back first. If you want to be more clever, you can heuristically route each request to one implementation or the other.

Many algorithms require amounts of memory beyond the physical maximum of a machine, and in those cases the algorithm that is less efficient in time but more efficient in space is chosen.

Given that we have distributed parallel computing, I say you just let both horses run and let the results speak for themselves.
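
For instance, a rough sketch of letting both horses run, assuming Python's concurrent.futures and the two hypothetical solvers sketched in the question above:

    from concurrent.futures import ProcessPoolExecutor, wait, FIRST_COMPLETED

    def race_both(points, n):
        """Run both solvers in parallel and take whichever answer arrives first.
        (Both solver functions must be importable at module level for pickling.)"""
        with ProcessPoolExecutor(max_workers=2) as pool:
            futures = [pool.submit(largest_point_set_flow, points),
                       pool.submit(largest_point_set_assignment, points, n)]
            done, _ = wait(futures, return_when=FIRST_COMPLETED)
            result = done.pop().result()
        # Leaving the with-block still waits for the slower solver to finish;
        # a production version would terminate the losing worker process instead.
        return result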

Stefan Kendall
A: 

Theoretically, they are both the same, because asymptotic analysis only compares how the number of operations grows as the size of the problem increases to infinity.

The way your problem is defined, it has two size parameters, n and the number of points, so this question has no answer.

m1tk4
+2  A: 

You can't; it's an imponderable.

You can only define which is better "in general" by defining what inputs you will see "in general". So for example you could whip up a probability model of the inputs, so that the expected value of V is a function of n, and choose the one with the best expected runtime under that model. But there may be arbitrary choices made in the construction of your model, so that different models give different answers. One model might choose co-ordinates at random, another model might look at the actual use-case for some program you're thinking of writing, and look at the distribution of inputs it will encounter.
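
For example, under a toy model in which each of the n² cells is chosen independently with probability p (my assumption, not anything the question specifies), the expected number of chosen points is p·n², and a crude rule of thumb falls out:

    def prefer_max_flow(n, p):
        """Toy heuristic: expected V is p * n**2 under the independent-choice model,
        so prefer the flow formulation when (p * n**2)**3 < n**3, i.e. when p < 1/n.
        (Compares expected problem sizes, not true expected runtimes.)"""
        expected_v = p * n * n
        return expected_v ** 3 < n ** 3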

You can alternatively talk about which has the best worst case (across all possible inputs with given constraints), which has the virtue of being easy to define, and the flaw that it's not guaranteed to tell you anything about the performance of your actual program. So for instance HeapSort is faster than QuickSort in the worst case, but slower in the average case. Which is faster? Depends whether you care about average case or worst case. If you don't care which case, you're not allowed to care which "is faster".

This is analogous to trying to answer the question "what is the probability that the next person you see will have an above (mean) average number of legs?".

We might implicitly assume that the next person you meet will be selected at random with uniform distribution from the human population (and hence the answer is "slightly less than one", since the mean is less than the mode average, and the vast majority of people are at the mode).

Or we might assume that your next meeting with another person is randomly selected with uniform distribution from the set of all meetings between two people, in which case the answer is still "slightly less than one", but I reckon not the exact same value as the first - one-and-zero-legged people quite possibly congregate with "their own kind" very slightly more than their frequency within the population would suggest. Or possibly they congregate less, I really don't know, I just don't see why it should be exactly the same once you take into account Veterans' Associations and so on.

Or we might use knowledge about you - if you live with a one-legged person then the answer might be "very slightly above 0".

Which of the three answers is "correct" depends precisely on the context which you are forbidding us from talking about. So we can't talk about which is correct.

Steve Jessop
+1  A: 

This is a valid question, but there's no "right" answer — they are incomparable, so there's no notion of "better".

If your interest is practical, then you need to analyze the kinds of inputs that are likely to arise in practice, as well as the practical running times (constants included) of the two algorithms.

If your interest is theoretical, where worst-case analysis is often the norm, then, in terms of the input size, the O(V³) algorithm is better: you know that V ≤ n², but you cannot polynomially bound n in terms of V, as you showed yourself. Of course, the theoretically best algorithm is a hybrid that runs both and stops as soon as either one finishes, so its running time would be O(min(V³, n³)).

ShreevatsaR