views: 140
answers: 3

I have a large table of N items with M (M >= 3) distinct properties per item. From this table I have to remove all items for which the same table contains an item that scores equal or better on all properties.

I have an algorithm (Python) that solves it already, but it is output-sensitive and has a worst case of approximately (n^2 + n)/2 comparisons, i.e. O(n^2), when no items are removed in the process. This is far too slow for my project (where datasets of 100,000 items with 8 properties per item are not uncommon), so I require something close to O(m*n log n) worst case, but I do not know whether this problem can be solved that fast.

Example problem case and its solution:

  [higher value = better]
      Singing  Dancing  Acting
  A        10       20      10
  B        10       20      30
  C        30       20      10
  D        30       10      30
  E        10       30      20
  F        30       10      20
  G        20       30      10

Dismiss every candidate for whom another candidate performs equal or better in all disciplines.

Solution:
- A is dismissed because B,C,E,G perform equal or better in all disciplines.
- F is dismissed because D performs equal or better in all disciplines.
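
For reference, a minimal sketch of the kind of quadratic pairwise filter described above (illustrative code, not the asker's actual implementation), run against this example:

    # For each candidate, scan all the others and dismiss it as soon as a
    # dominator is found. Worst case is O(n^2 * m) comparisons when nothing
    # can be removed.
    candidates = {
        "A": (10, 20, 10), "B": (10, 20, 30), "C": (30, 20, 10), "D": (30, 10, 30),
        "E": (10, 30, 20), "F": (30, 10, 20), "G": (20, 30, 10),
    }

    def dominates(p, q):
        """True if p scores equal or better than q on every property."""
        return all(a >= b for a, b in zip(p, q))

    survivors = {
        name for name, scores in candidates.items()
        if not any(dominates(other, scores)
                   for other_name, other in candidates.items() if other_name != name)
    }
    print(sorted(survivors))  # ['B', 'C', 'D', 'E', 'G']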

Does there exist an algorithm that solves this problem efficiently, and what is it?

+2  A: 

A general answer is to arrange the records into a tree and keep a note at each node of the maximum value in each column over the records lying beneath that node. Then, for each record, chase it down the tree from the top until you know whether it is dominated or not, using the notes at each node to skip entire subtrees where possible. (Unfortunately you may have to search both children of a node.) When you remove a record as dominated you may be able to update the annotations in the nodes above it; since this need not involve rebalancing the tree, it should be cheap. You might hope to at least gain a speedup over the original code. If my memory of multi-dimensional search is correct, you could hope to go from N^2 to N^(2-f), where f becomes small as the number of dimensions increases.

One way to create such a tree is to repeatedly split the groups of records at the median of one dimension, cycling through the dimensions with each tree level (this is essentially a kd-tree). If you use a quicksort-like median search for each such split, you might expect the tree construction to take O(n log n) time.

One tuning factor is not to split all the way down, but to stop splitting once the group size falls below some fixed threshold.
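
A rough sketch of this scheme in Python (illustrative names and constants, not a tuned implementation; a real version would use a linear-time median selection rather than sorting at every split):

    LEAF_SIZE = 4  # stop splitting at this group size; small here so the toy example builds a tree

    class Node:
        def __init__(self, points, axis, split=None, left=None, right=None):
            self.points = points                      # record list (used directly at leaves)
            self.axis = axis                          # splitting dimension at this level
            self.split = split                        # None marks a leaf
            self.left = left
            self.right = right
            # note: the component-wise maximum over everything below this node
            self.maxima = tuple(max(p[i] for p in points) for i in range(len(points[0])))

    def build(points, depth=0):
        axis = depth % len(points[0])                 # cycle through the dimensions
        if len(points) <= LEAF_SIZE:
            return Node(points, axis)
        pts = sorted(points, key=lambda p: p[axis])   # median split on the current axis
        mid = len(pts) // 2
        return Node(pts, axis, split=pts[mid][axis],
                    left=build(pts[:mid], depth + 1),
                    right=build(pts[mid:], depth + 1))

    def dominated(node, p):
        """True if some record other than p scores >= p in every dimension."""
        # prune: if even the per-dimension maxima down here do not cover p,
        # nothing in this subtree can dominate it
        if any(m < x for m, x in zip(node.maxima, p)):
            return False
        if node.split is None:                        # leaf: brute-force the small group
            return any(q is not p and all(a >= b for a, b in zip(q, p))
                       for q in node.points)
        # a dominating record may sit on either side of the split, so try both
        return dominated(node.right, p) or dominated(node.left, p)

    records = [(10, 20, 10), (10, 20, 30), (30, 20, 10), (30, 10, 30),
               (10, 30, 20), (30, 10, 20), (20, 30, 10)]
    tree = build(records)
    print([p for p in records if not dominated(tree, p)])
    # -> [(10, 20, 30), (30, 20, 10), (30, 10, 30), (10, 30, 20), (20, 30, 10)]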

mcdowella
After toying with an implementation using kd-trees, this seems to offer quite a solid conceptual improvement over what I had. +1.
Entity
+2  A: 

It looks like this paper addresses your problem: http://flame.cs.dal.ca/~acosgaya/Research/skyline/on%20finding%20the%20maxima%20of%20a%20set%20of%20a%20vectors.pdf
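
For intuition, the two-dimensional case reduces to a single sort and scan: order the points by the first coordinate, descending, and a point is dominated exactly when some earlier point has an equal-or-better second coordinate. A minimal sketch (only the 2-D special case, not the paper's general d-dimensional algorithm; exact duplicates are collapsed to a single survivor):

    def maxima_2d(points):
        # sort by x descending, break ties by y descending, then keep a running
        # maximum of y: anything at or below it is dominated by an earlier point
        best_y = float("-inf")
        survivors = []
        for x, y in sorted(points, key=lambda p: (-p[0], -p[1])):
            if y > best_y:
                survivors.append((x, y))
                best_y = y
        return survivors

    print(maxima_2d([(0, 1), (1, 0), (1, 1), (0, 0)]))  # [(1, 1)]
    print(maxima_2d([(0, 1), (1, 0)]))                  # both are maxima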

abeacham
I think algorithm 3.1 in this paper is busted. If you give it the input [(0, 1), (1, 0)] it returns only [(0, 1)]. It should of course retain both. The recursive call to maxima() in step 3 is the problem. It removes vectors that correspond to maxima if you look at all *d* dimensions. The correctness proof doesn't mention that call to maxima() at all. :-P
Jason Orendorff
Oh, I think it's just a typo. The theorem says, "The vector v_i is a maximal element of V if and only if v_i* ∈ T_n." But the *n* at the end should be an *i*.
Jason Orendorff
So does this work, or doesn't it? I have to admit I have some trouble wrapping my head around this sort of paper.
Entity
Yeah, me too. It's not the easiest reading. It'll work, but you might have to spend a day or two on it. The implementation will not be simple. Performance is O(n log n) if there are just 2 or 3 dimensions; O(n (log n)^(d-2)) for *d* dimensions.
Jason Orendorff
I just realised this algorithm is not really equipped to deal with a high number of dimensions. Someone check my math here: the d-dimensional performance tells me that for this algorithm's worst case to beat O(n^2), d must be less than (2*log(log2(n)) + log(n))/log(log2(n)), because that is the point where n = log2(n)^(d-2). Now, I have a set of 11 dimensions, which suggests that for this algorithm to beat an O(n^2) worst case, the set would have to contain more than 2362955301000000 entries, because log2(2362955301000000)^(11-2) = 2362955301000000. Right?
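
A quick numeric check of that break-even point (hypothetical helper, just evaluating the inequality above):

    from math import log2

    def beats_quadratic(n, d):
        # n * (log2 n)**(d-2) grows slower than n**2 exactly when (log2 n)**(d-2) < n
        return log2(n) ** (d - 2) < n

    print(beats_quadratic(10**16, 11))  # True: above the ~2.36e15 break-even for d = 11
    print(beats_quadratic(10**5, 11))   # False: a 100,000-item dataset is far below it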
Entity
A: 

What you have here is a partially ordered set, where A <= B if every trait of A has a value less than or equal to the corresponding trait of B, and A >= B if every trait of A has a value greater than or equal to the corresponding trait of B. It is possible that !(A<=B || A>=B); in this case A and B are "incomparable". Your problem is to eliminate from the set those elements which are dominated by other elements, i.e. remove every A such that there exists B in the set with A < B.

In the worst case all the elements are incomparable, i.e. you can't eliminate anything. Now let's look at the incomparability relationship. Suppose A !~ B (incomparability) and B !~ C. Is it possible that A and C are still comparable? Yes! For example A could have traits {1,2,3}, B {2,1,5} and C {2,3,4}. This means that incomparability is not "transitive" and therefore you are kind of out of luck; in general to check that all the elements are incomparable IS going to take O(N^2) time, as far as I understand.
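
A tiny check of that example (hypothetical helper name):

    def comparable(a, b):
        # two items are comparable when one scores equal or better on every trait
        return all(x <= y for x, y in zip(a, b)) or all(x >= y for x, y in zip(a, b))

    A, B, C = (1, 2, 3), (2, 1, 5), (2, 3, 4)
    print(comparable(A, B), comparable(B, C), comparable(A, C))  # False False True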

antti.huima