Hi, I'm using Java on a large amount of data.

[I'll try to simplify the problem as much as possible.]

I have a small class (Element) containing an int KEY and a double WEIGHT (with getters & setters).

I read a lot of these objects from a file and I need to get the M best (highest-weight) objects.

Currently I'm using a PriorityQueue with a Comparator written to compare two Elements, and it works, but it's too slow.

Do you know (I know you do) any faster way to do that?

Thank you

+1  A: 

If M is suitably small, then sorting all elements may waste a lot of computing time. You could put only the first M objects in a priority queue (e.g. a min-heap, with the smallest element on top), and then iterate over the rest of the elements: every time an element is larger than the top of the heap, remove the top and push the new element into the heap.
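A rough sketch of that bounded-heap approach (the Element class and its getter names are assumed from the question, not taken from the original post):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Sketch only: Element's field and getter names are assumed from the question.
class Element {
    private final int key;
    private final double weight;
    Element(int key, double weight) { this.key = key; this.weight = weight; }
    int getKey() { return key; }
    double getWeight() { return weight; }
}

class TopM {
    // Keep at most m elements in a min-heap keyed by weight. A single pass
    // over the input replaces the current minimum whenever a heavier element
    // shows up, so the heap never grows beyond m entries.
    static List<Element> topM(Iterable<Element> elements, int m) {
        PriorityQueue<Element> heap =
                new PriorityQueue<>(m, Comparator.comparingDouble(Element::getWeight));
        for (Element e : elements) {
            if (heap.size() < m) {
                heap.offer(e);
            } else if (e.getWeight() > heap.peek().getWeight()) {
                heap.poll();   // discard the lightest of the kept m
                heap.offer(e);
            }
        }
        return new ArrayList<>(heap); // unordered; sort descending if a ranking is needed
    }
}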

Alternatively, you could iterate over the whole array once to find a statistical threshold value above which you can be very confident there are more than M objects (this requires some assumptions about the values, e.g. that they are roughly normally distributed). You can then limit the sorting to the elements above that threshold.
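A sketch of that thresholding variant (hedged: it assumes roughly normally distributed weights, reuses the Element class sketched above, and the z factor is just a tunable guess):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class ThresholdTopM {
    // Pick z so that clearly more than m weights are expected above mean + z * stddev.
    static List<Element> topM(List<Element> all, int m, double z) {
        double sum = 0, sumSq = 0;
        for (Element e : all) {
            double w = e.getWeight();
            sum += w;
            sumSq += w * w;
        }
        double mean = sum / all.size();
        double std = Math.sqrt(Math.max(0, sumSq / all.size() - mean * mean));
        double threshold = mean + z * std;

        // Collect only the candidates above the threshold, then sort just those.
        List<Element> candidates = new ArrayList<>();
        for (Element e : all) {
            if (e.getWeight() >= threshold) {
                candidates.add(e);
            }
        }
        if (candidates.size() < m) {
            candidates = new ArrayList<>(all); // the guess was too aggressive, fall back
        }
        candidates.sort(Comparator.comparingDouble(Element::getWeight).reversed());
        return candidates.subList(0, m);
    }
}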

__roland__
+2  A: 
erickson
Good suggestion. The complexity of this algorithm is O(n log m) for getting the top-m of n total elements.
Apocalisp
A: 

@Tnay: You have a point about not performing a comparison. Unfortunately, your example code still performs one. This solves the problem:

public int compare(ListElement i, ListElement j) {
    return i.getValue() - j.getValue();
}

In addition, neither yours nor BigG's compare method is strictly correct, since they never return 0. This can be a problem with some sorting algorithms, and it is a very tricky bug, since it will only show up if you switch to another implementation.

From the Java docs:

The implementor must ensure that sgn(compare(x, y)) == -sgn(compare(y, x)) for all x and y.

This may or may not give a significant constant-factor speed-up. If you combine it with erickson's solution, it will probably be hard to do better (depending on the size of M). If M is very large, the most efficient solution is probably to sort all the elements in an array with Java's built-in sort and take the M elements at one end.
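For the large-M case, a sketch of that sort-everything approach (again assuming the Element class and getter from the question):

import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class TopMBySort {
    // Sort the whole array by weight, descending, and take the first m entries.
    static List<Element> topM(List<Element> all, int m) {
        Element[] arr = all.toArray(new Element[0]);
        Arrays.sort(arr, Comparator.comparingDouble(Element::getWeight).reversed());
        return Arrays.asList(arr).subList(0, Math.min(m, arr.length));
    }
}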

Jørgen Fogh
And, of course, this comparator is good provided it is guaranteed that the difference between i and j never exceeds Integer.MAX_VALUE.
Software Monkey
In general, subtraction is a poor choice for implementing comparison on floating-point values (the question clearly states that the weight is a `double`). If the difference is less than one, it will incorrectly be coerced to zero when casting the result to an `int`.
erickson
@Software Monkey: True. @erickson: I hadn't noticed that we were using floating-point values.
Jørgen Fogh
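For reference, a comparator along the lines of these comments, written against the question's Element class (a sketch; Double.compare avoids both the overflow and the truncation-to-zero problems discussed above):

public int compare(Element i, Element j) {
    return Double.compare(i.getWeight(), j.getWeight());
}

With Java 8 or later, Comparator.comparingDouble(Element::getWeight) expresses the same thing.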
+2  A: 

In addition to the suggested "peek at the top of the heap" algorithm, which gives you O(n log m) complexity for getting the top-m of n items, here are two more solutions.

Solution 1: Use a Fibonacci heap.

The JDK's PriorityQueue implementation is a balanced binary heap. You should be able to squeeze more performance out of a Fibonacci heap implementation. It has amortized constant-time insert, whereas inserting into a binary heap takes Ω(log n) time in the size of the heap. If you do that for every element, you're at Ω(n log n). Finding the top-m of n items with a Fibonacci heap has complexity O(n + m log n). Combine this with the suggestion to only ever insert m elements into the heap, and you get O(n + m log m), which is as close to linear time as you're going to get.

Solution 2: Traverse the list M times.

You should be able to get the kth-largest element in a set in O(n) expected time. Simply read everything into a list and do the following:

kthLargest(k, xs)
  Pick a random pivot element p from the list
    (the first one will do if your list is already random).
  Go over the set once and group it into two lists.
     Left: smaller than p. 
     Right: Larger or equal to p.
  If the Right list is shorter than k, return kthLargest(k - right.size, Left)
  If the Right list is longer than k, return kthLargest(k, right)
  Otherwise, return p.

That gives you O(n) expected time per selection. Running it m times, you can get the top-m objects in your set in O(nm) time, which is less than n log n whenever m is smaller than log n. For example, getting the top 10 out of a million items should take about half as long as using a binary heap priority queue, all other things being equal.
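A runnable sketch of the selection step (hedged: it reuses the Element class assumed earlier and, unlike the pseudocode, partitions three ways, below/equal/above the pivot, so that ties with the pivot cannot cause endless recursion):

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

class KthLargest {
    private static final Random RND = new Random();

    // Expected-linear-time selection of the k-th largest element by weight,
    // in the spirit of the pseudocode above. Requires 1 <= k <= xs.size().
    static Element kthLargest(int k, List<Element> xs) {
        Element pivot = xs.get(RND.nextInt(xs.size()));
        List<Element> below = new ArrayList<>();
        List<Element> equal = new ArrayList<>();
        List<Element> above = new ArrayList<>();
        for (Element e : xs) {
            int cmp = Double.compare(e.getWeight(), pivot.getWeight());
            if (cmp > 0) above.add(e);
            else if (cmp < 0) below.add(e);
            else equal.add(e);
        }
        if (k <= above.size()) {
            return kthLargest(k, above);            // answer is strictly above the pivot
        }
        if (k <= above.size() + equal.size()) {
            return pivot;                           // the pivot (or a tie with it) is the answer
        }
        return kthLargest(k - above.size() - equal.size(), below);
    }
}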

Apocalisp
Your claim about the speed difference factor between a Fibonacci heap and a binary heap assumes a binary logarithm and no difference in constant factors, i.e. it's probably untrue.
Michael Borgwardt
Assume a spherical cow in a vacuum...
Apocalisp