views:

823

answers:

3

I'm considering porting a large chunk of processing to the GPU using a GLSL shader. One of the immediate problems I stumbled across is that in one of the steps the algorithm needs to maintain a list of elements, sort them, and take the few largest ones (how many is data-dependent). On the CPU this is simply done with an STL vector and qsort(), but in GLSL I don't have such facilities. Is there a way to deal with this deficiency?
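For reference, the CPU-side step is roughly this (a sketch; std::sort stands in for qsort(), and the float element type and the name top_k are placeholders for illustration):

```cpp
#include <algorithm>
#include <functional>
#include <vector>

// Sort descending and keep the k largest values.
// Element type and k are placeholders; the real code sorts with qsort().
std::vector<float> top_k(std::vector<float> values, std::size_t k) {
    std::sort(values.begin(), values.end(), std::greater<float>());
    if (values.size() > k) values.resize(k);
    return values;
}
```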

+5  A: 

Have you seen this article? http://http.developer.nvidia.com/GPUGems2/gpugems2_chapter46.html

I wasn't sure whether you were looking for a Quicksort algorithm or just a quick sorting algorithm. The algorithm in the article uses merge sort...

Björn
Yeah, I think MergeSort makes much more sense to run on a SIMD platform (due to memory locality) than QuickSort.
Mehrdad Afshari
I was rather looking for a complete sort within one pass because the sorting is only one step in my algorithm which should run for every fragment.
shoosh
Very good answer. The algorithms in the article are good. Bitonic sorter FTW :-)
ypnos
+7  A: 

Disclosure: I really don't know GLSL -- I've been doing GPGPU programming with the AMD Stream SDK, which uses a different programming language.

From your comment on Björn's answer, I gather that you are not interested in using the GPU to sort a huge database -- like creating a reverse phone book or whatever -- but that instead you have a small dataset and each fragment has its own dataset to sort. More like trying to do median pixel filtering?

I can only say in general:

For small datasets, the sort algorithm really doesn't matter. While people have spent careers worrying about which is the best sort algorithm for very large databases, for small N it really doesn't matter whether you use quicksort, heap sort, radix sort, shell sort, optimized bubble sort, unoptimized bubble sort, etc. At least it doesn't matter much on a CPU.

GPUs are SIMD devices, so they like to have each kernel executing the same operations in lock step. Calculations are cheap but branches are expensive, and data-dependent branches where each kernel branches a different way are very, very, very expensive.

So if each kernel has its own small dataset to sort, and the number of elements to sort is data-dependent and could differ between kernels, you're probably better off picking a maximum size (if you can), padding the arrays with infinity or some large sentinel value, and having each kernel perform the exact same sort: an unoptimized, branchless bubble sort, something like this:

Pseudocode (since I don't know GLSL), sorting 9 points:

// Branchless compare-and-swap: afterwards a <= b.
#define TwoSort(a, b) { float tmp = min(a, b); b = max(a, b); a = tmp; }
// Bubble sort with fixed trip counts: every pass bubbles the largest
// remaining element of A[0..8] to the end.
for (int n = 8; n > 0; --n) {
  for (int i = 0; i < n; ++i) {
    TwoSort (A[i], A[i+1]);
  }
}
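A testable C++ model of the same idea, combining the padding and the fixed-size branchless sort (the maximum of 9 elements and the name pad_and_sort are assumptions for illustration):

```cpp
#include <algorithm>
#include <array>
#include <limits>
#include <vector>

// Model of the GPU approach: every "kernel" sorts the same fixed-size
// array, with unused slots padded by +infinity so the real values sort
// to the front and every kernel runs identical iterations.
constexpr std::size_t kMaxN = 9;  // assumed per-fragment maximum

std::array<float, kMaxN> pad_and_sort(const std::vector<float>& values) {
    std::array<float, kMaxN> a;
    a.fill(std::numeric_limits<float>::infinity());
    std::copy(values.begin(), values.end(), a.begin());
    // Unoptimized bubble sort with fixed trip counts (no data-dependent
    // branches; the compare-and-swap is just min/max).
    for (std::size_t n = kMaxN - 1; n > 0; --n) {
        for (std::size_t i = 0; i < n; ++i) {
            float lo = std::min(a[i], a[i + 1]);
            float hi = std::max(a[i], a[i + 1]);
            a[i] = lo;
            a[i + 1] = hi;
        }
    }
    return a;
}
```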
Die in Sente
Very nice. This is exactly what I was looking for. Do you have any references for the disadvantages of data-dependent branches?
shoosh
I don't have any references off the top of my head. BTW, another reason quicksort won't work on GPUs is that they don't support recursion.
Die in Sente
+1  A: 

I haven't got any knowledge about GPU programming.

I'd use heapsort rather than quicksort, because you said you only need to look at the top few values. The heap can be constructed in O(n) time, and extracting each top value takes O(log n). Therefore, if the number of values you need is significantly smaller than the total number of elements, you could gain some performance.
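On the CPU side, that idea looks roughly like this (a sketch using the standard library's heap functions; the float element type and the name largest_k are placeholders):

```cpp
#include <algorithm>
#include <vector>

// Build a max-heap in O(n), then pop only the k largest values,
// each pop costing O(log n) -- cheaper than a full sort when k << n.
std::vector<float> largest_k(std::vector<float> values, std::size_t k) {
    std::make_heap(values.begin(), values.end());    // O(n)
    std::vector<float> out;
    for (std::size_t i = 0; i < k && !values.empty(); ++i) {
        std::pop_heap(values.begin(), values.end()); // max moves to back
        out.push_back(values.back());
        values.pop_back();
    }
    return out;
}
```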

Georg