views:

91

answers:

1

Array to sort has approximately one million strings, where every string can have length up to one million characters.

I am looking for any implementation of sorting algorithm for GPU.

I have a block of data with size approximately 1MB and I need to construct suffix array. Now you can see how it is possible to have one million strings inside really small amount of memory.

+1  A: 

The state of the art in GPU sorting isn't particularly encouraging.

For sorting 32-bit integers the following paper from 2009 (with 2 authors who are researchers at Nvidia) only claims 23% increase for the best CUDA sort on GTX280 compared to the best CPU sort on a 4 core Yorkfield.

http://www.mgarland.org/files/papers/gpusort-ipdps09.pdf

This used a radix sort on the GPU, and merge sort on CPU. You'd need a comparison-based sort in order to construct a suffix array, so instead of GPU radix sort the best of those in the paper would be GPU merge sort, which achieved about half the speed of GPU radix sort (with 1 million keys) - ie about 40% slower than the CPU merge sort.

Adding variable length keys seems likely to cause threads in a warp will get out of sync on a GPU, so would reduce the performance on GPU more than CPU.

Overall if your purpose is to build an efficient system, I'd recommend that you use a CPU implementation for this problem because it will be faster and easier to write.

But, if your purpose is to experiment or just to learn about GPU, then you can find the CUDA implementation of merge sort from the paper in the CUDA SDK:

http://developer.download.nvidia.com/compute/cuda/sdk/website/Data-Parallel_Algorithms.html

RD1
Isn't the whole point of CUDA to use a processor that is idle anyway? Even if you got no speed improvement whatsoever on a GPU over a CPU, you would still have a 2X improvement over having a CPU only, provided you can effectively utilize the extra parallelism.
Robert Harvey
@Robert Harvey - most uses of CUDA don't keep the CPU busy at the same time. However recently this has become more common, and is usually called hybrid GPU/CPU. The need to copy in between the CPU and GPU memories tends to make it quite tricky to obtain good performance though. In this case, I'd expect at best you'd achieve 150% of the CPU speed, and you'd be better off investing in a system with two CPUs.
RD1
Thanks for your answer.I agree with all your notes about sorting strings on a GPU, I thought in the same way, but I had hoped that there is an algorithm which I had missed.
Kentzo