The state of the art in GPU sorting isn't particularly encouraging.
For sorting 32-bit integers, the following paper from 2009 (whose two authors are Nvidia researchers) claims only a 23% speedup for the best CUDA sort on a GTX 280 over the best CPU sort on a 4-core Yorkfield.
http://www.mgarland.org/files/papers/gpusort-ipdps09.pdf
That comparison used a radix sort on the GPU and a merge sort on the CPU. Constructing a suffix array requires a comparison-based sort, so the relevant GPU result from the paper is not radix sort but its merge sort, which ran at about half the speed of the GPU radix sort (with 1 million keys), i.e. roughly 40% slower than the CPU merge sort.
Adding variable-length keys also seems likely to make threads in a warp fall out of sync on the GPU, so it would hurt performance more on the GPU than on the CPU.
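To make the variable-length-key point concrete, here is a rough C++ sketch (mine, not from the paper) of the comparison a suffix sort has to perform. The loop length depends on how far the two suffixes agree, which is exactly the kind of data-dependent work that pushes threads in a warp out of step:

```cpp
// Hypothetical sketch, not taken from the paper: comparing two suffixes of a
// text does a data-dependent amount of work, scanning until the first mismatch.
// On a CPU that's fine; on a GPU, each thread in a warp may loop a different
// number of times for its pair of keys, so the warp serializes (diverges).
#include <cstddef>
#include <string>

// True if the suffix of `text` starting at i sorts before the one starting at j.
bool suffix_less(const std::string& text, std::size_t i, std::size_t j) {
    while (i < text.size() && j < text.size()) {
        if (text[i] != text[j])
            return text[i] < text[j];  // first mismatching character decides
        ++i;
        ++j;
    }
    return i > j;  // one suffix is a prefix of the other; the shorter sorts first
}
```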
Overall, if your goal is to build an efficient system, I'd recommend a CPU implementation for this problem: it will be both faster and easier to write.
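If you do go the CPU route, a deliberately naive suffix array build is just a comparison sort over suffix start positions. This sketch uses std::sort, with std::string::compare doing the same suffix comparison as suffix_less above; it is O(n² log n) in the worst case and only meant to show the shape of the problem, not to be fast:

```cpp
// Minimal, deliberately naive CPU suffix array build (a sketch, not tuned code).
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

std::vector<std::size_t> build_suffix_array(const std::string& text) {
    std::vector<std::size_t> sa(text.size());
    std::iota(sa.begin(), sa.end(), std::size_t{0});  // one entry per suffix start
    std::sort(sa.begin(), sa.end(), [&](std::size_t a, std::size_t b) {
        // Lexicographically compare the suffixes starting at a and b.
        return text.compare(a, std::string::npos, text, b, std::string::npos) < 0;
    });
    return sa;
}
```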
But if your purpose is to experiment or just to learn about GPU programming, you can find the CUDA implementation of merge sort from the paper in the CUDA SDK:
http://developer.download.nvidia.com/compute/cuda/sdk/website/Data-Parallel_Algorithms.html