Various CUDA demos in the CUDA SDK refer to "scattered write". What is this scattered write, why is it so great, and what does it stand in contrast to?
I'm going to use CUDA's terminology here.
Scattered write means that each CUDA thread writes to an arbitrary address, i.e. the threads of a warp will not, in general, write to consecutive memory locations. It contrasts with frame-buffer writes, which are 2D-coherent and can be coalesced by the hardware; those were the only writes available to GPUs until not so long ago.
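As a minimal sketch of the pattern (the kernel and parameter names here are illustrative, not taken from the SDK samples), each thread computes or looks up its own destination index and stores there:

```cuda
// Scattered write: the destination address depends on per-thread data,
// so neighbouring threads in a warp generally do not write to
// consecutive locations.
__global__ void scatterKernel(float *out, const int *indices,
                              const float *values, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        out[indices[tid]] = values[tid];  // arbitrary destination per thread
    }
}
```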
Scattered writes are the opposite operation of a gather read, which reads data from scattered locations and gathers it all before the warp of threads executes in SIMD fashion on the gathered data. Gather reads, however, have long been available on GPUs through arbitrary texture fetches.
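For contrast, a gather read looks like the following sketch (again with illustrative names): the arbitrary addressing happens on the load side, while the result is written out contiguously.

```cuda
// Gather read: each thread loads from an arbitrary source location,
// operates on the value, and writes the result to a consecutive
// (coalesced) destination.
__global__ void gatherKernel(float *out, const float *in,
                             const int *indices, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        float v = in[indices[tid]];  // read from a scattered location
        out[tid] = v * 2.0f;         // coalesced write of the result
    }
}
```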
Scattered write is great because it allows you to write to any memory address. Previous shader implementations were usually limited in which memory addresses a given shader program could write to.
"Whereas fragment programs in graphics APIs are limited to outputting 32 floats (RGBA * 8 render targets) at a pre-specified location, CUDA supports scattered writes - i.e. an unlimited number of stores to any address. This enables many new algorithms that were not possible using graphics APIS to perform efficiently using CUDA"
From the CUDA FAQ:
Basically it makes CUDA programs easier to write because they aren't as limited in where they can store results. Bear in mind that one of the keys to getting good performance on a GPU is exploiting memory locality: overusing scattered writes to global memory will most likely hurt your performance.
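To make the performance point concrete, here is a hedged sketch (my own illustration, not from the FAQ) contrasting a coalesced write with a scattered one. Both kernels are correct, but the scattered version typically runs slower because the warp's stores cannot be combined into a few wide memory transactions.

```cuda
// Coalesced write: thread i writes element i, so a warp's stores fall
// into consecutive addresses and can be merged by the hardware.
__global__ void coalescedWrite(float *out, const float *in, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        out[tid] = in[tid] + 1.0f;
}

// Scattered write: thread i writes wherever perm[i] points, so the
// warp's stores may hit many distinct memory segments.
__global__ void scatteredWrite(float *out, const float *in,
                               const int *perm, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        out[perm[tid]] = in[tid] + 1.0f;
}
```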