You've raised an interesting point: that article was written back in 2008, and the hardware has moved on since then.
On the original CUDA devices (Compute Capability 1.0 and 1.1) it was critical to access global memory in a "coalesced" pattern. This meant that if thread 0 of a half-warp accesses four bytes at byte address A, then threads 1-15 must access addresses A+4 through A+60 respectively (*1).
The code in the article therefore has threads 0-15 read from contiguous, increasing addresses, store the values into shared memory, then read them back from shared memory in reverse order and write them to contiguous, increasing addresses. As a result, both the global-memory read and the global-memory write conform to the strict coalescing requirements.
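I don't have the article's exact code to hand, but a minimal sketch of that staging pattern might look like the following (the kernel name, the float element type, and the assumption that n is a multiple of BLOCK_SIZE are mine for illustration, not taken from the article):

```
#define BLOCK_SIZE 256

// Reverse an array of n floats, staging each block's chunk through shared
// memory so that both the global read and the global write are coalesced.
// Assumes n is a multiple of BLOCK_SIZE.
__global__ void reverseStaged(const float *in, float *out, int n)
{
    __shared__ float tile[BLOCK_SIZE];

    int tid = threadIdx.x;

    // Coalesced read: consecutive threads load consecutive addresses.
    tile[tid] = in[blockIdx.x * BLOCK_SIZE + tid];

    __syncthreads();

    // The reversal happens inside shared memory, where the coalescing rules
    // don't apply; the write back out is again to consecutive addresses.
    out[n - (blockIdx.x + 1) * BLOCK_SIZE + tid] = tile[BLOCK_SIZE - 1 - tid];
}
```

You would launch it with one thread per element, e.g. `reverseStaged<<<n / BLOCK_SIZE, BLOCK_SIZE>>>(d_in, d_out, n);`.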
Since the article was written, however, newer CUDA devices have been released (Compute Capability 1.2 and 1.3, and now 2.0 and 2.1) which perform some degree of automatic coalescing. Specifically, in this case it would be perfectly acceptable to read in one order and write in the reversed order: the hardware recognises that the write is a permutation of a coalesced write and simply reorders it for you.
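On those devices, the whole thing collapses to something like this sketch (again, names are illustrative):

```
// Direct reversal: the read is coalesced as before, and although the write
// is in reverse order, on Compute Capability 1.2+ the accesses still fall
// within one memory segment and are serviced as a single transaction.
__global__ void reverseDirect(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[n - 1 - i] = in[i];  // permuted write: the hardware coalesces it
}
```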
So in summary, on a device with Compute Capability 1.2 or higher you don't need to stage via shared memory for this particular problem. Shared memory is still invaluable in many other problems, of course!
For more information you should check out the section on memory performance in the CUDA Best Practices Guide (available on the CUDA developer site) and also the advanced CUDA training sessions (for example, this recording).
*1 : Note also that address A must be aligned to a 64-byte boundary.