I think the others have answered your second question. As for the first, the "Hello World" of CUDA, I don't think there is a set standard, but personally, I'd recommend a parallel adder (i.e. a programme that sums N integers).
If you look the "reduction" example in the NVIDIA SDK, the superficially simple task can be extended to demonstrate numerous CUDA considerations such as coalesced reads, memory bank conflicts and loop unrolling.
See this presentation for more info:
http://www.gpgpu.org/sc2007/SC07_CUDA_5_Optimization_Harris.pdf