Most of the documentation that comes with the CUDA toolkit & SDK downloads is about CUDA generally, not CuBLAS specifically. Start with the CUBLAS_Library_2.3.pdf
file if you're just going to use CuBLAS; you won't need to write your own CUDA kernels. If you're already using a CPU BLAS, CuBLAS shouldn't be difficult to pick up. (And if you're not, consider trying an optimized CPU BLAS before CuBLAS, since it will be easier to program.)
If you're coding on .NET, then the easiest way to use CuBLAS is probably via platform-invoke calls into cublas.dll. Be sure to keep straight which arrays are in host (CPU) memory, and which are in device (GPU) memory.
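As a rough sketch of the host/device bookkeeping (shown here in C; a .NET platform-invoke wrapper follows the same call sequence against cublas.dll), a single-precision matrix multiply through the legacy CUBLAS API of that era might look like the following. Error checking is abbreviated, and the matrix contents are left unfilled:

```c
/* Sketch: C = A * B via the legacy CUBLAS API (CUDA 2.3 era).
 * A real program should check every status code. */
#include <stdio.h>
#include <stdlib.h>
#include "cublas.h"

int main(void)
{
    const int n = 256;               /* square matrices for brevity */
    float *A, *B, *C;                /* host (CPU) memory */
    float *dA, *dB, *dC;             /* device (GPU) memory */

    A = (float *)malloc(n * n * sizeof(*A));
    B = (float *)malloc(n * n * sizeof(*B));
    C = (float *)malloc(n * n * sizeof(*C));
    /* ... fill A and B ... */

    cublasInit();

    /* Allocate device memory and copy the inputs across the PCIe bus. */
    cublasAlloc(n * n, sizeof(*dA), (void **)&dA);
    cublasAlloc(n * n, sizeof(*dB), (void **)&dB);
    cublasAlloc(n * n, sizeof(*dC), (void **)&dC);
    cublasSetMatrix(n, n, sizeof(*A), A, n, dA, n);
    cublasSetMatrix(n, n, sizeof(*B), B, n, dB, n);

    /* Same calling convention as the Fortran BLAS sgemm, but the
     * matrix arguments must point at DEVICE memory. */
    cublasSgemm('N', 'N', n, n, n, 1.0f, dA, n, dB, n, 0.0f, dC, n);
    if (cublasGetError() != CUBLAS_STATUS_SUCCESS) {
        fprintf(stderr, "sgemm failed\n");
        return 1;
    }

    /* Copy the result back to host memory. */
    cublasGetMatrix(n, n, sizeof(*C), dC, n, C, n);

    cublasFree(dA); cublasFree(dB); cublasFree(dC);
    cublasShutdown();
    free(A); free(B); free(C);
    return 0;
}
```

Note how the host pointers (A, B, C) and the device pointers (dA, dB, dC) are plain `float *` either way; mixing them up compiles fine and fails at runtime, which is why keeping them straight matters.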
Keep in mind that CUDA & CuBLAS aren't magic bullets. Performance depends on a lot of factors (especially transfers across the PCIe bus), and simply swapping CUBLAS calls for CPU-BLAS calls may not give you speedups. You may have to make more substantial changes to your own code to get performance improvements. Those other guides you mention are very useful for understanding the CUDA architecture and its bottlenecks.
EDIT: I wasn't clear about the boundary between user code and kernel code. CUBLAS is a library of pre-built, optimized CUDA kernels. If you only need BLAS functionality, you do not need to write your own kernels. Instead, just call CUBLAS functions. When performance tuning, you shouldn't need to tweak the CUBLAS kernels, but you may need to change how and when you call them, and how you use memory, so as to minimize the number of transfers across the PCI express bus.
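One common restructuring is to chain CUBLAS calls so intermediate results never leave the GPU. A sketch (legacy CUBLAS API, error checks omitted; the buffer names are illustrative) computing D = (A*B)*C with a single download at the end:

```c
#include "cublas.h"

/* dA, dB, dC, dT, dD are n x n matrices already in DEVICE memory;
 * hostD is the n x n destination in host memory. */
void chain_gemm(int n, const float *dA, const float *dB, const float *dC,
                float *dT, float *dD, float *hostD)
{
    /* T = A * B -- the result stays on the GPU. */
    cublasSgemm('N', 'N', n, n, n, 1.0f, dA, n, dB, n, 0.0f, dT, n);
    /* D = T * C -- still no PCIe traffic. */
    cublasSgemm('N', 'N', n, n, n, 1.0f, dT, n, dC, n, 0.0f, dD, n);
    /* One transfer back across the bus at the very end. */
    cublasGetMatrix(n, n, sizeof(float), dD, n, hostD, n);
}
```

Downloading T to the host between the two calls would double the bus traffic for no benefit; the same idea scales up to whole iterative algorithms that keep their working set resident on the device.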