Already finished my application which multiplies CRS matrix and vector (SpMV) and the only thing to do now is to count FLOPS my application did. In my opinion it's really hard to estimate number of floating point operation in case of sparse matrix - vector multiplication, because the number of multiplies in one row is really "jumpy" or fluent.
I only tried to measure time using "cudaprof" ( available in ./CUDA/bin directory) - it works fine.
Any sugestions and instruction pastes appreciated !