What's my best bet for computing the dot product of a vector x with a large number of vectors y_i, where x and each y_i have length 10k or so?

  1. Shove the y's in a matrix and use an optimized s/dgemv routine?
  2. Or maybe try handcoding an SSE2 solution (I don't have SSE3, according to cpuinfo).

I'm just looking for general guidance here, so any suggestions will be useful.
And yes, I do need the performance. Thanks for any light.

+3  A: 

I think GPUs are specifically designed to perform operations like this quickly (among others). So you could probably make use of the DirectX or OpenGL libraries to perform the vector operations, for example D3DXVec2Dot. This would also save you CPU time.

Patrick Gryciuk
D3DXVec2Dot does not use the GPU. Besides that, you only see a speedup from GPGPU programs when you transform very large sets of data, or run very 'expensive' programs on the GPU. This is due to the cost of setting up the GPU to do the work and then reading back the result. Every transfer of data to or from the GPU is a very costly operation.
Christopher
It is hard to beat DirectX's dot product if floats are good enough.
R Ubben
Current CPUs are also quite capable of doing this kind of processing.
Jasper Bekkers
A: 

Handcoding an SSE2 solution is not very difficult and will bring a nice speedup over a pure C routine. How much it gains over a BLAS routine is something you will have to measure yourself.

The greatest speedup comes from laying out the data so that you can exploit data parallelism and alignment.
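
As a rough illustration of the hand-coded route, here is a minimal sketch of a double-precision dot product using SSE2 intrinsics. The function name dot_sse2 is made up for this example, and the choice of two accumulators and unaligned loads is an assumption, not a tuned design. It avoids the SSE3 horizontal add, since the question says SSE3 is not available.

```c
#include <assert.h>
#include <stddef.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: dot product of two double arrays of length n. */
double dot_sse2(const double *x, const double *y, size_t n)
{
    assert(x != NULL && y != NULL);

    /* Two independent accumulators so consecutive adds do not
       depend on each other. Each register holds two doubles. */
    __m128d acc0 = _mm_setzero_pd();
    __m128d acc1 = _mm_setzero_pd();

    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        /* Unaligned loads work for any pointer. If the data is known
           to be 16-byte aligned, _mm_load_pd can be used instead. */
        acc0 = _mm_add_pd(acc0, _mm_mul_pd(_mm_loadu_pd(x + i),
                                           _mm_loadu_pd(y + i)));
        acc1 = _mm_add_pd(acc1, _mm_mul_pd(_mm_loadu_pd(x + i + 2),
                                           _mm_loadu_pd(y + i + 2)));
    }

    /* Horizontal sum of the two lanes using only SSE2 instructions. */
    __m128d acc = _mm_add_pd(acc0, acc1);
    __m128d hi  = _mm_unpackhi_pd(acc, acc);
    double sum  = _mm_cvtsd_f64(_mm_add_sd(acc, hi));

    /* Scalar loop for the remaining 0 to 3 elements. */
    for (; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}
```

Calling dot_sse2(x, y_i, n) once per y_i streams x through the cache again for every vector, so whether this beats an optimized gemv still has to be measured.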

Christopher
A: 

Alternatives for optimised BLAS routines:

  • If you use the Intel compilers, you may have access to Intel MKL
  • For other compilers, ATLAS usually provides nice performance numbers
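
Whichever BLAS you pick, option 1 from the question boils down to storing the y_i as the rows of a matrix Y and computing Y·x with one gemv call. The sketch below is a plain C reference of that computation, useful for checking results. The name dot_many and the row-major layout are assumptions for illustration. The comment at the end shows the corresponding standard CBLAS call.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference routine: out[i] = dot(row i of Y, x).
   Y stores m vectors of length n as rows (row-major), with
   lda doubles between the starts of consecutive rows. */
void dot_many(size_t m, size_t n, const double *Y, size_t lda,
              const double *x, double *out)
{
    assert(lda >= n);
    for (size_t i = 0; i < m; ++i) {
        const double *row = Y + i * lda;
        double sum = 0.0;
        for (size_t j = 0; j < n; ++j)
            sum += row[j] * x[j];
        out[i] = sum;
    }
}

/* With a CBLAS implementation linked in (m, n, lda passed as int),
   the same result comes from:
   cblas_dgemv(CblasRowMajor, CblasNoTrans, m, n,
               1.0, Y, lda, x, 1, 0.0, out, 1);
*/
```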
Kjetil Jorgensen
A: 

I use GotoBLAS. It provides high-performance kernel routines, and I find it many times better than MKL and the reference BLAS.

vitaly
There are licensing problems with GotoBLAS.
Alexandre C.