Hi!
I need to optimize some matrix multiplication code in C, and I'm doing it with SSE vector instructions. I also found that SSE4.1 already has an instruction for the dot product, dpps.
The problem is that the machine this software is supposed to run on has an old version of gcc installed (4.1.2), which has no support for SSE4.1, even though its processor supports it (don't ask me why the gcc version is older than the processor...). So I cannot use the _mm_dp_ps intrinsic.
I have been playing around a bit with adding some assembler code to my C code. The problem is that I have never used assembler before, so it's really confusing. Also, is it more efficient to write all the code that deals with vector instructions in assembler?
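To make the question concrete, here is a minimal sketch of the kind of inline assembler wrapper I mean. The function name dot4_dpps is made up for illustration. It assumes an x86 CPU with SSE4.1 and an assembler (binutils) that recognizes the dpps mnemonic, which I have not verified on the target machine. Only the dpps itself is written in assembler; the loads and the final store use plain SSE1 intrinsics from xmmintrin.h.

```c
#include <xmmintrin.h>  /* SSE1 intrinsics only; no SSE4.1 header needed */

/* Hypothetical helper: dot product of two 4-float vectors via dpps.
 * Assumes the CPU supports SSE4.1 and the assembler knows "dpps". */
static inline float dot4_dpps(const float *a, const float *b)
{
    __m128 x = _mm_loadu_ps(a);   /* unaligned load of a[0..3] */
    __m128 y = _mm_loadu_ps(b);   /* unaligned load of b[0..3] */
    float result;

    /* AT&T operand order: immediate, source, destination.
     * Mask 0xF1: high nibble 0xF multiplies all four lanes,
     * low nibble 0x1 writes the sum to lane 0 only.
     * "+x" = x is read and written in an SSE register; "x" = y in one. */
    __asm__ ("dpps $0xF1, %1, %0" : "+x"(x) : "x"(y));

    _mm_store_ss(&result, x);     /* extract lane 0 */
    return result;
}
```

With a = {1, 2, 3, 4} and b = {5, 6, 7, 8}, dot4_dpps(a, b) should return 70. Letting the compiler pick the registers through the "x" constraints means only one instruction has to be hand-written.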
So I am asking whether there is any other way to use the dpps instruction, and whether it is even worth using.
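For comparison, the alternative I would otherwise fall back on is a dot product built from SSE1 intrinsics only, which gcc 4.1.2 does support. This is just a sketch; the name dot4_sse1 is made up, and I have not measured whether it is slower or faster than dpps.

```c
#include <xmmintrin.h>  /* SSE1 intrinsics */

/* Hypothetical helper: dot product of two 4-float vectors, SSE1 only. */
static inline float dot4_sse1(const float *a, const float *b)
{
    /* p = {a0*b0, a1*b1, a2*b2, a3*b3} */
    __m128 p = _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));

    /* movehl copies the upper two lanes down: {p2, p3, p2, p3}.
     * After the add, lane 0 = p0+p2 and lane 1 = p1+p3. */
    __m128 s = _mm_add_ps(p, _mm_movehl_ps(p, p));

    /* Broadcast lane 1 and add it to lane 0: (p0+p2) + (p1+p3). */
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, _MM_SHUFFLE(1, 1, 1, 1)));

    float result;
    _mm_store_ss(&result, s);     /* extract lane 0 */
    return result;
}
```

With a = {1, 2, 3, 4} and b = {5, 6, 7, 8}, dot4_sse1(a, b) should also return 70, so the two versions can be swapped and timed inside the same matrix multiplication loop.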