First off, why are you using quadwords to represent values that would fit in a 16-bit format? Leaving that aside, here are a few solutions:
pshufd xmm1, xmm0, 0EEh
paddq xmm0, xmm1
movd temp, xmm0
or
movdqa xmm1, xmm0
psrldq xmm1, 8
paddq xmm0, xmm1
movd temp, xmm0
or
movhlps xmm1, xmm0
paddq xmm0, xmm1
movd temp, xmm0
Note that you don't actually need to use paddq; you can get away with one of the narrower adds if you prefer.
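For readers working from C, the first variant can be sketched with SSE2 intrinsics. This is only an illustration, not part of the original code: the function names are made up here, and it assumes (as in the question) that the total fits in 32 bits, which is also why the narrower add gives the same answer.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: add the two 64-bit lanes of v and return the
   low 32 bits of the sum (pshufd / paddq / movd). */
static uint32_t hsum_epi64_low32(__m128i v)
{
    __m128i hi  = _mm_shuffle_epi32(v, 0xEE);  /* pshufd: high qword copied into low qword */
    __m128i sum = _mm_add_epi64(v, hi);        /* paddq */
    return (uint32_t)_mm_cvtsi128_si32(sum);   /* movd */
}

/* Same thing with the narrower paddd. The results match as long as
   the values (and their sum) fit in the low doubleword of each lane. */
static uint32_t hsum_epi64_low32_paddd(__m128i v)
{
    __m128i hi  = _mm_shuffle_epi32(v, 0xEE);  /* pshufd */
    __m128i sum = _mm_add_epi32(v, hi);        /* paddd */
    return (uint32_t)_mm_cvtsi128_si32(sum);   /* movd */
}
```

For instance, passing `_mm_set_epi64x(1234, 5678)` to either function returns 6912.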
Edit: for summing four double quadwords, what you have is pretty much fine. Given that you know all the data in them fits into the low doubleword of each slot, you could try something like:
shufps xmm0, xmm2, 88h
shufps xmm4, xmm6, 88h
paddd xmm0, xmm4
movdqa xmm1, xmm0
psrlq xmm1, 32
paddd xmm0, xmm1
movhlps xmm1, xmm0
paddd xmm0, xmm1
movd temp, xmm0
which may or may not prove to be faster.
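The same idea can be sketched with C intrinsics, which makes it easier to check what each step does. Again, this is just an illustration under the assumption stated above (every value and the final total fit in 32 bits); the function name and the parameters a, b, c, d stand in for xmm0, xmm2, xmm4 and xmm6 and are not from the original code.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (pulls in the SSE ones too) */
#include <stdint.h>

/* Hypothetical helper: sum all eight 64-bit lanes of a, b, c, d and
   return the low 32 bits of the total. */
static uint32_t hsum4_epi64_low32(__m128i a, __m128i b, __m128i c, __m128i d)
{
    /* shufps 88h: pick dwords 0 and 2 (the low half of each qword)
       from the first operand, then from the second. */
    __m128i ab = _mm_castps_si128(
        _mm_shuffle_ps(_mm_castsi128_ps(a), _mm_castsi128_ps(b), 0x88));
    __m128i cd = _mm_castps_si128(
        _mm_shuffle_ps(_mm_castsi128_ps(c), _mm_castsi128_ps(d), 0x88));

    __m128i s = _mm_add_epi32(ab, cd);            /* paddd: four partial sums */

    /* psrlq 32 + paddd: fold dword 1 onto dword 0 and dword 3 onto dword 2. */
    s = _mm_add_epi32(s, _mm_srli_epi64(s, 32));

    /* movhlps + paddd: fold the high half onto the low half. */
    __m128i hi = _mm_castps_si128(
        _mm_movehl_ps(_mm_castsi128_ps(s), _mm_castsi128_ps(s)));
    s = _mm_add_epi32(s, hi);

    return (uint32_t)_mm_cvtsi128_si32(s);        /* movd */
}
```

With lanes holding 1 through 8 across the four registers, the function returns 36.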
As for EMMS, it's just another instruction. You need an emms after any code that touches the MMX registers and before any code that uses the x87 floating-point instructions.