Generally everything I come across 'on-the-net' with relation to SSE/MMX comes out as maths stuff for vectors and matracies. However, I'm looking for libraries of SSE optimized 'standard functions', like those provided by Agner Fog, or some of the SSE based string scanning algorithms in GCC.
As a quick general rundown: these would be things like memset, memcpy, strstr, memcmp BSR/BSF, ie an stdlib-esque built from SSE intrsuctions
I'd preferably like them to be for SSE1(formally MMX2) using intrinsics rather than assembly, but either is fine. hopefully this not too broad a spectrum.
Update 1
I came across some promising stuff after sopme searching, one library caught my eye:
- LibFreeVec: seems mac/IBM only(due to being AltiVec based), thus of little use(to me), plus I can't seem to find a direct download link, nor does it state the minimum supported SSE version
I also came across an article on a few vectorised string functions(strlen, strstr strcmp). however SSE4.2 is way out of my reach(as said before, I'd like to stick to SSE1/MMX)
Update 2
Paul R motivated me to do a little benchmarking, unfortunately as my SSE assembly coding experience is close to zip, I used someone else's ( http://www.mindcontrol.org/~hplus/ ) benchmarking code and added to it. All tests(excluding the original, which is VC6 SP5) where compiled under VC9 SP1 with full/customized optimizations and /arch:SSE
on.
First test was one my home machine(AMD Sempron 2200+ 512mb DDR 333), capped at SSE1(thus no vectorization by MSVC memcpy):
comparing P-III SIMD copytest (blocksize 4096) to memcpy
calculated CPU speed: 1494.0 MHz
size SSE Cycles thru-sse memcpy Cycles thru-memcpy asm Cycles thru-asm
1 kB 2879 506.75 MB/s 4132 353.08 MB/s 2655 549.51 MB/s
2 kB 4877 598.29 MB/s 7041 414.41 MB/s 5179 563.41 MB/s
4 kB 8890 656.44 MB/s 13123 444.70 MB/s 9832 593.55 MB/s
8 kB 17413 670.28 MB/s 25128 464.48 MB/s 19403 601.53 MB/s
16 kB 34569 675.26 MB/s 48227 484.02 MB/s 38303 609.43 MB/s
32 kB 68992 676.69 MB/s 95582 488.44 MB/s 75969 614.54 MB/s
64 kB 138637 673.50 MB/s 195012 478.80 MB/s 151716 615.44 MB/s
128 kB 277678 672.52 MB/s 400484 466.30 MB/s 304670 612.94 MB/s
256 kB 565227 660.78 MB/s 906572 411.98 MB/s 618394 603.97 MB/s
512 kB 1142478 653.82 MB/s 1936657 385.70 MB/s 1380146 541.23 MB/s
1024 kB 2268244 658.64 MB/s 3989323 374.49 MB/s 2917758 512.02 MB/s
2048 kB 4556890 655.69 MB/s 8299992 359.99 MB/s 6166871 484.51 MB/s
4096 kB 9307132 642.07 MB/s 16873183 354.16 MB/s 12531689 476.86 MB/s
Second test batch was done on a university workstation(Intel E6550, 2.33Ghz, 2gb DDR2 800?)
VC9 SSE/memcpy/ASM:
comparing P-III SIMD copytest (blocksize 4096) to memcpy
calculated CPU speed: 2327.2 MHz
size SSE Cycles thru-sse memcpy Cycles thru-memcpy asm Cycles thru-asm
1 kB 392 5797.69 MB/s 434 5236.63 MB/s 420 5411.18 MB/s
2 kB 882 5153.51 MB/s 707 6429.13 MB/s 714 6366.10 MB/s
4 kB 2044 4447.55 MB/s 1218 7463.70 MB/s 1218 7463.70 MB/s
8 kB 3941 4613.44 MB/s 2170 8378.60 MB/s 2303 7894.73 MB/s
16 kB 7791 4667.33 MB/s 4130 8804.63 MB/s 4410 8245.61 MB/s
32 kB 15470 4701.12 MB/s 7959 9137.61 MB/s 8708 8351.66 MB/s
64 kB 30716 4735.40 MB/s 15638 9301.22 MB/s 17458 8331.57 MB/s
128 kB 61019 4767.45 MB/s 31136 9343.05 MB/s 35259 8250.52 MB/s
256 kB 122164 4762.53 MB/s 62307 9337.80 MB/s 72688 8004.21 MB/s
512 kB 246302 4724.36 MB/s 129577 8980.15 MB/s 142709 8153.80 MB/s
1024 kB 502572 4630.66 MB/s 332941 6989.95 MB/s 290528 8010.38 MB/s
2048 kB 1105076 4211.91 MB/s 1384908 3360.86 MB/s 662172 7029.11 MB/s
4096 kB 2815589 3306.22 MB/s 4342289 2143.79 MB/s 2172961 4284.00 MB/s
As can be seen, SSE is very fast on my home system, but falls on the intel machine(probably due to bad coding?). my x86 assembly variant comes in second on my home machine, and second on the intel system(but the results look a bit inconsistent, one hug blocks it dominates the SSE1 version). the MSVC memcpy wins the intel system tests hands done, this is due to SSE2 vectorization though, on my home machine, it fails dismally, even the horrible __movsd
beats it...
pitfalls: the memory was all aligned powers of 2. cache was (hopefully) flushed. rdtsc was used for timing.
points of interest: MSVC has an (unlisted in any reference) __movsd intrinsic, it outputs the same assembly code I'm using, but it fails dismally(even when inlined!). probably why its unlisted. the VC9 memcpy can be forced to vectorize on my non-sse 2 machine, it will however corrupt the FPU stack, it also seems to have a bug.
this is the full source to what I used to test(including my alterations, again, credit to http://www.mindcontrol.org/~hplus/ for the origional). the binaries an project files are available on request.
In conclusion, it seems a switching variant might be the best, similar to the MSVC crt one, only a lot more sturdy with more options and single once-off checks(via inline'd function pointers? or something more devious like internal direct call patch), however inlining would probably have to use a best case method instead
Update 3
A question asked by Eshan reminded of something useful and related to this, although only for bit sets and bit ops, BitMagic and be quite useful for large bit sets, it even has a nice article on SSE2 (bit) optimization. Unfortunatly, this still isn't a CRT/stdlib esque type library. its seems most of these projects are dedicated to a specific, small section (of problems).
This raises the question, would it then rather be worth will to create a open-source, probably multi-platform performance crt/stdlib project, creating various versions of the standardised functions, each optimizied for certian situation as well as a 'best-case'/general use variant of the function, with either runtime branching for scalar/MMX/SSE/SSE2+ (al la MSVC) or a forced compile time scalar/SIMD switch. This could be useful for HPC, or projects where every bit of performance counts(like games), freeing the programmer from worrying about the speed of the inbuilt functions, only requiring a small bit of tuning to find the optimal optimized variant.
Update 4
I think the nature of this question should be expanded, to include techniques that can be applied using SSE/MMX to optimization for non-vector/matrix applications, this could probably be used for 32/64bit scalar code as well. an good example is how to check for the occurance of a byte in a given 32/64/128/256 bit data type, at once using scalar techniques(bit manip), MMX & SSE/SIMD
Also, I see a lot of answers along the lines of "just use ICC", and thats a good answer, it not my kinda of answer, as firstly, ICC is not something I can use continuosly(unless intel have a free student version for windows), due to the 30 trial. secondly, and more pertiantly, I'm not only after the libraries its self, but the techniques used to optimize/create the functions they contain, for my personal eddification and improvement, and so I can apply such techniques and principles to my own code(where needed), in conjuntion with the use of these libraries. hopefully that clears up that part :)