views:

178

answers:

7

Generally, everything I come across on the net relating to SSE/MMX is maths stuff for vectors and matrices. However, I'm looking for libraries of SSE-optimized 'standard functions', like those provided by Agner Fog, or some of the SSE-based string scanning algorithms in GCC.

As a quick general rundown: these would be things like memset, memcpy, strstr, memcmp, BSR/BSF, i.e. an stdlib-esque library built from SSE instructions.

I'd preferably like them to be for SSE1 (informally known as MMX2), using intrinsics rather than assembly, but either is fine. Hopefully this is not too broad a spectrum.

Update 1

I came across some promising stuff after some searching; one library caught my eye:

  • LibFreeVec: seems Mac/IBM only (due to being AltiVec based), thus of little use (to me); plus I can't seem to find a direct download link, nor does it state the minimum supported SSE version

I also came across an article on a few vectorised string functions (strlen, strstr, strcmp). However, SSE4.2 is way out of my reach (as said before, I'd like to stick to SSE1/MMX).

Update 2

Paul R motivated me to do a little benchmarking. Unfortunately, as my SSE assembly coding experience is close to zip, I used someone else's ( http://www.mindcontrol.org/~hplus/ ) benchmarking code and added to it. All tests (excluding the original, which is VC6 SP5) were compiled under VC9 SP1 with full/customized optimizations and /arch:SSE on.

The first test was on my home machine (AMD Sempron 2200+, 512 MB DDR 333), capped at SSE1 (thus no vectorization by MSVC memcpy):

comparing P-III SIMD copytest (blocksize 4096) to memcpy
calculated CPU speed: 1494.0 MHz
  size  SSE Cycles      thru-sse    memcpy Cycles   thru-memcpy     asm Cycles      thru-asm
   1 kB 2879        506.75 MB/s     4132        353.08 MB/s     2655        549.51 MB/s
   2 kB 4877        598.29 MB/s     7041        414.41 MB/s     5179        563.41 MB/s
   4 kB 8890        656.44 MB/s     13123       444.70 MB/s     9832        593.55 MB/s
   8 kB 17413       670.28 MB/s     25128       464.48 MB/s     19403       601.53 MB/s
  16 kB 34569       675.26 MB/s     48227       484.02 MB/s     38303       609.43 MB/s
  32 kB 68992       676.69 MB/s     95582       488.44 MB/s     75969       614.54 MB/s
  64 kB 138637      673.50 MB/s     195012      478.80 MB/s     151716      615.44 MB/s
 128 kB 277678      672.52 MB/s     400484      466.30 MB/s     304670      612.94 MB/s
 256 kB 565227      660.78 MB/s     906572      411.98 MB/s     618394      603.97 MB/s
 512 kB 1142478     653.82 MB/s     1936657     385.70 MB/s     1380146     541.23 MB/s
1024 kB 2268244     658.64 MB/s     3989323     374.49 MB/s     2917758     512.02 MB/s
2048 kB 4556890     655.69 MB/s     8299992     359.99 MB/s     6166871     484.51 MB/s
4096 kB 9307132     642.07 MB/s     16873183        354.16 MB/s     12531689    476.86 MB/s

full tests

The second test batch was done on a university workstation (Intel E6550, 2.33 GHz, 2 GB DDR2-800?):

VC9 SSE/memcpy/ASM:
comparing P-III SIMD copytest (blocksize 4096) to memcpy
calculated CPU speed: 2327.2 MHz
  size  SSE Cycles      thru-sse    memcpy Cycles   thru-memcpy     asm Cycles      thru-asm
   1 kB 392         5797.69 MB/s    434         5236.63 MB/s    420         5411.18 MB/s
   2 kB 882         5153.51 MB/s    707         6429.13 MB/s    714         6366.10 MB/s
   4 kB 2044        4447.55 MB/s    1218        7463.70 MB/s    1218        7463.70 MB/s
   8 kB 3941        4613.44 MB/s    2170        8378.60 MB/s    2303        7894.73 MB/s
  16 kB 7791        4667.33 MB/s    4130        8804.63 MB/s    4410        8245.61 MB/s
  32 kB 15470       4701.12 MB/s    7959        9137.61 MB/s    8708        8351.66 MB/s
  64 kB 30716       4735.40 MB/s    15638       9301.22 MB/s    17458       8331.57 MB/s
 128 kB 61019       4767.45 MB/s    31136       9343.05 MB/s    35259       8250.52 MB/s
 256 kB 122164      4762.53 MB/s    62307       9337.80 MB/s    72688       8004.21 MB/s
 512 kB 246302      4724.36 MB/s    129577      8980.15 MB/s    142709      8153.80 MB/s
1024 kB 502572      4630.66 MB/s    332941      6989.95 MB/s    290528      8010.38 MB/s
2048 kB 1105076     4211.91 MB/s    1384908     3360.86 MB/s    662172      7029.11 MB/s
4096 kB 2815589     3306.22 MB/s    4342289     2143.79 MB/s    2172961     4284.00 MB/s

full tests

As can be seen, SSE is very fast on my home system, but falls behind on the Intel machine (probably due to bad coding?). My x86 assembly variant comes in second on my home machine, and second on the Intel system (though the results look a bit inconsistent; on huge blocks it dominates the SSE1 version). The MSVC memcpy wins the Intel system tests hands down, but this is due to SSE2 vectorization; on my home machine it fails dismally, and even the horrible __movsd beats it...

Pitfalls: the memory was all aligned to powers of 2, the cache was (hopefully) flushed, and rdtsc was used for timing.

Points of interest: MSVC has an (unlisted in any reference) __movsd intrinsic; it outputs the same assembly code I'm using, but it fails dismally (even when inlined!), which is probably why it's unlisted. The VC9 memcpy can be forced to vectorize on my non-SSE2 machine; it will, however, corrupt the FPU stack, and it also seems to have a bug.

This is the full source of what I used to test (including my alterations; again, credit to http://www.mindcontrol.org/~hplus/ for the original). The binaries and project files are available on request.

In conclusion, it seems a switching variant might be best, similar to the MSVC CRT one, only a lot sturdier, with more options and a single once-off check (via inlined function pointers? or something more devious like an internal direct call patch). Inlining, however, would probably have to use a best-case method instead.
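As a rough sketch of that once-off check idea: the first call goes through a resolver that patches the function pointer, so every later call dispatches directly. Note that `memcpy_sse`, `memcpy_scalar` and `cpu_has_sse` below are hypothetical placeholders (the "SSE" body just forwards to memcpy), not real implementations:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical SSE variant -- stands in for a real SSE1 implementation. */
static void *memcpy_sse(void *dst, const void *src, size_t n) {
    return memcpy(dst, src, n); /* placeholder body */
}

/* Hypothetical scalar fallback. */
static void *memcpy_scalar(void *dst, const void *src, size_t n) {
    return memcpy(dst, src, n);
}

static void *memcpy_resolve(void *dst, const void *src, size_t n);

/* Starts out pointing at the resolver; patched on first use. */
static void *(*memcpy_best)(void *, const void *, size_t) = memcpy_resolve;

static int cpu_has_sse(void) {
    return 1; /* stand-in for a real CPUID feature check */
}

/* Runs once: picks the best variant, patches the pointer, then forwards. */
static void *memcpy_resolve(void *dst, const void *src, size_t n) {
    memcpy_best = cpu_has_sse() ? memcpy_sse : memcpy_scalar;
    return memcpy_best(dst, src, n);
}
```

Callers just use `memcpy_best(dst, src, n)`; after the first call there is no branching left, only an indirect call through the patched pointer.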

Update 3

A question asked by Eshan reminded me of something useful and related to this, although only for bit sets and bit ops: BitMagic can be quite useful for large bit sets, and it even has a nice article on SSE2 (bit) optimization. Unfortunately, this still isn't a CRT/stdlib-esque library; it seems most of these projects are dedicated to a specific, small section (of problems).

This raises the question: would it then be worthwhile to create an open-source, probably multi-platform, performance CRT/stdlib project, with various versions of the standardised functions, each optimized for a certain situation, as well as a 'best-case'/general-use variant of each function, with either runtime branching for scalar/MMX/SSE/SSE2+ (à la MSVC) or a forced compile-time scalar/SIMD switch? This could be useful for HPC, or for projects where every bit of performance counts (like games), freeing the programmer from worrying about the speed of the built-in functions and only requiring a small bit of tuning to find the optimal optimized variant.
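A minimal sketch of the forced compile-time switch, using the predefined macros `__SSE2__` (GCC/Clang) and `_M_IX86_FP` (MSVC /arch) to pick a variant at build time. `fast_zero` is a hypothetical example function, not from any of the libraries mentioned:

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>

/* SSE2 variant: 16 bytes per store; unaligned stores keep the sketch simple. */
static void fast_zero(void *p, size_t n) {
    unsigned char *b = (unsigned char *)p;
    __m128i z = _mm_setzero_si128();
    while (n >= 16) { _mm_storeu_si128((__m128i *)b, z); b += 16; n -= 16; }
    while (n--) *b++ = 0;
}
#else
/* Scalar fallback: byte head until word-aligned, then a word at a time. */
static void fast_zero(void *p, size_t n) {
    unsigned char *b = (unsigned char *)p;
    while (n && ((uintptr_t)b & (sizeof(size_t) - 1))) { *b++ = 0; n--; }
    while (n >= sizeof(size_t)) {
        *(size_t *)(void *)b = 0;
        b += sizeof(size_t); n -= sizeof(size_t);
    }
    while (n--) *b++ = 0;
}
#endif
```

The same #if ladder extends naturally to MMX/SSE1/SSE2+ tiers, with the scalar version always present as the portable fallback.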

Update 4

I think the nature of this question should be expanded to include techniques that can be applied using SSE/MMX for optimizing non-vector/matrix applications; these could probably be used for 32/64-bit scalar code as well. A good example is checking for the occurrence of a byte in a given 32/64/128/256-bit data type at once, using scalar techniques (bit manipulation), MMX and SSE/SIMD.
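For the scalar (bit manipulation) case, the well-known "haszero"/"hasvalue" trick from Bit Twiddling Hacks does exactly this for a 32-bit word:

```c
#include <stdint.h>

/* True if any byte of v is zero: (v - 0x01..01) borrows out of any zero
   byte; masking with ~v and the high bits isolates bytes that were zero
   to begin with. */
static int has_zero_byte(uint32_t v) {
    return ((v - 0x01010101u) & ~v & 0x80808080u) != 0;
}

/* True if any byte of v equals b: XOR turns matching bytes into zero. */
static int has_byte(uint32_t v, uint8_t b) {
    return has_zero_byte(v ^ (0x01010101u * b));
}
```

The same idea scales to 64-bit words with wider constants, and the SSE/SIMD equivalent is a broadcast of the byte, `_mm_cmpeq_epi8` against the data, then `_mm_movemask_epi8` to get a per-byte hit mask.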

Also, I see a lot of answers along the lines of "just use ICC", and while that's a good answer, it's not my kind of answer. Firstly, ICC is not something I can use continuously (unless Intel has a free student version for Windows), due to the 30-day trial. Secondly, and more pertinently, I'm not only after the libraries themselves, but also the techniques used to optimize/create the functions they contain, for my personal edification and improvement, and so I can apply such techniques and principles to my own code (where needed), in conjunction with the use of these libraries. Hopefully that clears up that part :)

+1  A: 

For simple operations such as memset, memcpy, etc, where there is very little computation, there is little point in SIMD optimisation, since memory bandwidth will usually be the limiting factor.

Paul R
But for memory ops you're not leveraging the power of the coprocessor for processing, but rather for its ability to operate on much larger data sets (8, 16+ bytes at a time) with the same latency as the built-in x86 instructions. Dr Fog should have some comparisons showing this somewhere in his 5-volume 'guide'. And yes, I'm aware this only matters for hotspots, and that's what I'm using this for.
Necrolis
@Necrolis: it doesn't matter how much more efficient your loads/stores are - if you can max out your memory bandwidth with scalar code (which is usually pretty easy with e.g. memcpy, memset) then there is nothing to be gained from further optimisation.
Paul R
The thing is, I'm not maxing it out with scalar code, except for very small buffers. Though I attribute that in part to MSVC not inlining calls to memset etc.; if that wasn't the case, __assume could be used to 'force' aligned copies, removing the branching (i.e. why bother with word and byte cases when everything is a multiple of long), and then it would probably be very close to SSE, at least on my system.
Necrolis
@Necrolis: it may well be that you have inefficient implementations of memset/memcpy, but that still does not justify a SIMD implementation - you can almost certainly write a more efficient scalar implementation of these routines that *will* max out memory bandwidth without resorting to SIMD. However it's an interesting exercise and you'll learn a lot in the process so if you have the time and inclination then go for it.
Paul R
@Paul R - I do have some of these SSEx general-purpose functions that do outperform the respective non-SSE versions, although I don't know _how_ far from optimal the non-SSE versions are. I was therefore wondering if you have any data to support your claim re. scalar code being as performant as SSE? Or could you point us to some scalar code, that in your view, does max out memory bandwidth?
PhiS
Just noticed I replied to PhiS's comment without looking at who it was addressed to; my reply, however, still holds. libfreevec shows one or two functions in its benchmarks being dominated by a glibc 32-bit variant, but the rest are all won by the SSE versions.
Necrolis
@PhiS: I do have relevant data but it's work-related and I'm not at liberty to share it. To be fair I think there are some very specific cases where you can get a small improvement with SIMD, but it's very architecture-specific and the gains are small, which makes a general purpose implementation a little tricky and hard to justify. The big wins with SIMD are with computational code of course, where you can see an order of magnitude performance boost, compared to maybe 10% - 20% at best for a SIMD memcpy.
Paul R
@Necrolis: how big is the speed difference you are seeing? I'm guessing it's small? Out of interest, you might want to try using the Intel ICC compiler and see if you can beat its memset/memcpy with anything hand-rolled.
Paul R
@Paul R - thanks for the info. I agree that in a fair number of cases the gains aren't huge, but then again, a 20% performance improvement can be significant.
PhiS
@Paul R - I just checked, the range of speed improvements (SIMD/non-SIMD ratios) vary from ~1.5x to ~22x (!) in my test cases (most of these are SSE2 or SSSE3).
PhiS
@PhiS: 22x sounds a little suspect! Is this just memcpy, or do you have other library functions too? And have you tried calculating the throughput in GB/sec and comparing it with your cache and/or DRAM bandwidth?
Paul R
@Paul R: gonna update the post with some tests showing SSE(1) vs the MSVC CRT vs my own x86 asm version. Unfortunately no ICC, as I don't have access to it (you've gotta pay for the Windows versions, last I looked).
Necrolis
@Necrolis: you can get a free 30 day evaluation license for ICC from Intel. Also I believe the Linux version is free for non-commercial use. Note that MSVC is a pretty poor compiler and is probably quite misleading for any kind of baseline measurement of performance - better to use gcc or ICC.
Paul R
@Paul R: checking out that ICC trial, thanks for the pointer (will add GCC 4.5.1 too). As for MSVC being 'a bad compiler', I don't think that's true at all; yes, it has its pitfalls, but most of the time those can be gotten around. Its scalar optimization is very good (it used to be streets ahead of GCC; now they are on par), though its vectorization leaves a lot to be desired. IMO most 'bad tests' come from poor setup/project options and poor code; I spend a lot of time looking at MSVC output, and it comes out pretty clean and well optimized. Sometimes it can break, though, unfortunately...
Necrolis
@Necrolis: MSVC generates really bad code for modern CPUs (Core 2 Duo, Core i7, etc) - try running the same code on MSVC and ICC to see the difference. It also generates horrible SSE code. Not to mention the fact that it *still* doesn't support C99 and has horrible and unnecessary ABI restrictions which make SSE coding unnecessarily cumbersome. I guess a lot of people like it because it's what they are used to.
Paul R
@Paul R: I have no argument about the SSE code, hence why I'm looking for SSE functions to avoid MSVC's ones :P The lack of C99 is annoying, but not a deal breaker. Unfortunately I don't have access to any 'recent' CPUs other than the Core 2 Duos at uni, but I'll give it a try and see what pops out. Got any suggestions for a good piece of code to use as a 'playground'/testbed in this regard?
Necrolis
@Necrolis: SSE doesn't really get interesting until you have at least a Core 2 Duo and SSSE3. Prior to that it was a kludge (128 bit operations were really 2 x 64 bit operations, and the instruction set was very limited). I guess you can still play with some of the basics though, even if you have an old PC, but if you're serious about SIMD and performance then you might want to look at getting something a little more up-to-date.
Paul R
@Paul R: yeah, unfortunately it seems all the really 'cool' stuff is out of my reach, and being a student in a country with a horrid exchange rate makes it that much harder :| I've been running into the same thing with HLSL and pixel shaders; makes one curse the advancement of technology sometimes :)
Necrolis
@Necrolis: commiserations - all I can suggest is writing to [email protected]. ;-)
Paul R
@Paul R: HAHAHA, now you've made me curious as to whether that's a valid email address, though I'd be more inclined to ask for a nice vacation programming job; then I can get some exp AND a new PC :)
Necrolis
@Paul R - no, 22x wasn't for memcpy, but for basic string-scanning functions (pre-SSE4, though).
PhiS
+1  A: 

Maybe libSIMDx86?

http://simdx86.sourceforge.net

frank
Although a nice library, it's mainly geared toward matrix and vector math (the only parts of interest to me in it are the three rooting functions from the math section).
Necrolis
+1  A: 

You can use Apple's or OpenSolaris's libc. These libc implementations contain what you are looking for. I was looking for this kind of thing some 6 years back, and I had to painfully write it the hard way.

Ages ago I remember following a coding contest called the 'fastcode' project. They did some awesome, groundbreaking optimisation for that time using Delphi. See their results page. Since it is written for Pascal's fast calling convention (arguments passed in registers), converting it to C-style calling conventions (arguments pushed on the stack) may be a bit awkward. This project has had no updates in a long time; in particular, no code is written for SSE4.2.

Solaris -> src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libc/

Apple -> www.opensource.apple.com/source/Libc/Libc-594.9.1/
Eshan
These look promising; unfortunately I don't have time to go spelunking around the Apple/Solaris libs (it looks like a maze of folders to me). The fastcode stuff looks really good, though; pity not everything there seems to have source code.
Necrolis
Just a small note: most such implementations put each function in its own file. So all you need to do is find a directory which mentions the platform architecture, say 'x86' or 'i386', and search for file names ending in '.s'.
Eshan
@Necrolis, @Paul R: Did you bump into similar high-speed optimisations using GPUs like nVidia's or ATI's? Is it possible? I've heard a lot about it, but never had a chance to see any assembly stuff that actually makes use of it. At best I end up with OpenGL or DirectX calls, but nothing below that.
Eshan
+2  A: 

Here's an article on how to use SIMD instructions to vectorize the counting of characters:

http://porg.es/blog/ridiculous-utf-8-character-counting
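A sketch of the article's core idea: count code points by counting the bytes that are not 10xxxxxx continuation bytes. This version is written with SSE2 intrinsics rather than GCC built-ins, so it should map to MSVC as well; it is a simplified illustration, not the article's actual code:

```c
#include <emmintrin.h>  /* SSE2 */
#include <stddef.h>

/* Count UTF-8 code points in s[0..n): a byte starts a code point unless
   its top two bits are 10 (a continuation byte). */
static size_t utf8_count(const unsigned char *s, size_t n) {
    size_t count = 0, i = 0;
    const __m128i mask = _mm_set1_epi8((char)0xC0);
    const __m128i cont = _mm_set1_epi8((char)0x80);
    for (; i + 16 <= n; i += 16) {
        __m128i chunk   = _mm_loadu_si128((const __m128i *)(s + i));
        __m128i top     = _mm_and_si128(chunk, mask);
        __m128i is_cont = _mm_cmpeq_epi8(top, cont);
        unsigned c = (unsigned)_mm_movemask_epi8(is_cont);
        /* 16-bit popcount of the continuation mask, via bit tricks. */
        c = c - ((c >> 1) & 0x5555u);
        c = (c & 0x3333u) + ((c >> 2) & 0x3333u);
        c = (c + (c >> 4)) & 0x0F0Fu;
        count += 16 - ((c + (c >> 8)) & 0xFFu);
    }
    for (; i < n; i++)              /* scalar tail */
        if ((s[i] & 0xC0) != 0x80) count++;
    return count;
}
```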

carlo
+1, very nice; just a pity it's SSE2 and one needs to map the GCC built-ins to MSVC :(
Necrolis
A: 

I personally wouldn't bother trying to write super-optimized versions of libc functions trying to handle every possible scenario with good performance.

Instead, write optimized versions for specific situations, where you know enough about the problem at hand to write proper code... and where it matters. There's a semantic difference between memset and ClearLargeBufferCacheWriteThrough.
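To illustrate the distinction: a ClearLargeBufferCacheWriteThrough-style routine can use non-temporal stores (available since SSE1) to avoid polluting the cache with data that won't be read back soon, something a general-purpose memset should not blindly do. A minimal sketch, assuming a 16-byte-aligned buffer whose size is a multiple of 16 (a real implementation would handle the edges):

```c
#include <xmmintrin.h>  /* SSE1 */
#include <stddef.h>

/* Zero a large buffer with streaming stores that bypass the cache.
   Precondition: p is 16-byte aligned, n is a multiple of 16. */
static void clear_large_buffer_nt(void *p, size_t n) {
    float *f = (float *)p;
    __m128 zero = _mm_setzero_ps();
    for (size_t i = 0; i < n / 16; i++, f += 4)
        _mm_stream_ps(f, zero);   /* non-temporal 16-byte store */
    _mm_sfence();                 /* make the streaming stores visible */
}
```

For a buffer that will be read again immediately, ordinary cached stores (or plain memset) are the right tool; this variant only pays off when the write is genuinely "fire and forget".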

snemarch
Yes, that's why I mention both a best case for general use, and versions that are far more specific (and configurable via defines). I think I'm just gonna start something on GitHub during the Christmas break and see if I can address this; my problem then boils down to my SSE knowledge being poor, while my x86 optimization knowledge, on the other hand, is very strong.
Necrolis
+1  A: 

Honestly, what I would do is just install the Intel C++ Compiler and learn the various automated SIMD optimization flags available. We've had very good experience optimizing code performance by simply compiling it with ICC.

Keep in mind that the entire STL library is basically just header files, so the whole thing is compiled into your exe/lib/dll, and as such can be optimized however you like.

ICC has many options and lets you specify (at the simplest) which SSE levels to target. You can also use it to generate a binary file with multiple code paths, such that if the optimal SSE configuration you compiled against isn't available, it'll run a different set of (still optimized) code configured for a less capable SIMD CPU.

Computer Guru
Of course, I'd have to do all that within 30 days, as I don't have the money to purchase a 'full' licence. I decided one path is to do what you recommended, but using GCC 4.5.x instead; however, it's still time consuming, and I was hoping someone had already gone through part of this. Also, the standard library isn't always shoved (statically linked) into the binary; with MSVC, it'll link to msvcrtxx.dll for most non-trivial functions.
Necrolis
GCC doesn't do the same optimizations ICC does - that's why there's a copy of ICC for linux and specially compiled Linux kernels that tout the fact they've been compiled with ICC. I didn't mean to say STD but rather STL. STL is always statically linked, IIRC.
Computer Guru
+1  A: 

Here's a fast memcpy implementation in C that can replace the standard library version of memcpy if necessary:

http://www.danielvik.com/2010/02/fast-memcpy-in-c.html
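The core of the approach in the article is copying in machine-word chunks with a byte tail. A simplified sketch of that idea (the article's actual version adds alignment handling and loop unrolling, which matter a lot for real throughput):

```c
#include <stddef.h>

/* Word-at-a-time copy: machine words in the main loop, bytes at the tail.
   Assumes the platform tolerates unaligned word accesses (as x86 does). */
static void *memcpy_words(void *dst, const void *src, size_t n) {
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;
    while (n >= sizeof(size_t)) {
        *(size_t *)(void *)d = *(const size_t *)(const void *)s;
        d += sizeof(size_t); s += sizeof(size_t); n -= sizeof(size_t);
    }
    while (n--) *d++ = *s++;
    return dst;
}
```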

jefd
It's a nice link; however, his version falls quite hard: my assembly version goes at almost double the speed, and sometimes more than double (the first set is his, under thru-c; the second set is mine, under thru-asm): http://necrolis.pastebin.com/pAFzJYr7
Necrolis