ansaurus

Question

Answer 1

+1 A:

For simple operations such as memset, memcpy, etc., where there is very little computation, there is little point in SIMD optimisation, since memory bandwidth will usually be the limiting factor.

Paul R 2010-10-05 10:52:53

But for memory ops you're not leveraging the power of the coprocessor for processing, but rather for its ability to operate on much larger chunks of data (8, 16+ bytes at a time) with the same latency as the built-in x86 instructions. Dr Fog should have some comparisons showing this somewhere in his 5-volume 'guide'. And yes, I'm aware this only matters for hotspots, and that's what I'm using this for.

Necrolis 2010-10-05 11:04:57
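
To make the "16 bytes at a time" point above concrete, here is a minimal sketch of a copy loop built on SSE2 loads and stores. The helper name `copy_16_at_a_time` and the `SKETCH_HAVE_SSE2` macro are made up for illustration and are not from the thread; the code falls back to a plain byte loop when SSE2 is not compiled in. Whether it beats a good scalar memcpy is exactly what the comments here disagree about, so treat it as something to measure rather than a result.

```c
#include <stddef.h>

/* Assumption: SSE2 is detected at compile time via common compiler macros. */
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#define SKETCH_HAVE_SSE2 1
#include <emmintrin.h> /* SSE2 intrinsics */
#else
#define SKETCH_HAVE_SSE2 0
#endif

/* Hypothetical helper (not from the thread): copy n bytes from src to dst.
   The buffers must not overlap, as with memcpy. */
static void *copy_16_at_a_time(void *dst, const void *src, size_t n)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;

#if SKETCH_HAVE_SSE2
    /* Main loop: one 16-byte load and one 16-byte store per iteration.
       The unaligned variants are valid for any address. */
    while (n >= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        _mm_storeu_si128((__m128i *)d, v);
        s += 16;
        d += 16;
        n -= 16;
    }
#endif

    /* Scalar tail; this is also the whole copy when SSE2 is not compiled in. */
    for (; n > 0; --n)
        *d++ = *s++;

    return dst;
}
```

Calling `copy_16_at_a_time(dst, src, 37)` copies two full 16-byte blocks and then five bytes in the scalar tail.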

@Necrolis: it doesn't matter how much more efficient your loads/stores are - if you can max out your memory bandwidth with scalar code (which is usually pretty easy with e.g. memcpy, memset) then there is nothing to be gained from further optimisation.

Paul R 2010-10-05 11:13:13

The thing is, I'm not maxing it out with scalar code, except for very small buffers (though I attribute this in part to MSVC not inlining calls to memset etc. If it did inline them, __assume could be used to 'force' aligned copies and remove the branching, i.e. why bother with word and byte cases when everything is a multiple of long? Then it would probably be very close to SSE, at least on my system).

Necrolis 2010-10-05 11:41:10
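
The idea in the comment above (drop the byte and word cases when alignment and size are known in advance) can be sketched as follows. This is only an illustration under stated assumptions: the helper name `fill_aligned16` and the `SKETCH_HAVE_SSE2` macro are hypothetical, the unit is 16 bytes rather than `long`, and `assert` stands in for the `__assume` hint, so it checks the precondition instead of promising it to the optimiser.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: SSE2 is detected at compile time via common compiler macros. */
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#define SKETCH_HAVE_SSE2 1
#include <emmintrin.h> /* SSE2 intrinsics */
#else
#define SKETCH_HAVE_SSE2 0
#endif

/* Hypothetical helper: fill n bytes at dst with value.
   Preconditions: dst is 16-byte aligned and n is a multiple of 16,
   so no head or tail handling is needed. */
static void fill_aligned16(void *dst, unsigned char value, size_t n)
{
    assert(((uintptr_t)dst & 15u) == 0);
    assert((n & 15u) == 0);

#if SKETCH_HAVE_SSE2
    const __m128i v = _mm_set1_epi8((char)value); /* value in all 16 lanes */
    __m128i *p = (__m128i *)dst;
    for (size_t i = 0; i < n / 16; ++i)
        _mm_store_si128(p + i, v); /* aligned 16-byte store */
#else
    unsigned char *p = (unsigned char *)dst;
    for (size_t i = 0; i < n; ++i)
        p[i] = value;
#endif
}
```

The loop body has a single store and no size-dependent branches, which is the simplification the comment is after.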

@Necrolis: it may well be that you have inefficient implementations of memset/memcpy, but that still does not justify a SIMD implementation - you can almost certainly write a more efficient scalar implementation of these routines that *will* max out memory bandwidth without resorting to SIMD. However it's an interesting exercise and you'll learn a lot in the process so if you have the time and inclination then go for it.

Paul R 2010-10-05 12:25:38

@Paul R - I do have some of these SSEx general-purpose functions that do outperform the respective non-SSE versions, although I don't know _how_ far from optimal the non-SSE versions are. I was therefore wondering if you have any data to support your claim re. scalar code being as performant as SSE? Or could you point us to some scalar code, that in your view, does max out memory bandwidth?

PhiS 2010-10-05 14:35:40

Necrolis 2010-10-05 14:51:14

Just noticed I replied to PhiS's comment without looking at who it was addressed to. My reply still holds, however: libfreevec shows one or two functions in its benchmarks that are dominated by a glibc 32 variant, but the rest are all won by SSE versions.

Necrolis 2010-10-05 15:12:31

@PhiS: I do have relevant data but it's work-related and I'm not at liberty to share it. To be fair I think there are some very specific cases where you can get a small improvement with SIMD, but it's very architecture-specific and the gains are small, which makes a general purpose implementation a little tricky and hard to justify. The big wins with SIMD are with computational code of course, where you can see an order of magnitude performance boost, compared to maybe 10% - 20% at best for a SIMD memcpy.

Paul R 2010-10-05 16:00:32

@Necrolis: how big is the speed difference you are seeing? I'm guessing it's small? Out of interest you might want to try using the Intel ICC compiler - see if you can beat its memset/memcpy with anything hand-rolled.

Paul R 2010-10-05 16:02:56

@Paul R - thanks for the info. I agree that in a fair number of cases the gains aren't huge, but then again, a 20% performance improvement can be significant.

PhiS 2010-10-05 17:00:02

@Paul R - I just checked, the speed improvements (SIMD/non-SIMD ratios) range from ~1.5x to ~22x (!) in my test cases (most of these are SSE2 or SSSE3).

PhiS 2010-10-05 17:05:02

@PhiS: 22x sounds a little suspect! Is this just memcpy or do you have other library functions too? And have you tried calculating the throughput in GB/sec and comparing this with your cache and/or DRAM bandwidth?

Paul R 2010-10-05 17:50:09
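
The throughput check suggested above is simple arithmetic: bytes moved divided by elapsed time. A rough sketch, assuming 1 GB = 1e9 bytes and using the standard `clock()` timer; the function names are hypothetical, and the result depends heavily on buffer size (cache versus DRAM), so no expected figure is given.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Convert a byte count and an elapsed time into GB/s (1 GB = 1e9 bytes). */
static double throughput_gb_per_s(double bytes, double seconds)
{
    return bytes / seconds / 1e9;
}

/* Hypothetical benchmark helper: copy `bytes` bytes `reps` times with memcpy
   and return the measured throughput, or 0.0 if it could not be measured. */
static double measure_memcpy_gb_per_s(size_t bytes, int reps)
{
    unsigned char *src = malloc(bytes);
    unsigned char *dst = malloc(bytes);
    volatile unsigned char sink = 0; /* keeps the copies observable */
    double seconds = 0.0;

    if (src != NULL && dst != NULL && bytes > 0) {
        memset(src, 1, bytes); /* touch the pages before timing */
        memset(dst, 0, bytes);

        clock_t start = clock();
        for (int i = 0; i < reps; ++i) {
            memcpy(dst, src, bytes);
            sink = (unsigned char)(sink + dst[bytes - 1]);
        }
        seconds = (double)(clock() - start) / CLOCKS_PER_SEC;
    }

    free(src);
    free(dst);

    if (seconds <= 0.0)
        return 0.0; /* timer too coarse or allocation failed */
    return throughput_gb_per_s((double)bytes * reps, seconds);
}
```

Note that this counts bytes copied; the actual memory traffic includes both the read and the write, so double the figure before comparing it with a quoted bus bandwidth. Running it with a buffer that fits in cache and one that does not gives the two reference points mentioned above.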

@Paul R: gonna update the post with some tests, showing SSE(1) vs the MSVC CRT vs my own x86-asm version. Unfortunately no ICC, as I don't have access to it (gotta pay for the Windows versions last I looked).

Necrolis 2010-10-06 06:47:19

@Necrolis: you can get a free 30 day evaluation license for ICC from Intel. Also I believe the Linux version is free for non-commercial use. Note that MSVC is a pretty poor compiler and is probably quite misleading for any kind of baseline measurement of performance - better to use gcc or ICC.

Paul R 2010-10-06 08:09:16

@Paul R: checking out that ICC trial, thanks for the pointer (will add GCC 4.5.1 too). As for MSVC being 'a bad compiler', I don't think that's true at all. Yes, it has its pitfalls, but most of the time those can be worked around. Its scalar optimization is very good (it used to be streets ahead of GCC, now they are on par), though its vectorization leaves a lot to be desired. IMO most 'bad tests' come from poor setup/project options and poor code. I spend a lot of time looking at MSVC-generated code, and it comes out pretty clean and well optimized; sometimes it can break though, unfortunately...

Necrolis 2010-10-06 14:36:38

@Necrolis: MSVC generates really bad code for modern CPUs (Core 2 Duo, Core i7, etc) - try running the same code on MSVC and ICC to see the difference. It also generates horrible SSE code. Not to mention the fact that it *still* doesn't support C99 and has horrible and unnecessary ABI restrictions which make SSE coding unnecessarily cumbersome. I guess a lot of people like it because it's what they are used to.

Paul R 2010-10-06 15:25:15

@Paul R: I have no argument about the SSE code, which is why I'm looking for SSE functions to avoid MSVC's ones :P The lack of C99 is annoying, but not deal-breaking. Unfortunately I don't have access to any 'recent' CPUs other than the Core 2 Duos at uni, but I'll give it a try and see what pops out. Got any suggestions for a good piece of code to use as a 'playground'/testbed in this regard?

Necrolis 2010-10-06 15:54:59

@Necrolis: SSE doesn't really get interesting until you have at least a Core 2 Duo and SSSE3. Prior to that it was a kludge (128 bit operations were really 2 x 64 bit operations, and the instruction set was very limited). I guess you can still play with some of the basics though, even if you have an old PC, but if you're serious about SIMD and performance then you might want to look at getting something a little more up-to-date.

Paul R 2010-10-06 16:00:51

@Paul R: yeah, unfortunately it seems all the really 'cool' stuff is out of my reach, and being a student in a country with a horrid exchange rate makes it that much harder :| Been running into the same stuff when it comes to HLSL and pixel shaders, makes one curse the advancement of technology sometimes :)

Necrolis 2010-10-06 16:04:17

@Necrolis: commiserations - all I can suggest is writing to [email protected]. ;-)

Paul R 2010-10-06 16:06:19

@Paul R: HAHAHA, now you've made me curious as to whether that's a valid email address, though I'd be more inclined to ask for a nice vacation programming job, then I can get some exp AND a new PC :)

Necrolis 2010-10-06 16:08:18

@Paul R - no, 22x wasn't for memcpy, but for basic string-scanning functions (pre-SSE4, though).

PhiS 2010-10-07 19:09:11
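
For readers wondering what a SIMD string-scanning routine looks like, here is a minimal sketch of a memchr-style byte search with SSE2: compare 16 bytes at once, collapse the result to a bitmask, and locate the first set bit. It is not PhiS's code (which was not posted); the name `find_byte` and the `SKETCH_HAVE_SSE2` macro are hypothetical, and the search is length-bounded so it never reads past the buffer.

```c
#include <stddef.h>

/* Assumption: SSE2 is detected at compile time via common compiler macros. */
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#define SKETCH_HAVE_SSE2 1
#include <emmintrin.h> /* SSE2 intrinsics */
#else
#define SKETCH_HAVE_SSE2 0
#endif

/* Hypothetical helper: return a pointer to the first byte equal to c in
   buf[0..n), or NULL if there is none (same contract as memchr). */
static const void *find_byte(const void *buf, unsigned char c, size_t n)
{
    const unsigned char *p = (const unsigned char *)buf;
    size_t i = 0;

#if SKETCH_HAVE_SSE2
    const __m128i needle = _mm_set1_epi8((char)c); /* c in all 16 lanes */
    for (; i + 16 <= n; i += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(p + i));
        /* Bit k of mask is set when byte k of the chunk equals c. */
        unsigned mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(chunk, needle));
        if (mask != 0) {
            /* The lowest set bit marks the first match in this chunk. */
            while ((mask & 1u) == 0) {
                mask >>= 1;
                ++i;
            }
            return p + i;
        }
    }
#endif

    /* Scalar tail; also the whole search when SSE2 is not compiled in. */
    for (; i < n; ++i)
        if (p[i] == c)
            return p + i;

    return NULL;
}
```

The large ratios quoted for string functions are plausible in principle because the scalar baseline tests one byte per iteration while this tests sixteen, but the actual gain depends on the baseline and the data.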

Answer 2

+1 A:

Maybe libSIMDx86?

http://simdx86.sourceforge.net

frank 2010-10-05 12:32:21

Although a nice library, it's mainly geared toward matrix and vector math (the only parts of interest to me in it are the 3 rooting functions from the math section).

Necrolis 2010-10-05 15:03:10

Answer 3

+1 A:

You can use Apple's or OpenSolaris's libc. These libc implementations contain what you are looking for. I was looking for this kind of thing some 6 years back and had to painfully write it the hard way.

Ages ago I remember following a coding contest called the 'fastcode' project. They did some awesome, groundbreaking optimisation for that time using Delphi. See their results page. Since it is written for Pascal's fast calling convention (passing arguments in registers), converting it to the C-style stdc calling convention (pushing arguments on the stack) may be a bit awkward. The project has not been updated in a long time; in particular, no code has been written for SSE4.2.

Solaris -> src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libc/

Apple -> www.opensource.apple.com/source/Libc/Libc-594.9.1/

Eshan 2010-10-05 15:05:15

These look promising; unfortunately I don't have time to go spelunking around the Apple/Solaris libs (looks like a maze of folders to me). The fastcode stuff looks really good though, pity not everything there seems to have source code.

Necrolis 2010-10-06 06:45:10

Just a small note: most such implementations put each function in its own file. So all you need to do is find a directory that mentions the platform architecture, say 'x86' or 'i386', and look for file names that end with '.s'.

Eshan 2010-10-07 14:26:45

@Necrolis, @Paul R: Did you people bump into similar high-speed optimisation using GPUs like nVidia, or ATI? Is it possible? Heard a lot about it, but never had a chance to see any assembly stuff that actually makes use of it. At best I end up with OpenGL or DirectX calls but nothing below that.

Eshan 2010-10-08 06:18:59

Necrolis 2010-10-08 07:28:15

Answer 4

+2 A:

Here's an article on how to use SIMD instructions to vectorize the counting of characters:

http://porg.es/blog/ridiculous-utf-8-character-counting

carlo 2010-10-06 10:52:01

+1, very nice, just a pity it's SSE2 and one needs to map GCC built-ins to MSVC :(

Necrolis 2010-10-06 14:46:34
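
As a rough idea of what vectorized UTF-8 character counting involves, here is a minimal sketch using the portable SSE2 intrinsics from `<emmintrin.h>` (available in both GCC and MSVC) rather than GCC built-ins. It is not necessarily the approach used in the linked article. It counts every byte that is not a continuation byte (`10xxxxxx`), assumes the input is valid UTF-8, and uses hypothetical names (`utf8_count`, `popcount16`, `SKETCH_HAVE_SSE2`).

```c
#include <stddef.h>

/* Assumption: SSE2 is detected at compile time via common compiler macros. */
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#define SKETCH_HAVE_SSE2 1
#include <emmintrin.h> /* SSE2 intrinsics */
#else
#define SKETCH_HAVE_SSE2 0
#endif

/* Count the set bits in a 16-bit mask (portable, no compiler built-ins). */
static unsigned popcount16(unsigned mask)
{
    unsigned count = 0;
    for (; mask != 0; mask &= mask - 1) /* clears the lowest set bit */
        ++count;
    return count;
}

/* Hypothetical helper: number of code points in n bytes of valid UTF-8. */
static size_t utf8_count(const char *s, size_t n)
{
    size_t count = 0;
    size_t i = 0;

#if SKETCH_HAVE_SSE2
    /* Continuation bytes 0x80..0xBF are -128..-65 as signed chars, so a
       signed "greater than -65" test selects every non-continuation byte. */
    const __m128i limit = _mm_set1_epi8(-65);
    for (; i + 16 <= n; i += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(s + i));
        unsigned mask = (unsigned)_mm_movemask_epi8(_mm_cmpgt_epi8(chunk, limit));
        count += popcount16(mask);
    }
#endif

    /* Scalar tail; also the whole count when SSE2 is not compiled in. */
    for (; i < n; ++i)
        count += ((unsigned char)s[i] & 0xC0u) != 0x80u;

    return count;
}
```

For example, the 6-byte string `"na\xC3\xAFve"` contains one two-byte sequence and therefore counts as 5 code points.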

Answer 5

A:

I personally wouldn't bother trying to write super-optimized versions of libc functions trying to handle every possible scenario with good performance.

Instead, write optimized versions for specific situations, where you know enough about the problem at hand to write proper code... and where it matters. There's a semantic difference between memset and ClearLargeBufferCacheWriteThrough.

snemarch 2010-10-08 09:00:50
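
One possible reading of a special-purpose routine like that is a clear that uses non-temporal (streaming) stores, so a large buffer is zeroed without pulling it through the cache. The sketch below is an interpretation, not snemarch's code: the name `clear_large_buffer_nt` and the `SKETCH_HAVE_SSE2` macro are hypothetical, and the routine deliberately demands a 16-byte-aligned pointer and a size that is a multiple of 16, which a general memset could not do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: SSE2 is detected at compile time via common compiler macros. */
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#define SKETCH_HAVE_SSE2 1
#include <emmintrin.h> /* SSE2 intrinsics */
#else
#define SKETCH_HAVE_SSE2 0
#endif

/* Hypothetical helper: zero n bytes at dst using streaming stores.
   Preconditions: dst is 16-byte aligned and n is a multiple of 16. */
static void clear_large_buffer_nt(void *dst, size_t n)
{
    assert(((uintptr_t)dst & 15u) == 0);
    assert((n & 15u) == 0);

#if SKETCH_HAVE_SSE2
    const __m128i zero = _mm_setzero_si128();
    __m128i *p = (__m128i *)dst;
    for (size_t i = 0; i < n / 16; ++i)
        _mm_stream_si128(p + i, zero); /* store with a non-temporal hint */
    _mm_sfence(); /* order the streamed stores before later stores */
#else
    memset(dst, 0, n); /* fallback when SSE2 is not compiled in */
#endif
}
```

This only pays off when the buffer is large and will not be read again soon; for small buffers that are reused immediately, an ordinary memset that leaves the data in cache is likely the better choice.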

Yes, that's why I mention both a best case for general use and versions that are far more specific (and configurable via defines). I think I'm just gonna start something on GitHub during the Christmas break and see if I can address this. My problem then boils down to my SSE knowledge being poor; my x86 optimization knowledge, on the other hand, is very strong.

Necrolis 2010-10-08 09:33:35

Answer 6

+1 A:

Honestly, what I would do is just install the Intel C++ Compiler and learn the various automated SIMD optimization flags available. We've had very good experience optimizing code performance by simply compiling it with ICC.

Keep in mind that the entire STL library is basically just header files, so the whole thing is compiled into your exe/lib/dll, and as such can be optimized however you like.

ICC has many options and lets you specify (at the simplest) which SSE levels to target. You can also use it to generate a binary file with multiple code paths, such that if the optimal SSE configuration you compiled against isn't available, it'll run a different set of (still optimized) code configured for a less capable SIMD CPU.

Computer Guru 2010-10-08 09:11:17
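
ICC generates those multiple code paths automatically. For comparison, a hand-written analogue of the same idea is a function pointer chosen at run time from the CPU's reported features. The sketch below assumes GCC or Clang on x86 and uses their `__builtin_cpu_supports` built-in; it is not ICC's mechanism, and the names `fill_fn`, `fill_scalar`, `fill_sse2`, `select_fill` and `SKETCH_X86_DISPATCH` are made up for illustration. On other compilers or architectures it simply returns the scalar path.

```c
#include <stddef.h>

/* Assumption: dispatch is only attempted with GCC/Clang on x86. */
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define SKETCH_X86_DISPATCH 1
#include <emmintrin.h> /* SSE2 intrinsics */
#else
#define SKETCH_X86_DISPATCH 0
#endif

typedef void (*fill_fn)(unsigned char *dst, unsigned char value, size_t n);

/* Baseline path: works on any CPU. */
static void fill_scalar(unsigned char *dst, unsigned char value, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = value;
}

#if SKETCH_X86_DISPATCH
/* SSE2 path: compiled for SSE2 even if the rest of the file is not. */
__attribute__((target("sse2")))
static void fill_sse2(unsigned char *dst, unsigned char value, size_t n)
{
    const __m128i v = _mm_set1_epi8((char)value);
    size_t i = 0;
    for (; i + 16 <= n; i += 16)
        _mm_storeu_si128((__m128i *)(dst + i), v);
    for (; i < n; ++i) /* scalar tail */
        dst[i] = value;
}
#endif

/* Pick the best available implementation for the CPU we are running on. */
static fill_fn select_fill(void)
{
#if SKETCH_X86_DISPATCH
    if (__builtin_cpu_supports("sse2"))
        return fill_sse2;
#endif
    return fill_scalar;
}
```

A caller would typically run `select_fill()` once at start-up, store the result, and call through the pointer afterwards, so the feature check is not repeated on every call.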

Of course I'd have to do all that within 30 days, as I don't have the money to purchase a 'full' licence. I decided one path is to do what you recommended, but using GCC 4.5.x instead. However, it's still time-consuming, and I was hoping someone had already gone through part of this. Also, the STD library isn't always shoved (statically linked) into the binary; with MSVC, it'll link to msvcrtxx.dll for most non-trivial functions.

Necrolis 2010-10-08 09:30:41

GCC doesn't do the same optimizations ICC does - that's why there's a copy of ICC for linux and specially compiled Linux kernels that tout the fact they've been compiled with ICC. I didn't mean to say STD but rather STL. STL is always statically linked, IIRC.

Computer Guru 2010-10-08 10:29:55

Answer 7

+1 A:

Here's a fast memcpy implementation in C that can replace the standard library version of memcpy if necessary:

http://www.danielvik.com/2010/02/fast-memcpy-in-c.html

jefd 2010-10-08 09:44:01

It's a nice link; however, his version falls quite far behind: my assembly version runs at almost double the speed, and sometimes more than double (the first set of results is his, under thru-c; the second set is mine, under thru-asm): http://necrolis.pastebin.com/pAFzJYr7

Necrolis 2010-10-08 10:07:38

An SSE Stdlib-esque Library?
