views: 947
answers: 5

I am doing image processing in C that requires copying large chunks of data around memory - the source and destination never overlap.

What is the absolute fastest way to do this on the x86 platform using GCC (where SSE, SSE2 but NOT SSE3 are available)?

I expect the solution will either be in assembly or use GCC intrinsics.

I found the following link but have no idea whether it's the best way to go about it (the author also says it has a few bugs): http://coding.derkeiler.com/Archive/Assembler/comp.lang.asm.x86/2006-02/msg00123.html

EDIT: note that a copy is necessary, I cannot get around having to copy the data (I could explain why but I'll spare you the explanation :))

+2  A: 

If you're on Windows, use the DirectX APIs, which have GPU-optimized routines for graphics handling (and how fast could it be? Your CPU isn't loaded, so it can do something else while the GPU handles the copy).

If you want to be OS agnostic, try OpenGL.

Do not fiddle with assembler: it is all too likely that you'll fail to outperform library engineers with 10+ years of experience.

jpinto3912
I need it to be performed in MEMORY, that is, it cannot happen on the GPU. :) Also, I don't intend to outperform the library functions myself (hence why I ask the question here), but I'm sure there is somebody on stackoverflow who _can_ outperform the libs :) Further, library writers are typically restricted by portability requirements - as I stated, I only care about the x86 platform, so perhaps further x86-specific optimizations are possible.
banister
+1 since it's good first advice to give - even though it does not apply in banister's case.
peterchen
+9  A: 

Courtesy of William Chan and Google. 30-70% faster than memcpy in Microsoft Visual Studio 2005.

void X_aligned_memcpy_sse2(void* dest, const void* src, const unsigned long size)
{

  __asm
  {
    mov esi, src;    //src pointer
    mov edi, dest;   //dest pointer

    mov ebx, size;   //ebx is our counter 
    shr ebx, 7;      //divide by 128 (8 * 128bit registers)


    loop_copy:
      prefetchnta 128[ESI]; //SSE2 prefetch
      prefetchnta 160[ESI];
      prefetchnta 192[ESI];
      prefetchnta 224[ESI];

      movdqa xmm0, 0[ESI]; //move data from src to registers
      movdqa xmm1, 16[ESI];
      movdqa xmm2, 32[ESI];
      movdqa xmm3, 48[ESI];
      movdqa xmm4, 64[ESI];
      movdqa xmm5, 80[ESI];
      movdqa xmm6, 96[ESI];
      movdqa xmm7, 112[ESI];

      movntdq 0[EDI], xmm0; //move data from registers to dest
      movntdq 16[EDI], xmm1;
      movntdq 32[EDI], xmm2;
      movntdq 48[EDI], xmm3;
      movntdq 64[EDI], xmm4;
      movntdq 80[EDI], xmm5;
      movntdq 96[EDI], xmm6;
      movntdq 112[EDI], xmm7;

      add esi, 128;
      add edi, 128;
      dec ebx;

      jnz loop_copy; //loop please
    loop_copy_end:
  }
}

You may be able to optimize it further depending on your exact situation and any assumptions you are able to make.

You may also want to check out the memcpy source (memcpy.asm) and strip out its special case handling. It may be possible to optimise further!
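Since the question asks for GCC, where MSVC-style __asm blocks won't compile, here is a rough sketch of the same idea using SSE2 intrinsics. This is an untested adaptation, not the original code: the function name is made up, and it assumes 16-byte-aligned pointers, a size that is a multiple of 128 bytes, and non-overlapping buffers.

  #include <emmintrin.h>  // SSE2 intrinsics; compile with gcc -msse2

  // Sketch of the same streaming copy, expressed with intrinsics instead of __asm.
  void x_aligned_memcpy_sse2_intrin(void* dest, const void* src, unsigned long size)
  {
      const __m128i* s = (const __m128i*)src;
      __m128i* d = (__m128i*)dest;

      for (unsigned long n = size / 128; n != 0; --n) {
          // prefetch the next 128-byte block with a non-temporal hint
          _mm_prefetch((const char*)s + 128, _MM_HINT_NTA);
          _mm_prefetch((const char*)s + 160, _MM_HINT_NTA);
          _mm_prefetch((const char*)s + 192, _MM_HINT_NTA);
          _mm_prefetch((const char*)s + 224, _MM_HINT_NTA);

          // aligned loads from src into eight XMM registers
          __m128i x0 = _mm_load_si128(s + 0);
          __m128i x1 = _mm_load_si128(s + 1);
          __m128i x2 = _mm_load_si128(s + 2);
          __m128i x3 = _mm_load_si128(s + 3);
          __m128i x4 = _mm_load_si128(s + 4);
          __m128i x5 = _mm_load_si128(s + 5);
          __m128i x6 = _mm_load_si128(s + 6);
          __m128i x7 = _mm_load_si128(s + 7);

          // non-temporal stores bypass the cache, like movntdq above
          _mm_stream_si128(d + 0, x0);
          _mm_stream_si128(d + 1, x1);
          _mm_stream_si128(d + 2, x2);
          _mm_stream_si128(d + 3, x3);
          _mm_stream_si128(d + 4, x4);
          _mm_stream_si128(d + 5, x5);
          _mm_stream_si128(d + 6, x6);
          _mm_stream_si128(d + 7, x7);

          s += 8;
          d += 8;
      }
      _mm_sfence();  // make the non-temporal stores globally visible
  }

A fallback path (or a plain memcpy) is still needed for sizes and alignments this loop doesn't handle.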

Stuart
this is the kind of thing I'm looking for, thanks a lot! :)
banister
Note: the performance of this memcopy will be wildly dependent on the quantity of data to copy and the cache size. For instance, the prefetches and non-temporal moves may bog down performance for smaller copies (ones that fit into L2) compared to regular movdqa stores.
RaphaelSP
banister: don't forget to mail him that you used his code in your project ;) [ http://williamchan.ca/portfolio/assembly/ssememcpy/source/viewsource.php?id=readme.txt ]
ardsrk
+1, THANK YOU!!!
Tim Post
I remember first reading this code in an AMD64 manual. And the code isn't optimal on Intel, where it has cache bank aliasing issues.
drhirsch
+2  A: 

If you are targeting Intel processors specifically, you might benefit from IPP. If you know it will run with an Nvidia GPU, perhaps you could use CUDA - in both cases it may be better to look wider than optimising memcpy(): they provide opportunities for improving your algorithm at a higher level. Both, however, are reliant on specific hardware.
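For the IPP route, a minimal sketch of what the call could look like (assuming the IPP signal-processing library is installed and linked; ippsCopy_8u copies a run of bytes):

  #include <ipps.h>  // Intel IPP signal-processing header

  // Sketch: copy `size` bytes with IPP's vectorised copy routine.
  void ipp_copy(void* dest, const void* src, int size)
  {
      ippsCopy_8u((const Ipp8u*)src, (Ipp8u*)dest, size);
  }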

Clifford
A: 

At any optimisation level of -O1 or above, GCC will use builtin definitions for functions like memcpy - with the right -march parameter (-march=pentium4 for the set of features you mention) it should generate pretty optimal architecture-specific inline code.

I'd benchmark it and see what comes out.
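For example, a minimal sketch of what to benchmark - a plain memcpy call compiled with the flags mentioned above:

  #include <string.h>

  // With e.g. `gcc -O2 -march=pentium4 -c copy.c`, GCC may expand this
  // inline with architecture-specific code instead of calling into libc.
  void copy_block(void* dest, const void* src, size_t n)
  {
      memcpy(dest, src, n);  // non-overlapping buffers, as the question states
  }

Compare the generated code (gcc -S) and the timings against the hand-written SSE2 version.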

caf
+1  A: 

The SSE code posted by hapalibashi is the way to go.

If you need even more performance and don't shy away from the long and winding road of writing a device driver: all important platforms nowadays have a DMA controller that can do a copy job faster than the CPU, and in parallel with it.

That involves writing a driver though. No big OS that I'm aware of exposes this functionality to the user-side because of the security risks.

However, it may be worth it (if you need the performance) since no code on earth could outperform a piece of hardware that is designed to do such a job.

Nils Pipenbrinck