The following function is instantiated with two types for T. One is 4 bytes (the same size as a DWORD); the other is 64 bytes. The Buffer is designed to store objects in its cache so they can be retrieved later.

When using the 64-byte structure, the total time spent in the function jumps dramatically. The seek time in the vector, the memcpy, and even the time in the "SELF" part of the function all rise sharply.

The store function is essentially the inverse of the code below, and it does not seem to suffer the same asymmetric timing.

Any ideas as to why?

template <class T>
void Buffer::retrieve ( T& Value )
   {
   int nTypeSize  = sizeof ( T );
   int nDWORDSize = sizeof ( DWORD );

   /*
    * Number of DWORDs needed to store this value, rounded up.
    */

   int nDWORDCount = ( nTypeSize + nDWORDSize - 1 ) / nDWORDSize;

   if ( m_nReadPosition + nDWORDCount > m_nSize )   // '>' so a read ending exactly at m_nSize is allowed
      return;

   memcpy ( &Value, &m_Cache[m_nReadPosition], nTypeSize );  // m_Cache is a DWORD vector.
   m_nReadPosition += nDWORDCount;
   }
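
For reference, the two instantiations look roughly like this (Payload64 and its layout are hypothetical stand-ins; the actual 64-byte struct was not posted):

struct Payload64           // hypothetical stand-in for the real 64-byte type
   {
   DWORD Data[16];         // 16 DWORDs * 4 bytes = 64 bytes
   };

Buffer buffer;
DWORD     dwSmall;         // 4-byte case:  nDWORDCount == 1
Payload64 Large;           // 64-byte case: nDWORDCount == 16
buffer.retrieve ( dwSmall );
buffer.retrieve ( Large );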
+2  A: 

The increase in memcpy time could simply be down to copying more bytes, since memcpy time scales (naively) with the amount of memory being copied. The scaling isn't necessarily linear, as some implementations of memcpy optimize by copying 32 or 64 bits at a time.

The lookup in the std::vector shouldn't scale with the object size, as neither m_nReadPosition nor m_Cache depends on T.

You will see some slowdown in any code that manipulates T, though, as a 4-byte structure can be held in a register on a 32-bit processor, while anything larger is more complicated for the compiler to deal with. It's possible this is adding some overhead.

How much does the total time increase by? If it's around a factor of 16, I'd put it down purely to the change in size of T.
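
One way to sanity-check that scaling is to time raw memcpy calls of both sizes in isolation, away from the Buffer code entirely. A minimal sketch, assuming nothing about the real types (the iteration count and the volatile sink are arbitrary choices to keep the compiler honest):

#include <cstddef>
#include <cstdio>
#include <cstring>
#include <ctime>

volatile char g_Sink;   // read from the destination so the copies aren't elided

template <std::size_t N>
double TimeCopies ( std::size_t nIterations )
   {
   char Src[N] = { 0 };
   char Dst[N];
   std::clock_t nStart = std::clock ();
   for ( std::size_t i = 0; i < nIterations; ++i )
      {
      std::memcpy ( Dst, Src, N );
      g_Sink = Dst[0];
      }
   return double ( std::clock () - nStart ) / CLOCKS_PER_SEC;
   }

int main ()
   {
   std::printf ( "4 bytes : %.3f s\n", TimeCopies<4>  ( 100000000 ) );
   std::printf ( "64 bytes: %.3f s\n", TimeCopies<64> ( 100000000 ) );
   return 0;
   }

If the 64-byte case comes out around 16x slower here too, the function-level numbers are explained by copy volume alone.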

dominic hamon
... optimize by copying 32 or 64 bits at a time. Or even better, in some cases the implementation will make use of vector instructions present in the processor.
David Rodríguez - dribeas
Only if the data is aligned properly.
Crashworks
Typically, what you see in the assembly at the call site of memcpy() for 4 bytes is a single mov instruction. The call is optimized away by the compiler, as it recognizes what you're trying to do.
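
In other words, once the intrinsic kicks in, the 4-byte case degenerates to a plain assignment; with optimizations on, both of these lines would typically compile to the same load/store pair:

DWORD dwValue;
memcpy ( &dwValue, &m_Cache[m_nReadPosition], sizeof ( DWORD ) );  // intrinsic memcpy of 4 bytes...
dwValue = m_Cache[m_nReadPosition];                                // ...same generated code as this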
Andreas Magnusson
A: 

I'd be suspicious of the profiler attributing time to the functions called from here: std::vector's accessor is an inline, and memcpy() is probably a compiler intrinsic, which means that in an optimized release build (you are timing a release build, right?) the bulk of their work would get attributed to the calling function.

So, given that, I'd run some controlled experiments to localize the slowdown. For example, the most likely culprit for the bulk of the CPU time here is the memcpy(), so try taking it out of the equation temporarily:

volatile DWORD g_dummy;

template <class T>
void Buffer::retrieve ( T& Value )
   {
   /* ... */
   // memcpy ( &Value, &m_Cache[m_nReadPosition], nTypeSize );
   g_dummy += m_Cache[m_nReadPosition]; // force the compiler to perform the vector lookup
   m_nReadPosition += nDWORDCount;
   }

and see how much of the slowdown that copy really accounts for.
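
A complementary experiment, under the same setup, is to isolate the copy instead: keep the memcpy but read from a fixed local buffer, taking the vector lookup out of the equation (LocalScratch is a made-up name for illustration):

static DWORD LocalScratch[16];                  // fixed 64-byte source, bypasses the vector
// memcpy ( &Value, &m_Cache[m_nReadPosition], nTypeSize );
memcpy ( &Value, LocalScratch, nTypeSize );     // copy cost without the vector access
m_nReadPosition += nDWORDCount;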

Crashworks
We're not timing a release build. The release build has the same problem (per the profiler), but inlines all the code in the function. The detailed timing is from a debug build, to attempt to isolate the problem. Interesting idea to isolate the memcpy and treat the value as a DWORD, though. We'll try that and post the results tomorrow.
Timing a non-release build gives you no useful information at all. The optimizations made by the compiler will radically change the performance characteristics of your program and invalidate whatever you learned from profiling debug builds.
Crashworks
+1  A: 

If you have a std::vector, why are you using memcpy? I would recommend using std::copy or std::vector's own methods. It's largely a stylistic change, but it does guarantee that element assignment is used rather than a raw byte copy.
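
A minimal sketch of the std::copy version, assuming T is trivially copyable and its size is an exact multiple of sizeof(DWORD) (otherwise the last element would write past the end of Value):

#include <algorithm>  // for std::copy

std::copy ( m_Cache.begin () + m_nReadPosition,
            m_Cache.begin () + m_nReadPosition + nDWORDCount,
            reinterpret_cast<DWORD*> ( &Value ) );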

As to why things are going slow, I'm not sure:

  • The seek time in the vector,

This is odd, because the vector occupies contiguous memory, and seek time into memory is constant. That is, if I'm pointing at the first item in the vector, moving from it to the second item takes as much time as moving from it to the last item (via +, +=, or std::advance).

If you were using a std::list, seek time would show up. But it shouldn't in a vector, and it definitely shouldn't in a memcpy operating on the raw memory of the vector.

  • the memcpy

You're copying 16 times as much data (64 bytes vs. 4 bytes).

  • and even the time in the "SELF" part of the function all rise dramatically.

This is also odd, as

int nTypeSize  = sizeof ( T );
int nDWORDSize = sizeof ( DWORD );
int nDWORDCount = ( nTypeSize + nDWORDSize - 1 ) / nDWORDSize;

are all compile-time constants and should be pre-computed by the compiler for each type T.

if ( m_nReadPosition + nDWORDCount > m_nSize )

and

m_nReadPosition += nDWORDCount;

are the only lines in Buffer::retrieve that are actually executed at run time (other than the memcpy). Unless, that is, the increase is simply due to double-counting the memcpy (once under the heading "memcpy" and once under the heading "Buffer::retrieve").
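
One caveat to the compile-time-constant point: per the comments above, the detailed timing came from a debug build, where those locals really are computed on every call. A sketch of forcing them to fold even without optimization, using enum constants (which have no run-time storage):

template <class T>
void Buffer::retrieve ( T& Value )
   {
   enum { nTypeSize   = sizeof ( T ),
          nDWORDCount = ( nTypeSize + sizeof ( DWORD ) - 1 ) / sizeof ( DWORD ) };

   if ( m_nReadPosition + nDWORDCount > m_nSize )
      return;

   memcpy ( &Value, &m_Cache[m_nReadPosition], nTypeSize );
   m_nReadPosition += nDWORDCount;
   }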


Things to watch out for with a std::vector are piecemeal memory allocations and unneeded copies. You're not doing either in the sample code, though.
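
(For what it's worth, the usual guard against piecemeal allocation is a single up-front reserve; nExpectedDWORDCount below is a hypothetical figure:)

m_Cache.reserve ( nExpectedDWORDCount );  // one allocation up front instead of repeated regrowth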

Max Lybbert