In our application we are running on a dual Xeon server with memory configured as 12gb local to each processor and a memory bus connecting the two Xeon's. For performance reasons, we want to control where we allocate a large (>6gb) block of memory. Below is simplified code -
DWORD processorNumber = GetCurrentProcessorNumber();
UCHAR nodeNumber = 255;
GetNumaProcessorNode((UCHAR)processorNumber, &nodeNumber );
// get amount of physical memory available of node.
ULONGLONG availableMemory = MAXLONGLONG;
GetNumaAvailableMemoryNode(nodeNumber, &availableMemory )
// make sure that we don't request too much. Initial limit will be 75% of available memory
_allocateAmt = qMin(requestedMemory, availableMemory * 3 / 4);
// allocate the cached memory region now.
HANDLE handle = (HANDLE)GetCurrentProcess ();
cacheObject = (char*) VirtualAllocExNuma (handle, 0, _allocateAmt,
MEM_COMMIT | MEM_RESERVE ,
PAGE_READWRITE| PAGE_NOCACHE , nodeNumber);
The code as is, works correctly using VS2008 on Win 7/64.
In our application this block of memory functions as a cache store for static objects (1-2mb ea) that are normally stored on the hard drive. My problem is that when we transfer data into the cache area using memcpy, it takes > 10 times as longer than if we allocate memory using new char[xxxx]
. And no other code changes.
We are at a loss to understand why this is happening. Any suggestions as to where to look?