views:

126

answers:

2

I'm designing a real-time system that occasionally has to duplicate a large amount of memory. The memory consists of non-tiny regions, so I expect the copying performance will be fairly close to the maximum bandwidth the relevant components (CPU, RAM, motherboard) can sustain. This led me to wonder what kind of raw memory bandwidth a modern commodity machine can muster.

My aging Core2Duo gives me 1.5 GB/s if I memcpy() with one thread (and, understandably, less if both cores memcpy() simultaneously). While 1.5 GB/s is a fair amount of bandwidth, the real-time application I'm working on will have something like 1/50th of a second, which means 30 MB. Basically, almost nothing. And perhaps worst of all, as I add cores, I can process a lot more data without any increase in performance for the needed duplication step.

But a low-end Core2Duo isn't exactly hot stuff these days. Are there any sites with information, such as actual benchmarks, on raw memory bandwidth of current and near-future hardware?

Furthermore, for duplicating large amounts of data in memory, are there any shortcuts, or is memcpy() as good as it will get?

Given a bunch of cores with nothing to do but duplicate as much memory as possible in a short amount of time, what's the best I can do?

+2  A: 

On newer CPUs such as Intel's Nehalem, and on AMD's since the Opteron, memory is "local" to one CPU, where a single CPU may have multiple cores. That is, it takes a certain amount of time for a core to access the local memory attached to its CPU, and more time for the core to access remote memory, where remote memory is memory local to other CPUs. This is called non-uniform memory access, or NUMA. For the best memcpy performance, you want to set your BIOS to NUMA mode, pin your threads to cores, and always access local memory. Find out more about NUMA on Wikipedia.

Unfortunately I do not know of a site or recent papers on memcpy performance on recent CPUs and chipsets. Your best bet is probably to test it yourself.

As for memcpy() performance, there are wide variations, depending on the implementation. The Intel C library (or possibly the compiler itself) has a memcpy() that is much faster than the one provided with Visual Studio 2005, for instance. At least on Intel machines.

The best memory copy you will be able to do depends on the alignment of your data, whether you are able to use vector instructions, the page size, etc. Implementing a good memcpy() is surprisingly involved, so I recommend finding and testing as many implementations as possible before writing your own. If you know specifics about your copy, such as alignment and size, you may be able to implement something faster than Intel's memcpy(). If you want to get into the details, start with the Intel and AMD optimization guides, or Agner Fog's software optimization pages.

mch
While generally informative, some aspects of this answer seem to miss the specifics I mention, for instance the fact that I explicitly say I am copying large chunks of memory. That means caches aren't of much interest: the memory bottleneck in this case is basically from RAM and, well, back into RAM, right? The way I read your note on "local" memory is that you are talking about cache. Or are there parts of main RAM that are "local" to cores? I haven't heard of such a thing, but feel free to correct me.
No, I was not referring to the cache when I was talking about local memory. In NUMA systems, different banks of RAM are physically attached to different CPU sockets. In a two-socket system, half of the RAM is physically attached to the first socket, and half is attached to the second socket. If a core in the second socket has to access memory attached to the first socket, the data must travel a longer path, going first through the first socket. I'll add a link to the write-up.
mch
@mch perhaps you should edit your answer, as it refers to cores and not CPU sockets.
Alexandre Jasmin
Yeah, it's the core part that confused me as well.
I really did mean socket, but I can see how it is confusing, and perhaps I should have said 'CPU' and not 'socket'. I think the problem is that I'm thinking of a multi-CPU system, where each CPU has multiple cores, which is probably not familiar to a lot of people who only have a single CPU. Say each CPU has two cores. If your system has two CPUs, you have four cores total. It is not the cores that have local memory, it is the *CPU*, so a machine with 16 GB RAM will generally have 8 GB local to each CPU.
mch
Note that NUMA isn't relevant if you don't have multiple CPUs. If you only have one CPU, all memory is 'local' to all the cores on that CPU, even if you have a 6-core Core i7 Extreme.
mch
@mch I get all that. Can you edit the answer? I think the part *the memory is "local" to one core* should be clarified
Alexandre Jasmin
@Alexandre Jasmin, thanks for pointing out that problem. I hope it's clearer now.
mch
+1  A: 

I think you're approaching the problem the wrong way. The goal, I assume, is to export a consistent snapshot of your data without destroying your real-time performance. Don't use hardware, use an algorithm.

What you want to do is define a journaling system on top of your data. When you start your in-memory transfer, you have two threads: the original that does work and thinks it is modifying the data (but is actually only writing to the journal), and a new thread that copies the old (unjournaled) data to a separate spot so it can slowly write it out.

When the new thread is done, you put it to work merging the data set with the journal until the journal is empty. When it's complete, the old thread can go back to interacting directly with the data instead of reading/writing through the journal-modified version.

Finally, the new thread can go over to the copied data and slowly stream it out to a remote source.

If you set up a system like this, you can get essentially instant snapshotting of arbitrarily large amounts of data in a running system, as long as you can finish the in-memory copy before the journal gets so full that the real-time system can't keep up with its processing demands.

Rex Kerr
I actually have an algorithm that solves the problem that initially led to my question. My question, however, was about the actual performance of memory copying, not the various ways one can avoid moving memory.