We like to think that a memory access is fast and constant, but on modern architectures/OSes, that's not necessarily true.

Consider the following C code:

int i = 34;
int *p = &i;

// do something that may or may not involve i and p

{...}

// 3 days later:

*p = 643;

What is the estimated cost of this last assignment in CPU instructions, if

  • i is in L1 cache,
  • i is in L2 cache,
  • i is in L3 cache,
  • i is in RAM proper,
  • i is paged out to an SSD disk,
  • i is paged out to a traditional disk?

Where else can i be?

Of course the numbers are not absolute, but I'm only interested in orders of magnitude. I tried searching the webs, but Google did not bless me this time.

+6  A: 

Here are some hard numbers demonstrating that exact timings vary from CPU family to family and from version to version: http://www.agner.org/optimize/

These numbers are a good guide:

L1           1 ns
L2           5 ns
RAM         83 ns
Disk  13700000 ns

And as an infographic to give you the orders of magnitude:

Click for the big view (source: http://news.ycombinator.com/item?id=702713)
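
If you want rough numbers for your own machine rather than someone else's table, a dependent-load (pointer-chasing) loop is the usual measuring trick, since each load can't start until the previous one finishes. Here's a minimal sketch; the 16 MiB buffer and the step count are arbitrary assumptions, and you'd shrink the buffer to fit in L1/L2/L3 to measure those levels instead:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    /* Buffer deliberately larger than most L3 caches. */
    size_t n = 16 * 1024 * 1024 / sizeof(size_t);
    size_t *next = malloc(n * sizeof *next);
    if (!next) return 1;

    /* Build one big random cycle (Sattolo's algorithm) so the
       walk below visits every slot in an unpredictable order. */
    for (size_t k = 0; k < n; k++) next[k] = k;
    srand(1);
    for (size_t k = n - 1; k > 0; k--) {
        size_t j = (size_t)rand() % k;
        size_t t = next[k]; next[k] = next[j]; next[j] = t;
    }

    size_t steps = 10 * 1000 * 1000, idx = 0;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t s = 0; s < steps; s++)
        idx = next[idx];          /* each load depends on the last */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%.1f ns per access (idx=%zu)\n", ns / steps, idx);
    free(next);
    return 0;
}

Compile with optimization (e.g. gcc -O2) or loop overhead will swamp the numbers; printing idx keeps the compiler from deleting the loop.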

Will
Is it just me or is that image just one big red honkin' rectangle?
paxdiablo
@paxdiablo yeah, you really need to click on it to see the top left, where the L1 is
Will
+1  A: 

It could also be in a CPU register. The C/C++ keyword "register" tells the CPU to keep the variable in a register, but you can't guarantee it will stay there, or even ever get there.
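
For illustration, a small sketch in C, where this shows up directly (the function itself is hypothetical):

int sum(const int *a, int n) {
    register int s = 0;           /* hint: keep the accumulator in a register */
    for (int i = 0; i < n; i++)
        s += a[i];
    /* int *q = &s;   <-- would not compile: in C, taking the address
       of a register variable is a constraint violation, so s could
       never be the target of a pointer like p in the question. */
    return s;
}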

gotb
The C/C++ keyword "register" _suggests_ to the _compiler_ that it should keep the variable in a register!!!
paxdiablo
If you take its address, it can no longer be kept in a register.
Berkus
A: 

Where else can i be?

p and *p are different things, and both of them can be located in any of the locations on your list. The pointer value might additionally still be in a CPU register when the assignment is made, so it doesn't need to be fetched from RAM/cache/…

Regarding performance: this is highly CPU-dependent. Thinking in orders of magnitude, accessing RAM is worse than accessing cache entries, and accessing swapped-out pages is the worst. All of these are somewhat unpredictable because they depend on other factors as well (e.g. other processors, depending on the system architecture).
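
To make that distinction concrete, here's a small sketch in which p and *p end up in different places (the heap allocation is just an assumption for illustration):

#include <stdlib.h>

int main(void) {
    int *p = malloc(sizeof *p);  /* *p lives on the heap: RAM, cache, or swap */
    if (!p) return 1;
    /* p itself is a local; for the store below the compiler will almost
       certainly keep its value in a CPU register, so the assignment is
       one store to wherever *p currently resides, with no extra fetch
       of p from memory. */
    *p = 643;
    free(p);
    return 0;
}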

Alexander Gessler
A: 

As long as the cache/RAM/hard disk/SSD is not busy serving other accesses (e.g. DMA requests) and the hardware is reasonably reliable, the cost is still constant (though it may be a large constant).

When you get a cache miss and have to page in from the hard disk to read the variable, it's just a single disk read request, but the cost is huge: the CPU has to trap into the kernel, the kernel sends a read request to the hard disk, waits for the disk to write the data into RAM, and then the data is read from RAM into the cache and into a register. However, this cost is still a constant.

The actual numbers and proportions will vary depending on your hardware and on how well its parts match each other (e.g. if your CPU runs at 2000 MHz but your RAM delivers data at 333 MHz, the two don't sync very well). The only way you can figure this out is to test it in your program.

And this is not premature optimization; this is micro-optimization. Let the compiler worry about these kinds of details.
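
For example, one way to "test it in your program" is to time a first (cold) pass over a freshly allocated buffer against a second (warm) pass; the 64 MiB size here is an arbitrary assumption:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Touch one byte per 64-byte cache line and return the elapsed time. */
static double pass_ns(volatile char *buf, size_t n) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t k = 0; k < n; k += 64)
        buf[k]++;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
}

int main(void) {
    size_t n = 64 * 1024 * 1024;
    char *buf = malloc(n);
    if (!buf) return 1;
    printf("cold: %.0f ns\n", pass_ns(buf, n));  /* pays the page faults */
    printf("warm: %.0f ns\n", pass_ns(buf, n));  /* pages already mapped */
    free(buf);
    return 0;
}

The gap between the two runs gives you a feel for how large those "constant" costs are on your own system.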

Lie Ryan
A sub-optimal memory access pattern isn't something the compiler can sort out for you. You need to write the code so that the memory is accessed in the best way. If you care at all about the performance of the result, this isn't premature optimization! It's just saving yourself some time by not doing the wrong thing in the first place. (Also, cache misses are unrelated to the hard disk.)
brone
Yes, it is not premature optimization; it's micro-optimization. And it *is* the job of the compiler (and CPU) to rearrange instructions to minimize cache misses (and page misses).
Lie Ryan
But it's not the compiler's job to, say, compile a B-tree when you coded a binary tree. A compiler is not some magic pixie dust you can sprinkle on your code to make it fast.
Jurily
Most optimizing compilers, and to some extent the CPU, already do that (rearrange instructions for an optimal memory access pattern). Rearranging bytecode/assembly instructions is a mechanical process, which is why a compiler/CPU can do it. Just choose a good algorithm and a good data structure, and let the compiler (and CPU) handle this sort of micro-optimization. (There are memory access optimizations that are not micro-optimizations: particularly with large arrays, choosing a cache-friendly algorithm helps a lot. But in the case being presented, it is micro-optimization.)
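
(For illustration, a sketch of the non-micro kind: summing a large matrix row by row versus column by column. Same algorithm and data structure, but the column version touches a new cache line on almost every access and typically runs several times slower; the 4096 size is an arbitrary assumption.)

enum { N = 4096 };

/* Row-major walk: consecutive accesses share a cache line. */
long sum_rows(int (*m)[N]) {
    long s = 0;
    for (int r = 0; r < N; r++)
        for (int c = 0; c < N; c++)
            s += m[r][c];
    return s;
}

/* Column-major walk: every access strides 16 KiB ahead, so it
   misses the cache (and, for large N, the TLB) far more often. */
long sum_cols(int (*m)[N]) {
    long s = 0;
    for (int c = 0; c < N; c++)
        for (int r = 0; r < N; r++)
            s += m[r][c];
    return s;
}

int main(void) {
    static int m[N][N];          /* ~64 MiB, zero-initialized */
    return (int)(sum_rows(m) + sum_cols(m));
}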
Lie Ryan
+3  A: 

Norvig has some values from 2001. Things have changed some since then, but I think the relative speeds are still roughly correct.

Ken
A: 

These numbers change all the time. But for rough estimates for 2010, Kathryn McKinley has nice slides on the web, which I don't feel compelled to copy here.

The search term you want is "memory hierarchy" or "memory hierarchy cost".

Norman Ramsey