ansaurus

Question

Answer 1

+4 A:

Well yeah that does look like it will mainly be L1 cache misses.

10 cycles for an L1 cache miss does sound about reasonable, probably a little on the low side.

A read from RAM is going to take of the order of 100s or may be even 1000s (Am too tired to attempt to do the maths right now ;)) of cycles so its still a huge win over that.

Goz 2009-07-14 16:30:04

"a little on the low side" - with 80K of data and 32K of L1, you'd be disappointed if every fetch missed cache, so a little low makes sense to me.

Steve Jessop 2009-07-14 17:13:47

good point .. and the fact that the order has been randomised means that there must be about 50/50 cache misses to hits. Of course it'd be nice and easy to come up with a read pattern that would mean every access missed :)

Goz 2009-07-14 17:18:20

I agree - good point. If the cache is 32K and it is largely dedicated to holding the array, then maybe 40% of the references would be hits. So a 60% miss rate would take the cost up to about 17 cycles per miss (again assuming my math is correct).

Mark Wilkins 2009-07-14 17:33:17

http://www.sandpile.org/impl/p4.htm suggests that the latency for an L2 Cache read from a 90 to 65nm P4 is between 18 and 20 cycles. So Mark's quick calculation above appears pretty spot on :)

Goz 2009-07-15 10:54:43

In fact assuming 18 cycles per miss and plugging that in that gives us a value of around 56.3% L1 cache misses and assuming 20 cycles gives us a value of 50.6% L1 Cache misses.

Goz 2009-07-15 11:16:53

Answer 2

A:

It's difficult to say anything for sure without a lot more testing, but in my experience that scale of difference definitely can be attributed to the CPU L1 and/or L2 cache, especially in a scenario with randomized access. You could probably make it even worse by ensuring that each access is at least some minimum distance from the last.

Tim Sylvester 2009-07-14 16:31:15

Answer 3

+5 A:

You should take a read of "What every programmer should know about memory" by Ulrich Drepper - it goes deep into the timing of memory access, and access-pattern and cache interactions.

caf 2009-07-15 08:03:54

Thanks for that link. It looks like a useful document. If I understand it correctly, it points out that read miss (in the level 1 cache) would maybe incur 10 extra cycles and a write miss about 18 cycles with current architectures. So the ballpark numbers that are coming up in this whole thread seem to fit pretty well.

Mark Wilkins 2009-07-15 17:41:07

Answer 4

A:

The easiest thing to do is to take a scaled photograph of the target cpu and physically measure the distance between the core and the level-1 cache. Multiply that distance by the distance electrons can travel per second in copper. Then figure out how many clock-cycles you can have in that same time. That's the minimum number of cpu cycles you'll waste on a L1 cache miss.

You can also work out the minimum cost of fetching data from RAM in terms of the number of CPU cycles wasted in the same way. You might be amazed.

Notice that what you're seeing here definitely has something to do with cache-misses (be it L1 or both L1 and L2) because normally the cache will pull out data on the same cache line once you access anything on that cache-line requiring less trips to RAM.

However, what you're probably also seeing is the fact that RAM (even though it's calls Random Access Memory) still preferres linear memory access.

Jasper Bekkers 2009-07-15 08:16:17

<pendant> The speed of an electron does not relate to the speed of the current / voltage. Electrons move really slowly. </pedant>

Skizz 2009-07-15 08:28:01

Yeah, it's more to do with capacitance and how long the ringing takes to settle down.

Crashworks 2009-07-15 08:31:12

@Skizz, could you show me how to convert those units into seconds so I can work that into the answer?

Jasper Bekkers 2009-07-15 08:45:20

The very least you could do is include the speed of an electrical wave in copper, which is IIRC about 0.6c (close enough for this purpose)

MSalters 2009-07-15 14:12:41

Answer 5

+3 A:

3.2ns for an L1 cache miss is entirely plausible. For comparison, on one particular modern multicore PowerPC CPU, an L1 miss is about 40 cycles -- a little longer for some cores than others, depending on how far they are from the L2 cache (yes really). An L2 miss is at least 600 cycles.

Cache is everything in performance; CPUs are so much faster than memory now that you're really almost optimizing for the memory bus instead of the core.

Crashworks 2009-07-15 08:27:51

Answer 6

+3 A:

While I can't offer an answer to whether or not the numbers make sense (I'm not well versed in the cache latencies, but for the record ~10 cycle L1 cache misses sounds about right), I can offer you Cachegrind as a tool to help you actually see the differences in cache performance between your 2 tests.

Cachegrind is a Valgrind tool(the framework that powers the always-lovely memcheck) which profiles cache and branch hits/misses. It will give you an idea of how many cache hits/misses you are actually getting in your program.

Falaina 2009-07-15 08:28:53

Very nice. Thanks for the pointer to it. I've been aware of Valgrind but haven't used it before (most of my development is on Win32). I just now ran it on a Linux box and it reported a 41% miss rate for the "random" portion of the test. And the "in order" portion of the test had a negligible miss rate. Neither portion had any L2 miss rate to speak of.

Mark Wilkins 2009-07-15 15:34:24

Answer 7

A:

Some numbers for a 3.4GHz P4 from a Lavalys Everest run:

the L1 dcache is 8K (cacheline 64 bytes)
L2 is 512K
L1 fetch latency is 2 cycles
L2 fetch latency is about double what you are seeing: 20 cycles

More here: http://www.freeweb.hu/instlatx64/GenuineIntel0000F25_P4_Gallatin_MemLatX86.txt

(for the latencies look at the bottom of the page)

terminus 2009-07-15 14:06:08

ansaurus

tags:

views:

answers:

What is the Cost of an L1 Cache Miss?

related questions