views: 106
answers: 2
I'm studying JIT design with regard to dynamic-language VM implementation. I haven't done much assembly since the 8086/8088 days, just a little here and there, so be nice if I'm off base.

As I understand it, the x86 (IA-32) architecture still has the same basic limited register set it always did, but the internal register count has grown tremendously; those internal registers are not generally available, though, and are used with register renaming to achieve parallel pipelining of code that otherwise could not be parallelized. I understand this optimization pretty well, but my feeling is that while these optimizations help overall throughput and parallel algorithms, the limited register set we are still stuck with results in more register-spilling overhead, such that if x86 had double or quadruple the registers available to us, there might be significantly fewer push/pop opcodes in a typical instruction stream. Or are there other processor optimizations that also optimize this away that I am unaware of? Basically, if I have a unit of code with 4 registers to work with for integer work but a dozen variables, I've potentially got a push/pop for every 2 or so instructions.
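To make that concrete, here's roughly the kind of unit I have in mind (a hypothetical C sketch; the names are made up, and how much actually spills depends on the compiler and calling convention):

/* With only a handful of integer registers free for computation on
 * 32-bit x86, a compiler cannot keep all twelve of these values in
 * registers at once, so some of them live on the stack and get
 * reloaded between uses. */
int mix(int a, int b, int c, int d, int e, int f,
        int g, int h, int i, int j, int k, int l)
{
    int t1 = a * b + c;
    int t2 = d * e + f;
    int t3 = g * h + i;
    int t4 = j * k + l;
    return (t1 ^ t2) + (t3 ^ t4);   /* all four temporaries still live here */
}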

Any references to studies, or better yet, personal experiences?

EDIT: x86_64 has 16 general-purpose registers, double that of x86-32; thanks for the correction and info.

+2  A: 

In addition to renaming registers to hide bubbles due to instruction latencies, most modern x86 implementations are smart enough to track pushes and pops and rename those onto registers as well. Remember that the instruction decoder on the x86 actually performs a sort of JIT compilation, turning the x86 instruction stream into a small micro-op program stored in the trace cache. Part of this process includes intercepting small-offset stack loads and turning those into register accesses as well. Thus something like the following (patently silly, and purely for example):

mov eax, [ebp]      ; load a local into eax
mov ebx, [ebp+4]    ; load another local into ebx
add eax, [edx]      ; add a value read from memory
push eax            ; spill eax to the stack
mov eax, [ebp+8]    ; load a third local into eax
add eax, ebx
pop ebx             ; reload the spilled value into ebx
add eax, ebx

cooks into something like the following (pretend the internal registers are named, say, r0..r15):

lw  r1, ebp         ; eax = [ebp]
lw  r2, ebp+4       ; ebx = [ebp+4]  (the '+4' stored as an immediate operand)
lw  r3, edx         ; the add's memory operand becomes its own load
add r1, r3          ; eax += [edx]
or  r4, r1, r1      ; 'push eax' renamed onto a shadow register: move r1 to r4
lw  r1, ebp+8       ; eax = [ebp+8]
add r1, r2          ; eax += ebx
or  r2, r4, r4      ; 'pop ebx' renamed: move r4 back to r2
add r1, r2          ; eax += ebx

Of course a magically smart decoder (unlike the one that actually fits into the transistor budget) would collapse some of the unnecessary moves there, but the point I am making is that pushes/pops and loads/stores to esp+(some small offset) actually get turned into shadow-register accesses.

Crashworks
Thanks CrashWorks, great answer. Do you have a good reference for this? I have several architecture books and none of them mention this, but it was my suspicion that something like this was going on.
mrjoltcola
You can infer much of this information from chapter 2 of Intel's optimization manual ( http://www.intel.com/products/processor/manuals/ ). You can also run some controlled experiments to try to figure out some of the "black box" internals. And you could always go poking through Intel's patents: after all, the purpose of a patent is that they have to tell you how it works! #5740414 might be a place to start.
Crashworks
Thanks, good points. First step is to order newer manuals than my 15 year old ones. :)
mrjoltcola
That looks more like PPC than x86...
Jens Björnhager
Yes, internally the x86 instruction stream gets decoded to something more like RISC, and it's the decoded trace that actually gets stored in the instruction cache and executed.
Crashworks
+1  A: 

Two points:

(1) x86-64 doubles the number of registers to 16

(2) In modern x86 CPUs, an instruction that uses a memory location that is already in L1 cache is nearly as fast as the same operation with a register operand, so you can almost think of L1 as "register memory".
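As a rough illustration of (2), a sketch only (hypothetical names; real numbers vary by microarchitecture and compiler): when the working set fits in L1, the compiler can leave the data in memory and fold the load into the arithmetic instruction at little cost.

#include <stddef.h>

enum { N = 1024 };       /* 4 KB table: comfortably inside a typical 32 KB L1 */
static int table[N];

/* A 32-bit x86 compiler may emit something like
 *     add eax, [table + ecx*4]
 * inside this loop; because the loads hit L1, it runs at close to
 * the speed of an add with a register operand. */
int sum_table(void)
{
    int sum = 0;
    for (size_t i = 0; i < N; ++i)
        sum += table[i];      /* memory operand, but L1-resident */
    return sum;
}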

Paul R
Re: (2) -- I've timed the latency of an L1 *hit* on my i7 at about 9 cycles, which (astonishingly) is actually slower than the Core Duo.
Crashworks
How did you time that? It should only be 2 cycles on a Core i7, I think, which should not be a problem on a reasonably well-scheduled instruction stream.
Paul R
Re: (1), time to update my hardware manuals. I have 386, 486 and Pentium but nothing newer. Thanks to both you guys for the answers.
mrjoltcola
Sorry, I had my numbers mixed up in my notes -- that's actually the latency for the i5. We just used the cache latency tool from cpuid.com.
Crashworks
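For anyone who wants to reproduce the latency numbers discussed in these comments: load-to-use latency is usually measured with a dependent pointer-chasing loop. Below is a minimal sketch; the buffer size, iteration count, and use of clock() are arbitrary choices, and a careful measurement (like the cpuid.com tool mentioned above) would use cycle counters, pin the thread, and control for TLB and prefetch effects.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define SIZE  (4096 / sizeof(void *))   /* 4 KB buffer: fits in L1 */
#define ITERS 100000000UL

int main(void)
{
    void **buf = malloc(SIZE * sizeof(void *));
    if (!buf)
        return 1;

    /* Build a cyclic chain: buf[i] points to buf[i+1]. */
    for (size_t i = 0; i < SIZE; ++i)
        buf[i] = &buf[(i + 1) % SIZE];

    /* Each load depends on the previous one, so the loop time is
     * dominated by load latency, not throughput. */
    void **p = buf;
    clock_t start = clock();
    for (unsigned long i = 0; i < ITERS; ++i)
        p = (void **)*p;
    clock_t end = clock();

    /* Print p so the compiler cannot optimise the chain away. */
    printf("%p: %.2f ns per load\n", (void *)p,
           1e9 * (double)(end - start) / CLOCKS_PER_SEC / ITERS);
    free(buf);
    return 0;
}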