views: 106
answers: 2
I'm studying JIT design with regard to dynamic-language VM implementation. I haven't done much assembly since the 8086/8088 days, just a little here and there, so be nice if I'm off base.

As I understand it, the x86 (IA-32) architecture still has the same basic limited register set it always did, but the internal register count has grown tremendously; those internal registers are not generally available, though, and are used with register renaming to achieve parallel pipelining of code that otherwise could not be parallelized. I understand this optimization pretty well, but my feeling is that while these optimizations help overall throughput and parallel algorithms, the limited register set we are still stuck with results in more register-spilling overhead, such that if x86 had double or quadruple the registers available to us, there might be significantly fewer push/pop opcodes in a typical instruction stream. Or are there other processor optimizations that also optimize this away that I am unaware of? Basically, if I have a unit of code with 4 registers to work with for integer work but a dozen variables, I've potentially got a push/pop for every 2 or so instructions.
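To make that concrete, here's roughly the kind of unit I have in mind (a hypothetical C sketch; the names are made up, and how much actually spills depends on the compiler and calling convention):

/* With only a handful of integer registers free for computation on
 * 32-bit x86, a compiler cannot keep all twelve of these values in
 * registers at once, so some of them live on the stack and get
 * reloaded between uses. */
int mix(int a, int b, int c, int d, int e, int f,
        int g, int h, int i, int j, int k, int l)
{
    int t1 = a * b + c;
    int t2 = d * e + f;
    int t3 = g * h + i;
    int t4 = j * k + l;
    return (t1 ^ t2) + (t3 ^ t4);   /* all four temporaries still live here */
}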

Any references to studies, or better yet, personal experiences?

EDIT: x86_64 has 16 general-purpose registers, double that of x86-32; thanks for the correction and info.

+2  A: 

In addition to renaming registers to hide bubbles due to instruction latencies, most modern x86 implementations are smart enough to track pushes and pops and rename those onto registers as well. Remember that the instruction decoder on the x86 actually performs a sort of JIT compilation, turning the x86 instruction stream into a small micro-op program stored in the trace cache. Part of this process includes intercepting small-offset stack loads and turning those into register accesses as well. Thus something like the following (patently silly, and purely for example):

mov eax, [ebp]      ; load a local into eax
mov ebx, [ebp+4]    ; load another local into ebx
add eax, [edx]      ; add a value read from memory
push eax            ; spill eax to the stack
mov eax, [ebp+8]    ; load a third local into eax
add eax, ebx
pop ebx             ; reload the spilled value into ebx
add eax, ebx

cooks into something like the following (pretend the internal registers are named, say, r0..r15):

lw  r1, ebp         ; eax = [ebp]
lw  r2, ebp+4       ; ebx = [ebp+4]  (the '+4' stored as an immediate operand)
lw  r3, edx         ; the add's memory operand becomes its own load
add r1, r3          ; eax += [edx]
or  r4, r1, r1      ; 'push eax' renamed onto a shadow register: move r1 to r4
lw  r1, ebp+8       ; eax = [ebp+8]
add r1, r2          ; eax += ebx
or  r2, r4, r4      ; 'pop ebx' renamed: move r4 back to r2
add r1, r2          ; eax += ebx

Of course a magically smart decoder (unlike the one that actually fits into the transistor budget) would collapse some of the unnecessary moves there, but the point I am making is that pushes/pops and loads/stores to esp+(some small offset) actually get turned into shadow-register accesses.

Crashworks
Thanks CrashWorks, great answer. Do you have a good reference for this? I have several architecture books and none of them mention this, but it was my suspicion that something like this was going on.
mrjoltcola
You can infer much of this information from chapter 2 of Intel's optimization manual ( http://www.intel.com/products/processor/manuals/ ). You can also run some controlled experiments to try to figure out some of the "black box" internals. And you could always go poking through Intel's patents: after all, the purpose of a patent is that they have to tell you how it works! #5740414 might be a place to start.
Crashworks
Thanks, good points. First step is to order newer manuals than my 15 year old ones. :)
mrjoltcola
That looks more like PPC than x86...
Jens Björnhager
Yes, internally the x86 instruction stream gets decoded to something more like RISC, and it's the decoded trace that actually gets stored in the instruction cache and executed.
Crashworks
+1  A: 

Two points:

(1) x86-64 doubles the number of registers to 16

(2) In modern x86 CPUs, an instruction that uses a memory location that is already in L1 cache is nearly as fast as the same operation with a register operand, so you can almost think of L1 as "register memory".
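As a rough illustration of (2), a sketch only (hypothetical names; real numbers vary by microarchitecture and compiler): when the working set fits in L1, the compiler can leave the data in memory and fold the load into the arithmetic instruction at little cost.

#include <stddef.h>

enum { N = 1024 };       /* 4 KB table: comfortably inside a typical 32 KB L1 */
static int table[N];

/* A 32-bit x86 compiler may emit something like
 *     add eax, [table + ecx*4]
 * inside this loop; because the loads hit L1, it runs at close to
 * the speed of an add with a register operand. */
int sum_table(void)
{
    int sum = 0;
    for (size_t i = 0; i < N; ++i)
        sum += table[i];      /* memory operand, but L1-resident */
    return sum;
}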

Paul R
Re: (2) -- I've timed the latency of an L1 *hit* on my i7 at about 9 cycles, which (astonishingly) is actually slower than the Core Duo.
Crashworks
How did you time that? It should only be 2 cycles on a Core i7, I think, which should not be a problem on a reasonably well-scheduled instruction stream.
Paul R
Re: (1), time to update my hardware manuals. I have 386, 486 and Pentium but nothing newer. Thanks to both you guys for the answers.
mrjoltcola
Sorry, I had my numbers mixed up in my notes -- that's actually the latency for the i5. We just used the cache latency tool from cpuid.com.
Crashworks
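For anyone who wants to reproduce the latency numbers discussed in these comments: load-to-use latency is usually measured with a dependent pointer-chasing loop. Below is a minimal sketch; the buffer size, iteration count, and use of clock() are arbitrary choices, and a careful measurement (like the cpuid.com tool mentioned above) would use cycle counters, pin the thread, and control for TLB and prefetch effects.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define SIZE  (4096 / sizeof(void *))   /* 4 KB buffer: fits in L1 */
#define ITERS 100000000UL

int main(void)
{
    void **buf = malloc(SIZE * sizeof(void *));
    if (!buf)
        return 1;

    /* Build a cyclic chain: buf[i] points to buf[i+1]. */
    for (size_t i = 0; i < SIZE; ++i)
        buf[i] = &buf[(i + 1) % SIZE];

    /* Each load depends on the previous one, so the loop time is
     * dominated by load latency, not throughput. */
    void **p = buf;
    clock_t start = clock();
    for (unsigned long i = 0; i < ITERS; ++i)
        p = (void **)*p;
    clock_t end = clock();

    /* Print p so the compiler cannot optimise the chain away. */
    printf("%p: %.2f ns per load\n", (void *)p,
           1e9 * (double)(end - start) / CLOCKS_PER_SEC / ITERS);
    free(buf);
    return 0;
}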