I can understand this requirement for the old PPC RISC systems, and even for x86-64, but for the old tried-and-true x86? In that case, the stack needs to be aligned on 4-byte boundaries only. Yes, some of the MMX/SSE instructions require 16-byte alignment, but if that is a requirement of the callee, then the callee should ensure the alignment is correct. Why burden every caller with this extra requirement? This can actually cause some drops in performance, because every call site must manage this requirement. Am I missing something?

Update: After some more investigation into this and some consultation with some internal colleagues, I have some theories about this:

  1. Consistency between the PPC, x86, and x64 version of the OS
  2. It seems that GCC's codegen now consistently does a `sub esp, xxx` and then "mov"s the data onto the stack rather than simply using "push" instructions. This can actually be faster on some hardware.
  3. While this does complicate the call sites a little, there is very little extra overhead when using the default "cdecl" convention where the caller cleans up the stack.

The issue I have with the last item is that, for calling conventions that rely on the callee cleaning the stack, the above requirement really "uglifies" the codegen. For instance, what if some compiler decided to implement a faster register-based calling style for its own internal use (i.e. any code that isn't intended to be called from other languages or sources)? This stack-alignment requirement could negate some of the performance gains achieved by passing some parameters in registers.

Update: So far the only real answer has been consistency, but to me that's a bit too easy an answer. I have well over 20 years of experience with the x86 architecture, and if consistency, rather than performance or something else concrete, is really the reason, then I respectfully suggest that requiring it is a bit naive on the developers' part. They're ignoring nearly three decades of tools and support, especially if they're expecting tools vendors to quickly and easily adapt their tools to the platform (maybe not... it is Apple...) without having to jump through several seemingly unnecessary hoops.

I'll give this topic another day or so then close it...

+4  A: 

I believe it's to keep it in line with the x86-64 ABI.

Andrew Grant
That makes sense... to a point. What is the value in this, really? Only tool creators really care about this stuff, as most developers simply rely on the tool to "do the right thing."
Allen Bauer
Maybe due to the (relatively) short life x86-32 is likely to have on the Mac?
Andrew Grant
A: 

In order to maintain consistency in the kernel - this allows the same kernel to be booted on multiple architectures without modification.

PixelSmack
That's the only thing folks seem to say; however, for higher-level languages this is a detail that is (or should be) hidden. Any compiled x86-32 ObjC, C, or C++ application shouldn't care, since this is an opaque detail.
Allen Bauer
A kernel needs to be compatible with the call-stack of user processes because it will need to use that occasionally for working space to handle certain system-calls or interrupts.
TokenMacGuy
It doesn't seem to hurt the Windows and Linux kernels that their stacks aren't 16-byte aligned. What is so special about MacOS on x86?
Allen Bauer
+3  A: 

I am not sure, as I don't have first-hand proof, but I believe the reason is SSE. SSE is much faster if your buffers are already aligned on a 16-byte boundary (movaps vs. movups), and every x86 Mac running Mac OS X has at least SSE2. It can be taken care of by the application, but the cost is pretty significant. If the overall cost of making it mandatory in the ABI is not too significant, it may be worth it. SSE is used quite pervasively in Mac OS X: the Accelerate framework, etc...

David Cournapeau
That is the best reason I can come up with as well... however, the requirement is that the stack is aligned *before* the call. Once the callee is in control, the stack is no longer aligned! (The return address is now on top of the stack.)
Allen Bauer
It doesn't matter so much that the stack pointer is not aligned at that point, because you want the arguments to be aligned in memory. So with your typical stack frame, you are guaranteed that you are 16-byte aligned at 8(%ebp), which is where your arguments begin.
Adam K. Johnson
+1  A: 

Hmm, didn't the OS X ABI also do funny RISC-like things, like passing small structs in registers?

So that points to the consistency with other platforms theory.

Come to think of it, the FreeBSD syscall API also aligns 64-bit values (e.g. in lseek and mmap).

Marco van de Voort
+11  A: 

From the "Intel® 64 and IA-32 Architectures Optimization Reference Manual", section 4.4.2:

"For best performance, the Streaming SIMD Extensions and Streaming SIMD Extensions 2 require their memory operands to be aligned to 16-byte boundaries. Unaligned data can cause significant performance penalties compared to aligned data."

From Appendix D:

"It is important to ensure that the stack frame is aligned to a 16-byte boundary upon function entry to keep local __m128 data, parameters, and XMM register spill locations aligned throughout a function invocation."

http://www.intel.com/Assets/PDF/manual/248966.pdf

rob mayoff
+1  A: 

While I cannot really answer your question of WHY, you may find the manuals at the following site useful:

http://www.agner.org/optimize/

Regarding the ABI, have a look especially at:

http://www.agner.org/optimize/calling_conventions.pdf

Hope that's useful.

PhiS
+1  A: 

This is an efficiency issue.

Making sure the stack is 16-byte aligned in every function that uses the new SSE instructions adds a lot of overhead for using those instructions, effectively reducing performance.

On the other hand, keeping the stack 16-byte aligned at all times ensures that you can use SSE instructions freely with no performance penalty. There is no cost to this (measured in instructions, at least); it only involves changing a constant in the prologue of the function.

Wasting stack space is cheap; the stack is probably the hottest part of the cache.

I find this to be a very shallow explanation. Why does *every* function in the call chain have to do this work on the off chance that an SSE instruction *may* be used? If this "overhead" is no big deal, then it is "no big deal" to do it *at the point where the SSE instructions are being used!* I don't require my neighbors to keep *my* house clean.
Allen Bauer
Your conclusion is incorrect. Notice the difference between making and keeping. There is no work involved in keeping the stack 16-byte aligned. This simply involves changing a constant in the prologue to ensure that the stack is aligned. I've updated my original answer to underscore this. OTOH, making the stack 16-byte aligned involves work, and has a cost measured in instructions.
That is only true assuming your compiler's code generator works like GCC's. The world is far more than GCC. If the compiler reserves stack space for all locals and for the parameters of every function the current function calls, that is valid. However, many compilers may not work that way, and in fact trying to *make* them work that way may be too costly. The other thing is that not *all* SSE instructions require alignment; only the MOVxxA instructions do. So even then, the subset of potential instructions the system is tuning for is relatively small. An app may *never* use SSE, directly or indirectly.
Allen Bauer
The cost analysis is the same whether stack space for all locals is reserved by the prologue or not. Whenever stack space is allocated, `sub $xx, %esp` is the way to do it. Keeping the stack 16-byte aligned means xx is a multiple of 16; all the compiler needs to do is round up. Maybe you could give an example of where this hurts?
+2  A: 

First, note that the 16-byte alignment is an exception introduced by Apple to the System V IA-32 ABI.

The stack alignment is only needed when calling system functions, because many system libraries use SSE or AltiVec extensions, which require 16-byte alignment. I found an explicit reference in the libgmalloc man page.

You can handle your stack frame any way you want, but if you try to call a system function with a misaligned stack, you will end up with a misaligned_stack_error message.

Edit: For the record, you can get rid of alignment problems when compiling with GCC by using the -mstackrealign option.

Laurent Etiemble
The problem is that the compiler doesn't really know whether a given call is to a system function or not. This means the only "safe" thing to do is to ensure the stack remains aligned throughout the call chain. We already take advantage of this fact when dealing with hand-coded low-level assembler functions that are known never to call system functions.
Allen Bauer
Oh, another thing: it is kind of hard to "recompile with GCC" since we're in the process of modifying our existing Delphi compiler to target the Mac. GCC isn't involved, since we've got our own frontend and code generator/backend; that is why this is an issue.
Allen Bauer
+1  A: 

My guess is that Apple believes everyone just uses Xcode (gcc), which aligns the stack for you. So requiring the stack to be aligned, so the kernel doesn't have to realign it, is just a micro-optimization.

Mike