Mac OS X Stack Alignment

Preface:  I'm back (again) [shades of Men In Black 2, sort of].  If you don't understand what that means, don't worry about it.

In the Mac OS X ABI Function Call Guide there is an innocent little sentence:  "The stack is 16-byte aligned at the point of function calls."  We've not been able to find out why this is required for the IA-32 environment, but they really mean it, and there are deep implications.

OS X is, under the covers, your basic *nix system, making heavy use of shared libraries.  When your code references a function in a shared library, a little call stub is constructed by the linker, and the loader fixes up this stub to point to a loader helper function which will perform a lazy bind to the function.  Take a look at the first instructions of that helper function:
0x8fe18be0 <__dyld_fast_stub_binding_helper_interface+0>: push 0x0
0x8fe18be2 <__dyld_stub_binding_helper_interface+0>: sub esp,0x64
0x8fe18be5 <__dyld_stub_binding_helper_interface+3>: mov DWORD PTR [esp+0x54],eax
0x8fe18be9 <__dyld_stub_binding_helper_interface+7>: mov eax,DWORD PTR [esp+0x68]
0x8fe18bed <__dyld_stub_binding_helper_interface+11>: mov DWORD PTR [esp+0x60],eax
0x8fe18bf1 <__dyld_stub_binding_helper_interface+15>: mov DWORD PTR [esp+0x68],ebp
0x8fe18bf5 <__dyld_stub_binding_helper_interface+19>: mov ebp,esp
0x8fe18bf7 <__dyld_stub_binding_helper_interface+21>: add ebp,0x68
0x8fe18bfa <__dyld_stub_binding_helper_interface+24>: mov DWORD PTR [esp+0x58],ecx
0x8fe18bfe <__dyld_stub_binding_helper_interface+28>: mov DWORD PTR [esp+0x5c],edx
0x8fe18c02 <__dyld_misaligned_stack_error+0>: movdqa XMMWORD PTR [esp+0x10],xmm0
0x8fe18c08 <__dyld_misaligned_stack_error+6>: movdqa XMMWORD PTR [esp+0x20],xmm1
0x8fe18c0e <__dyld_misaligned_stack_error+12>: movdqa XMMWORD PTR [esp+0x30],xmm2
0x8fe18c14 <__dyld_misaligned_stack_error+18>: movdqa XMMWORD PTR [esp+0x40],xmm3

Note the last four instructions.  If you backtrack to the first couple of instructions, you'll see that ESP gets tweaked by a total of 0x68 bytes (4 for the push, 0x64 for the sub).  MOVDQA faults unless its memory operand is 16-byte aligned, so if ESP isn't exactly 8 bytes past a 16-byte boundary on entry to this helper (that is, ESP mod 16 = 8), the four instructions above will definitely kill you.  The symbolic name that GDB reports for these instructions makes me infer that this is a gate-keeper function intended solely to ensure that the ABI stack alignment requirement is met.  If you wonder why the offset on entry has to be 8, recall that we've stepped through a linker-constructed thunk.  Back off the 4 bytes that the thunk accounts for, and you are left with 4 bytes, which is the return address into the user code.  Back off that return address too, and you are down to the 16-byte alignment that the ABI requires at the point of the function call.
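The arithmetic is easy to check with a small model.  The sketch below uses Python purely as a calculator; it is not code from dyld, and the function name and sample ESP values are made up for illustration.  It follows ESP from the user's call instruction, through the two dwords described above, the push 0x0 and the sub esp,0x64, and then checks each MOVDQA operand from the listing.

```python
# Illustrative model of ESP on the way into the binding helper.
# Offsets of the four movdqa stores, taken from the disassembly listing.
MOVDQA_OFFSETS = (0x10, 0x20, 0x30, 0x40)


def movdqa_operands_aligned(esp_at_call: int) -> bool:
    """Return True if every movdqa store lands on a 16-byte boundary.

    esp_at_call is ESP at the user's CALL instruction, after all
    arguments have been pushed.
    """
    esp = esp_at_call
    esp -= 4      # CALL in user code pushes its return address
    esp -= 4      # the linker-built thunk accounts for another dword
    esp -= 4      # push 0x0
    esp -= 0x64   # sub esp,0x64
    return all((esp + offset) % 16 == 0 for offset in MOVDQA_OFFSETS)


if __name__ == "__main__":
    # Hypothetical stack addresses, one for each possible dword offset.
    for esp in (0xBFFFF000, 0xBFFFF004, 0xBFFFF008, 0xBFFFF00C):
        print(hex(esp), movdqa_operands_aligned(esp))
```

Only the first sample value, which is 16-byte aligned at the call, passes; the other three would fault on the first MOVDQA.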

This little gate-keeper is draconian since from a practical standpoint it means we have to maintain the 16-byte stack alignment in pretty much all of our code.  The only time you can avoid keeping the stack aligned is if you can guarantee that the call tree you are dealing with will never escape your local unit (in the case of Pascal code).  Why?  Because at compile time, you cannot guarantee that any particular unit you are making a reference to will not be in a package, and hence reached via the dynamic loader.

For those of you thinking about Mac OS X, this means that you should do some planning.  If you have assembly code which is not leaf code, you should inspect it very carefully to see that none of the call trees can escape the unit in which the code lives.  Remember that the compiler will use helper functions in native Pascal code for a lot of things, like reference counting interfaces, copying strings, etc.  Those helpers live in the System unit, and if you are linking against packages, you'll go through the dynamic loader to get to them.  Even if you don't link against packages, some helpers might drop to the O/S for memory allocation, which will go through the dynamic loader.  So if your assembly code calls what looks like a leaf function in the same unit, where that function is implemented in Pascal, you have to make sure that no helpers are used, if you don't maintain stack alignment in your assembly code.  Your alternative is to go through the assembly code, and manually align the stack prior to each call.  I can tell you that I did that for a bunch of code, and it made my teeth ache.

Another option to consider is to just keep Pascal versions of your assembly coded routines, and use those, at least initially, on OS X.  There are plenty of good reasons to keep around high level versions of these routines anyway.  I'm very fond of this option, personally.

It's somewhat interesting to look at what gcc does for stack frame construction on OS X.  They always build a full EBP frame, and adjust the stack by the largest amount of local storage required up front in the prolog.  The stack is aligned once at that point, and from then on, no PUSH instructions are used.  I believe this is more efficient in the long haul, but it requires very different management of function return results when building parameter lists where there are nested function calls in the parameter lists.
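As a rough illustration of that up-front adjustment, here is a sketch of how the size of the single stack adjustment could be chosen.  It assumes a prolog of push ebp / mov ebp,esp / sub esp,N with no other registers pushed first; the helper name is hypothetical, and this is not a description of gcc's actual algorithm.

```python
def frame_adjust(local_bytes: int, max_outgoing_arg_bytes: int) -> int:
    """Smallest N for `sub esp, N` that leaves ESP 16-byte aligned.

    Assumes ESP % 16 == 12 on entry (aligned at the call, minus the
    return address) and that the prolog pushes only EBP before the
    subtraction.  The frame must hold the locals plus the largest
    outgoing argument list, since arguments are stored rather than pushed.
    """
    needed = local_bytes + max_outgoing_arg_bytes
    # After `push ebp`, ESP % 16 == 8, so N must satisfy N % 16 == 8.
    return needed + (8 - needed) % 16


if __name__ == "__main__":
    # 4 bytes of locals and a call taking three dword arguments.
    print(frame_adjust(4, 12))
```

With the frame sized this way, outgoing arguments are written with MOV to [esp], [esp+4] and so on, and ESP stays aligned at every call in the function body.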

So to recapitulate, on entry to your functions, you will find the stack aligned to 16 bytes, minus the return address.  In other words, ESP will always be 0xnnnnnnnC.  If you want to call a function in another unit, you have to ensure the stack is aligned at the point of the call instruction.  Here are some examples:
procedure myAsmFunc;
asm
  // on entry, ESP will be 0xnnnnnnnC
  // call procedure A(a: integer; b: integer; c: integer); cdecl;
  push 0
  push 1
  push 2
  call A // OK, because ESP is now 0xnnnnnnnC - 12 = 0xnnnnnnn0
  add esp, 12
  // call procedure B(a: integer; b: integer); cdecl;
  push 0
  push 1
  call B // _NOT_ OK, because ESP is now 0xnnnnnnnC - 8 = 0xnnnnnnn4
  add esp, 8
  // call procedure B(a: integer; b: integer); cdecl;
  push ecx // add a dword to make the alignment come out
  push 0
  push 1
  call B // that's better, because ESP is now 0xnnnnnnnC - 12 = 0xnnnnnnn0
  add esp, 12 // removes the two arguments and the padding dword
end;
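The padding in the examples above follows a simple rule that can be written down once.  This sketch uses Python as a calculator only; the function names are made up, and the default assumes nothing has been pushed since function entry (no frame, no saved registers), so ESP mod 16 is still 0xC.

```python
def padding_before_pushes(arg_bytes: int, esp_mod_16: int = 0xC) -> int:
    """Bytes to subtract from ESP before pushing arg_bytes of arguments
    so that ESP % 16 == 0 at the CALL instruction."""
    return (esp_mod_16 - arg_bytes) % 16


def cdecl_cleanup(arg_bytes: int, esp_mod_16: int = 0xC) -> int:
    """Operand for the `add esp, ...` after a cdecl call: arguments plus padding."""
    return arg_bytes + padding_before_pushes(arg_bytes, esp_mod_16)


if __name__ == "__main__":
    for arg_bytes in (12, 8):  # procedure A and procedure B from the examples
        print(arg_bytes, padding_before_pushes(arg_bytes), cdecl_cleanup(arg_bytes))
```

For the three-argument call the padding is 0, and for the two-argument call it is 4, which is the extra push ecx shown above.  If your routine has already pushed registers or built a frame, pass the resulting ESP mod 16 instead of relying on the default.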

If you write pure Pascal code, then the compiler will take care of all the alignment work for you.  Once you've done the hand alignment tweaking on existing, beautifully tuned IA-32 assembly code you may find your opinions shifting around on you with respect to how to approach the problem. Best of luck to you.

Oh, one more thing - you have to be careful with your testing of the manual alignment solution.  Just because it runs doesn't mean you got it right.  Once you make the first call to the dynamic loader function, the OS will back-patch the thunk with the actual function address that you wanted to bind to.  So if some other bit of code calls that thunk before your assembly code calls it, the gate-keeper code may no longer be on duty.  It's best if you single step the code, and verify that (ESP & 0xF) == 0 at all call instructions.
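To see why a passing test run proves so little, consider this toy model of a lazily bound stub.  It is a deliberately simplified sketch, not a description of dyld's internals; the class and method names are invented for illustration.

```python
class LazyStub:
    """Toy model of an import stub that is bound on first use."""

    def __init__(self) -> None:
        self.bound = False

    def call(self, esp_at_call: int) -> str:
        if not self.bound:
            # The first call goes through the binding helper, whose
            # movdqa stores fault if the stack is misaligned.
            if (esp_at_call & 0xF) != 0:
                raise RuntimeError("misaligned stack caught by binding helper")
            self.bound = True  # the stub is back-patched to the real target
            return "bound via helper"
        # Later calls go straight to the target; nothing checks alignment.
        return "direct call"


if __name__ == "__main__":
    stub = LazyStub()
    print(stub.call(0xBFFFF000))  # aligned caller binds the stub
    print(stub.call(0xBFFFF004))  # misaligned caller goes unnoticed
```

In this model the misaligned second caller sails through because an aligned caller got there first.  The same misaligned call on a fresh stub raises, which is the crash you would see if your assembly code happened to be the first caller.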


  • Guest
    Ray Vecchio Wednesday, 20 May 2009

    Welcome BACK!!! Mr TLink!! (I'll never forget Matt Pietrek's comment!) Will we see a contingency deorbit in there??


  • Guest
    115 Thursday, 21 May 2009

    It seems only yesterday that I learnt about contingency deorbit, but of course, it was more than a while back...

    Cheers, #R; :op

  • Guest
    Ivan Revelli Thursday, 21 May 2009


  • Guest
    Ivan Revelli Saturday, 23 May 2009

    Incredible... which technology are you thinking of using to implement VCL cross-platform? Qt, GTK, or others?
    Good job, really cool...
    In fact, what I really need on other systems is the ability to produce .so files for Apache, but ..
    Sorry, I don't speak English well.

  • Guest
    Atle Monday, 25 May 2009

    How about introducing a small amount of compiler magic instead of rewriting a lot of code. The compiler should be able to parse and count how the stack is aligned before any call and create measures in between your assembly code. That way you don't need to control this. You could create a conditional define to disable this magic for parts where you don't want it.

  • Guest
    Eli Boling Monday, 25 May 2009

    Ivan, I can't really comment on the VCL side of things. You'd have to ask Allen Bauer about that. I'm dealing with the compiler, and RTL currently.

  • Guest
    Eli Boling Monday, 25 May 2009

    Atle, compiler magic won't help for assembler code. There are too many ways to construct a situation with assembler code (e.g. data dependent code paths), where the compiler would never be able to do the right thing for stack alignment. It isn't worth the large amount of effort this would be. Furthermore, if we _did_ come up with something, it would have the downside that you would have this magical reduction in performance of your assembler code because of the stuff we injected. I don't think that would be popular at all.

  • Guest
    Atle Tuesday, 26 May 2009

    Yeah, there are a lot of different situations even though most code follows specific patterns. There is one way that could be easier though, by moving all changes right before and after any call. But that means moving prepared parameters, and a reduction in performance.

    Does this aligning stuff impact enough to legitimate a use of another register to track the difference. EBX is already defined special for linux, and maybe that register could be used to always keep a realign value for the stack (or the stack value), and this is used to adjust ESP back to normal when entering a proc. Would simplify any calls without bigger performance issues. But that would only work when the libraries come from the same compiler.. But maybe that is the case and other cases could be detected to use other ways?

  • Guest
    Eli Boling Tuesday, 26 May 2009

    Atle, we'll be using EBX as well on Mac OSX for PIC support. Very similar model to Linux. I'd have to go into hiding in the woods someplace if I proposed taking another register away from folks.

  • Guest
    Atle Tuesday, 26 May 2009

    I haven't worked with Mac's, didn't know that :) Well we don't want you hiding around in the woods, and with so few registers there seem to be few options :) I'm just ranting around here without background information, just hoping that maybe something could be usable. The whole alignment control they have put in there seems very odd with no obvious positive effects, but maybe they have some good reasons for it (that I don't see).

    Good luck with your alignment ;)

  • Guest
    Alex Bakaev Friday, 29 May 2009

    It's awesome to have you back!

    Barbarians are gathering :)

  • Guest
    PhiS Monday, 15 June 2009

    Is there a chance for an ALIGN directive for ASM code/data? That would obviously come in handy here - and has been requested many times for Win32 assembly code.

  • Guest
    Eli Boling Monday, 15 June 2009

    @PhiS: we'll consider it, but it will be lower priority, compared to getting the porting issues resolved.

  • Guest
    Ze Mané Thursday, 16 July 2009

    I didn't understand any of it!

  • Guest
    Wim L Thursday, 20 August 2009

    I'm guessing the alignment is needed for the movdqa instruction to work— lots of SIMD extensions have stringent alignment restrictions. The symbol is just there so that if you get the alignment wrong, the symbolic backtrace triggered by the alignment trap will give you a hint what happened.

    The XMM registers aren't callee-saved, so I don't know why it's storing them to the stack. Maybe they're used for argument passing in some cases.

  • Guest

    [...] (including Extended) require to be aligned on 16-byte boundaries. For explanation, see the Mac OS X Stack Alignment article by Eli [...]
