Ignore this... typo... reinvestigating...
Circa 3% improvement after a number of runs. Interesting! Fair play to Intel :) I shall get my hat. And that's calling my super contrived example:
    #include "e.h"

    int func(PassInStack s) { return s.a + s.b; }
    int func(PassInRegister s) { return s.c + s.d; }
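For anyone wanting to reproduce this: e.h isn't shown here, so below is only a guess at what it could look like. The padding member in PassInStack and the exact field types are my assumptions, picked so that under the x86-64 System V ABI the small struct fits in one register and the big one (over 16 bytes) gets passed in memory.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for e.h: the real header is not shown in this post.
struct PassInStack {
    std::int32_t a;
    std::int32_t b;
    std::int64_t padding[2];  // assumption: pushes the size past 16 bytes,
                              // so System V x86-64 passes it in memory
};

struct PassInRegister {
    std::int32_t c;
    std::int32_t d;           // two 32-bit fields: 8 bytes, fits in RDI
};

static_assert(sizeof(PassInRegister) == 8, "expected a 64-bit struct");
static_assert(sizeof(PassInStack) > 16, "expected a struct too big for registers");

// Same two overloads as in the contrived example.
int func(PassInStack s) { return s.a + s.b; }
int func(PassInRegister s) { return s.c + s.d; }
```

Calling func(PassInRegister{3, 4}) and func(PassInStack{3, 4, {0, 0}}) then exercises the two calling conventions with the same arithmetic.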
However! My called PassInRegister function is using the red zone of the stack as temporary storage: it spills the struct there to split it into its two fields before doing the addition (the struct arrives in RDI, the result goes back in EAX)!
    _Z4func14PassInRegister:                ## @_Z4func14PassInRegister
        pushq   %rbp
        movq    %rsp, %rbp
        movq    %rdi, -8(%rbp)
        movl    -8(%rbp), %eax
        addl    -4(%rbp), %eax
        popq    %rbp
        retq
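To make it clearer what that store and reload are doing, here is a rough sketch of the same split done with a shift instead of a trip through memory. It assumes a little-endian x86-64 target and a struct of two 32-bit ints; the names as_register and add_halves are made up for illustration and are not anything the compiler emits.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical layout, matching the 64-bit struct discussed above.
struct PassInRegister {
    std::int32_t c;
    std::int32_t d;
};

// The 8 bytes of the struct as they would sit in RDI.
std::uint64_t as_register(PassInRegister s) {
    std::uint64_t bits;
    static_assert(sizeof bits == sizeof s, "struct must be exactly 64 bits");
    std::memcpy(&bits, &s, sizeof bits);
    return bits;
}

// Split the register without touching memory: on little-endian,
// the low 32 bits hold c and the high 32 bits hold d.
std::int32_t add_halves(std::uint64_t bits) {
    auto lo = static_cast<std::uint32_t>(bits);
    auto hi = static_cast<std::uint32_t>(bits >> 32);
    return static_cast<std::int32_t>(lo + hi);  // wraps like addl does
}
```

In the listing above, -8(%rbp) is the low half (c) and -4(%rbp) is the high half (d), so the two loads pick out the same values that the shift does here.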
I'll contrive the example further ...
Which makes it a 7% gain for 128-bit structures, versus 3% for your 64-bit one...