Pass By Value


#1

Speaking of rainy days.  From the coding standard:

When passing small, POD objects into functions, you should always pass them by value, not by reference.
Most experienced C++ coders, myself included, fall into the habit of always passing function parameters as const-references, e.g. "const Foo&". This is the right thing to do for complex objects (e.g. Array, String, etc), but when you pass a reference, it prevents the compiler from using a whole slew of optimisation techniques on the call-site. For example, it means that there's no way for the compiler to really know whether the function will modify the original value (via const_cast) or whether it will modify a memory address which is an offset from the location of the object.
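
A minimal sketch of that aliasing point (hypothetical functions, each defined in another translation unit):

struct Foo { int x; };

void byRef (const Foo&);
void byValue (Foo);

int callRef()
{
    Foo f { 1 };
    byRef (f);     // the callee could const_cast the reference and modify f
    return f.x;    // so f.x has to be re-read from memory
}

int callValue()
{
    Foo f { 1 };
    byValue (f);   // the callee only ever sees a copy
    return f.x;    // so the compiler can simply return the constant 1
}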

So, the best-practice advice from the guys who actually write the optimisers is: Always stick to pass-by-value if possible, and only use references if the price of calling the copy constructor is very high.

This is particularly true in the case of small objects whose overall size is actually not much bigger than the size of a pointer. Some JUCE classes which should always be passed by value include: Point, Time, RelativeTime, Colour, all of the CharPointer_XYZ classes, Identifier, ModifierKeys, JustificationType, Range, PixelRGB, PixelARGB.

(I suspect that even larger structures like Rectangle<int> and Rectangle<float> may benefit from this technique too, especially on 64-bit CPUs. If anyone has the time to create benchmarks to find out for sure, let me know!)

(Ignore the nan)   

The results are in ... it makes bugger all difference for a Rectangle.   

 

                 ByReference  min: 308163  max: 326298  average: 314090  stddev: 4702.17
                     ByValue  min: 271485  max: 582046  average: 328141  stddev: nan

                FloatByValue  min: 352726  max: 383672  average: 363982  stddev: 8798.24
            FloatByReference  min: 346290  max: 481381  average: 378697  stddev: 7714.52

 

This was calling a function in another object file, using the standard release build settings on Clang.  A quick play with Godbolt suggests that this changes entirely if you use GCC with concepts (I presume some non-standard extension?), where registers are used for passing structs of four floats or ints (but not doubles).

 


#2

Godbolt: http://goo.gl/PIlvQX

Compare the assembly when switching between compilers for the pass by value case ... interesting ... ?


#3

Cool link. I'd imagine 'GCC with concepts' is just their working version of the C++17 standard, kept as a separate branch because concepts are a bit intrusive.

Have you just got the one file in that example (I'm not familiar with Godbolt)? The reason I ask is that the type of optimisations we're talking about here would happen inside the called functions, not the calling code. The optimiser can do a lot more with an object it knows has no outside influence.

It is possible that you'd end up in this situation by accident if a pass-by-reference function is inlined so that the object it is acting on is the stack-local variable and not a reference to it.

It's also worth noting that it's very difficult to write a test for these kinds of things. Even if the assembly is the same for a small use-case this can change drastically with larger programs where features like inline tolerance and link-time optimisations will change.

 

Basically, it's always better to have a local copy of an object (not just for optimisations but it's good practice for keeping methods const and you're likely to get better cache locality) unless the cost of copying that object is high. This is where pass-by-value container objects come in handy e.g. Value, Image etc.

Creating copies of simple objects is not necessarily very costly. With things like move semantics, inlining, RVO and NRVO you'll likely be doing far fewer copies than you think, and even then I doubt copying a few bytes is the thing that's slowing your program up (although this is just conjecture).
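
For example, a quick sketch with a standard container (nothing JUCE-specific):

#include <string>
#include <vector>

std::vector<std::string> makeNames()
{
    std::vector<std::string> names;
    names.push_back ("one");   // the temporary string is moved into the vector, not copied
    names.push_back ("two");
    return names;              // NRVO: usually constructed straight into the caller's storage
}

std::vector<std::string> allNames (makeNames());   // no deep copy here either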

I spent a long time trying to find the point at which to use a reference over a value; the only quantifiable guidance I've heard is Herb Sutter saying to prefer value if the object is the size of two pointers or less.

It's also worth remembering optimisers will change (and hopefully improve) over time (along with processor instructions etc.) and this two pointer threshold may increase in the future.

 

(What a late-night debate to start up!)


#4

Cool link. I'd imagine 'GCC with concepts' is just their working version of the C++17 standard, kept as a separate branch because concepts are a bit intrusive.

Have you just got the one file in that example (I'm not familiar with Godbolt)? The reason I ask is that the type of optimisations we're talking about here would happen inside the called functions, not the calling code. The optimiser can do a lot more with an object it knows has no outside influence.

I was just looking at the difference in calling optimisations.  

It is possible that you'd end up in this situation by accident if a pass-by-reference function is inlined so that the object it is acting on is the stack-local variable and not a reference to it.

It's also worth noting that it's very difficult to write a test for these kinds of things. Even if the assembly is the same for a small use-case this can change drastically with larger programs where features like inline tolerance and link-time optimisations will change.

Agreed. I think my test is far from comprehensive.

I was both testing and reviewing the asm for the no-inlining situation.  I can't make the mental leap as to why the optimisation would be different (between pass by value and const ref) in the inlined situation.  But maybe it'll become clear if I experiment.

Basically, it's always better to have a local copy of an object (not just for optimisations but it's good practice for keeping methods const and you're likely to get better cache locality) unless the cost of copying that object is high. This is where pass-by-value container objects come in handy e.g. Value, Image etc.

Lost.  If a pass by value container is essentially: 

struct Container { Object* data; };

Why is passing by value an optimisation over passing const Object & data (or const Object * data)?  The data ain't moved?

Creating copies of simple objects is not necessarily very costly. With things like move semantics, inlining, RVO and NRVO you'll likely be doing far fewer copies than you think, and even then I doubt copying a few bytes is the thing that's slowing your program up (although this is just conjecture).

Usually my piss poor choice of algo is the slow bit. :) 

I spent a long time trying to find the point at which to use a reference over a value; the only quantifiable guidance I've heard is Herb Sutter saying to prefer value if the object is the size of two pointers or less.

How does the calling convention play into this? Does it?

It's also worth remembering optimisers will change (and hopefully improve) over time (along with processor instructions etc.) and this two pointer threshold may increase in the future.

 

(What a late-night debate to start up!)

Time for a round of beers I say.  Though I'm definitely not starting this conversation without a laptop handy to settle arguments :)


#5

Hmm, here's where replying to replies gets complicated on the forum..

I was just looking at the difference in calling optimisations.  

Yes, there's unlikely to be much difference in the calling code. Optimisers are most effective when they know what they're optimising, in this case the called function. It's the code that actually deals with the object that can be improved. Otherwise the only difference is whether you're passing a pointer (i.e. reference) or constructing an object (although I'm pretty sure that happens in the calling function's code as well).

 

I was both testing and reviewing the asm for the no-inlining situation.  I can't make the mental leap as to why the optimisation would be different (between pass by value and const ref) in the inlined situation.  But maybe it'll become clear if I experiment.

Tricky to explain, but basically if you inline a pass-by-reference function call then you may have removed the reference, so the optimiser is free to do what it likes with the object. Consider this:


void otherFunctionCall();   // defined somewhere the optimiser can't see

void enlarge (Rectangle<int>& rect)
{
    otherFunctionCall();
    rect *= 2;
}

Rectangle<int> parentFunction()
{
    Rectangle<int> rect (0, 0, 100, 100);
    enlarge (rect);

    return rect;
}

Assume this doesn't get inlined and that the optimiser can't see the implementations of enlarge or otherFunctionCall. At this point, the enlarge function only has a reference to rect. The optimiser doesn't know that rect doesn't alias some other variable, which means it doesn't know that otherFunctionCall doesn't modify rect. This in turn means it may not be able to perform some optimisations on rect (in this instance it's likely to be cache or pipeline related, but it could be a compile-time computation).

 

Now imagine enlarge gets inlined; we end up with:

Rectangle<int> parentFunction() 
{ 
    Rectangle<int> rect (0, 0, 100, 100);
    otherFunctionCall();
    rect *= 2;

    return rect;
}

Now it's clear that otherFunctionCall can't modify rect, and any modifications to rect can be optimised as hard as possible (in this case the rect and the multiply are likely to be computed at compile time and put directly into the calling function's stack frame, so this will essentially boil down to a call to otherFunctionCall).

Like all examples it's contrived and pointless but hopefully explains the original question about how inlining can help pass-by-reference.

(I'll put the rest into a new post)


#6

Lost.  If a pass by value container is essentially: 

struct Container { Object* data; };

Why is passing by value an optimisation over passing const Object & data (or const Object * data)?  The data ain't moved?

It's not really, but it's good practice. There's nothing to say these types won't become POD at some point in the future. They also might contain other bits of information that you're interested in that are members rather than pointers (for example a reference count).

What I'm trying to say is that it's worth considering this paradigm as it *could* be optimised better. Consider classes like Identifier and StringRef: they're just wrappers around pointers, but by having value semantics they're less kludgy to use and you don't have to worry about any of those nullptr checks or ownership problems.
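
A stripped-down sketch of the idea (a made-up class, not the real JUCE code):

struct TextRef
{
    TextRef (const char* textToUse) noexcept
        : text (textToUse != nullptr ? textToUse : "") {}   // never null by construction

    const char* text;
};

void setTitle (TextRef newTitle);   // copying a TextRef just copies one pointer

The caller passes a pointer-sized value, never worries about ownership or null checks, and the parameter still fits in a register.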

 

Usually my piss poor choice of algo is the slow bit. :) 

I hear you there. Have fun un-picking those though!

 

How does the calling convention play into this? Does it?

Not sure I understand this. If you're passing an object (by value or reference) you're calling a function? What I was trying to get at was: if you're passing a small object (Point, Range etc.), pass it by value; for large objects (Array, Path etc.), pass by reference.
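
In signature form (hypothetical declarations, not actual JUCE ones):

void moveTo (Point<int> newPosition);       // small value type: pass by value
void setRange (Range<double> newRange);     // likewise
void strokePath (const Path& path);         // heavyweight object: pass by const reference
void addItems (const Array<int>& items);    // container: pass by const reference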

N.B. One side note is that it should be very clear we're talking about passing function 'arguments' here. If you're returning large objects you've created you should almost always return these by value. That's a whole different set of optimisations though..

It's also worth noting that you shouldn't really be writing code for the optimiser. Who knows what it's going to do! You should be writing code that's easy to reason about. I've generally found that if a person can reason about a piece of code the compiler/optimiser/linker can also!

 

Time for a round of beers I say.  Though I'm definitely not starting this conversation without a laptop handy to settle arguments :)

Indeed. Although good luck trying to find answers to these sort of questions in an inebriated state!


#7

I'll have a proper read later, and whilst I don't think I fully understand it yet, you have given me an idea about a more valid way of testing the relative performance of the pass-by-value options. 

And I think my Godbolt example isn't useful.

However, I've just worked out a cracking example showing what you're talking about.  But the train has arrived!!! :(


#8

The easiest example of your point from last night is below (the C++ source), followed by the assembler output from Xcode Clang with -O2. The interesting lines I've highlighted with ###. I think it's pretty clear without running it which will be faster.

And, from the performance test I did last night, and the assembly, I think it’s also pretty clear that the calling cost for the two is damn near equivalent.

I’m sure I can come up with a more realistic example, but I think the trivial example works.

And the calling convention is partially relevant, because it's the cost of the call that's being balanced against the possible optimisations. According to MS documentation (I've not tried it) a struct of up to 64 bits can be passed as an integer in a register.

On UNIX/Mac OS, structs of up to 128 bits are passed in registers.
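
For instance, a sketch of that 128-bit boundary:

struct FourInts { int a, b, c, d; };      // 128 bits: passed in two 64-bit registers
struct FiveInts { int a, b, c, d, e; };   // over the limit: passed in memory instead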

Presumably things fare much worse in a 32-bit app.

_____

struct S
{
   int a, b, c, d;
};

extern void externalFunctionVal(S);
extern void externalFunctionRef(const S &);

int somefunctionV()
{
   S s { 1, 0, 0, 0 };
   externalFunctionVal(s);
   s.a *= 100;
   return s.a;
}

int somefunctionR()
{
   S s { 1, 0, 0, 0 };
   externalFunctionRef(s);
   s.a *= 100;
   return s.a;
}
______ASM OUTPUT


	.section	__TEXT,__text,regular,pure_instructions
	.globl	__Z13somefunctionVv
	.align	4, 0x90
__Z13somefunctionVv:                    ## @_Z13somefunctionVv
	.cfi_startproc
## BB#0:
	pushq	%rbp
Ltmp2:
	.cfi_def_cfa_offset 16
Ltmp3:
	.cfi_offset %rbp, -16
	movq	%rsp, %rbp
Ltmp4:
	.cfi_def_cfa_register %rbp
	movl	$1, %edi
	xorl	%esi, %esi
	callq	__Z19externalFunctionVal1S
	movl	$100, %eax		### << WHOOOPING OPTIMISATION.
	popq	%rbp
	retq
	.cfi_endproc

	.globl	__Z13somefunctionRv
	.align	4, 0x90
__Z13somefunctionRv:                    ## @_Z13somefunctionRv
	.cfi_startproc
## BB#0:
	pushq	%rbp
Ltmp7:
	.cfi_def_cfa_offset 16
Ltmp8:
	.cfi_offset %rbp, -16
	movq	%rsp, %rbp
Ltmp9:
	.cfi_def_cfa_register %rbp
	subq	$16, %rsp
	movq	l__ZZ13somefunctionRvE1s+8(%rip), %rax
	movq	%rax, -8(%rbp)
	movq	l__ZZ13somefunctionRvE1s(%rip), %rax
	movq	%rax, -16(%rbp)
	leaq	-16(%rbp), %rdi
	callq	__Z19externalFunctionRefRK1S
	imull	$100, -16(%rbp), %eax   ### << STARS ARE BORN FASTER THAN THIS.
	addq	$16, %rsp
	popq	%rbp
	retq
	.cfi_endproc

	.section	__TEXT,__const
	.align	2                       ## @_ZZ13somefunctionRvE1s
l__ZZ13somefunctionRvE1s:
	.long	1                       ## 0x1
	.long	0                       ## 0x0
	.long	0                       ## 0x0
	.long	0                       ## 0x0


.subsections_via_symbols

#9

At the risk of talking to myself: for the Mac (and other UNIX systems which share the same 64-bit ABI):

  • Using the compiler-generated copy constructor enables the pass-in-registers optimisation.
  • Using =default also enables the optimisation.
  • Implementing a copy constructor yourself, even a trivial one, blocks the optimisation.

This applies for 32-bit integers and floats. It doesn't apply for doubles.
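
A compact sketch of those three cases (for the 64-bit UNIX ABI discussed above):

struct A { int x, y; };                                       // implicit copy constructor: passed in a register
struct B { int x, y; B (const B&) = default; };               // = default: still passed in a register
struct C { int x, y; C (const C& o) : x (o.x), y (o.y) {} };  // hand-written copy: forced out into memory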

Here's my poor man's emulation of Rectangle* (and Point):

#include <stdint.h>
struct P 
{
    public:
        P() :
            a(1), b(1) {}
        uint32_t a, b; 
};
struct S
{
    public:
        S(int c, int d) : c(c), d(d) {}
        S(const S&) = default;   // the key line: a defaulted copy constructor keeps S passable in registers
        P p;
        uint32_t c, d;
};
extern void func(S s);
void start()
{
    S s { 1, 2, };
    func(s); 
}

void start() itself compiles with -O2 into this tidy code: 


    movabsq    $4294967297, %rdi       ## imm = 0x100000001
    movabsq    $8589934593, %rsi       ## imm = 0x200000001
    popq    %rbp
    jmp    __Z4func1S              ## TAILCALL

 

Could the Rectangle<> and Point<> classes be adjusted to use the default copy constructors, either for C++11 with some #define and an = default, or by just removing the existing copy constructors? If so, there's probably some free performance to be had in some circumstances when combined with pass-by-value.

I'm going to have a play and see how much this actually helps in reality over a cup of tea...

* yes it does compile with the extra comma at the end of that list - which surprised me. Looks bloody untidy though.


#10

You're certainly not talking to yourself! This is all very interesting and not something I've actually delved into before.

Your example also exhibits the movabs optimisation if no copy constructor is supplied so perhaps this would be a better option (provided the compiler can generate it on all platforms).

I'll be honest: I don't really know much about specific assembly instructions, why removing the C++11 flag from the example spits out a ton more assembly, why ARM also has a lot more instructions, etc.

What I think is clear in this example is that the compiler is better at generating code than we are so where possible it should be left to it.

 

I do think we've digressed into two different talks about optimisation though: one about how to call a function in the fastest way, and one about why having a unique value object within a function is good.

If I'm correct we can summarise by saying:

  • Let the compiler generate copy constructors for you
  • If you provide a destructor, copy constructor or assignment operator, provide all three (even if they're default)
  • Where possible provide move constructors and assignment operators (for non-value types)
  • Again, if you provide any move operators also provide destructor and copy operators (rule of 5)
  • Prefer to pass by value if the object is small (<= 2 pointers)
  • Having a unique concrete object (not a pointer to one) provides the best scope for optimisation

Any other guidelines?


#11

Ha - you've introduced a whole new problem space there: rules of X and constructors.  I think they are rather oversimplified; as far as I know it's safe to:

  • Omit the destructor if you are only holding RAII type classes. 
  • Omit one or both of the move operators. 

But it's definitely iffy if there's a non-trivial copy constructor or copy assignment operator alone, or a move constructor/assignment operator without both the copy constructor and copy assignment operator.

However, -Wdeprecated in Clang will highlight this as a problem.  I think the plan was to make it an error in a future version of the standard; is it already in C++14?

Which makes it more like a rule of 'at least 2'? 

I'll try to get my head around the rest of the summary in a sec :)


#12

If you write a destructor, the move constructor/assignment operators will not be generated by the compiler. This can have drastic effects when using these objects in std containers, as they'll have to be copied when the array is resized instead of moved.

The reasoning behind this is that if you write a destructor (even if it is empty) it implies your class has some non-standard destruction behavior and hence would need special rules for moving.
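
A small sketch of the consequence, using standard library types:

#include <string>
#include <vector>

struct Widget
{
    std::string name;
    ~Widget() {}   // user-provided, even though it's empty: the implicit move operations are now suppressed
};

std::vector<Widget> widgets;
// When 'widgets' reallocates as it grows, every Widget (string buffer and all)
// is copied across; delete the destructor and they would be moved for free.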

As you say, with RAII classes you rarely need to write a destructor these days. So my rule of thumb is: if you write a destructor, write the move constructor/assignment operators as well.

 

POD objects should really be able to generate default copy/assignment operators, so as we've discovered it's probably best to leave them up to the compiler, as it'll do a better job of optimising them. With POD objects there is obviously nothing to move, so these can be omitted.

If the class is not POD and you write either a destructor, copy constructor or assignment operator, this will disable the default move constructor/assignments, hence the need to specify all of them (rule of 5).

 

I guess the correct (although time-consuming) thing to do is create unit tests for your objects and use type-traits to check they are moveable. This would catch instances where someone accidentally adds some code that disables them.


#13

The reasoning behind this is that if you write a destructor (even if it is empty) it implies your class has some non-standard destruction behavior and hence would need special rules for moving.

Generally I agree with you.  But surely writing a move constructor would be a better way of signalling that a class had special move behaviour, rather than a blank destructor! :)

I guess the correct (although time-consuming) thing to do is create unit tests for your objects and use type-traits to check they are moveable. This would catch instances where someone accidentally adds some code that disables them.

Sounds like a job for some (very useful) automated tool that one!   It could apply some pretty simple rules to determine whether the move constructors were in place.  

Though testing the move would be fancier ;) 


#14

As promised, summary of the other stuff: 

http://blog.credland.net/2015/06/pass-by-value-internals.html

I still haven't actually tested the performance improvement in reality ... suggestions? I got distracted writing some tests and produced something that made pretty rectangles float around the screen instead :)


#15

Generally I agree with you.  But surely writing a move constructor would be a better way of signalling that a class had special move behaviour, rather than a blank destructor! :)

Yes, I agree it could be a bit smarter about this but where do you draw the line? I guess this is to stop things like objects removing themselves from lists in their destructor if they've been moved out of. How does the compiler know what code needs to change if it's been moved? But yes, a blank destructor could be a special case. (But it's less code to not write one which I like).

Sounds like a job for some (very useful) automated tool that one!   It could apply some pretty simple rules to determine whether the move constructors were in place.

Actually, this is simpler than I thought as you can test the type-trait in a static assert just below the class. This could be wrapped up in a macro to make it C++11 only. E.g.

static_assert (std::is_move_constructible<S>::value, "Not move constructible");
static_assert (std::is_move_assignable<S>::value,    "Not move assignable");
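
And the macro wrapping might look something like this (PROJECT_HAS_CPP11 is a made-up guard; use whatever your build defines for C++11 availability):

#include <type_traits>

#if PROJECT_HAS_CPP11
 #define ASSERT_IS_MOVABLE(Type) \
    static_assert (std::is_move_constructible<Type>::value \
                     && std::is_move_assignable<Type>::value, \
                   #Type " must stay movable");
#else
 #define ASSERT_IS_MOVABLE(Type)
#endif

ASSERT_IS_MOVABLE (S)   // placed just below the class definition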

It would be great if there was a type trait that exposed the pass-by-register optimisation, but I can't seem to find a combination that works. This is probably too low-level for the compiler really.

 

Nice article! We've all got distracted by making moving rectangles..

Interesting to observe how const here has no effect on the code.  Removing the const from externalFunctionRef generates exactly the same assembly.

Yes, this is something that a lot of programmers find hard to grasp, so it's nice to point it out like this. I'm a big fan of const (I use it everywhere) but it's important to realise it's only a compile-time feature. It's better to view it as a static assertion than as any special optimisation. It's there as a safeguard to stop you (and other programmers) doing stupid stuff by changing state that needs to stay constant. In reality the optimiser can tell for itself whether an object is effectively const (which basically turns out to be when it's a unique local object with value semantics).

Sidebar: I would really like a 'const auto' keyword in C++, a bit like Swift's 'var' and 'let', although I can't think of a 4 letter version so the code lines up! ('cnst' perhaps?)

Sidebar 2: IIRC there is a proposal for the standard to add a keyword that is a 'const' in the way that a lot of people think of it. It's similar to what 'restrict' tried to be, in that it is a user-provided guarantee that the object will not be changed by the rest of the program. Not sure if it got into C++17 though. Sounds a bit dangerous too; leave this sort of thing up to the optimiser, it's much smarter than we are!

 

If you want a real world test it would be interesting to see if making Rectangle use this optimisation has any effect on the graphics processing speed. Whenever I've optimised drawing code, after a while the bottleneck always ends up in the edge-table creation. There's no getting away from that but if this optimisation made moving these rects around faster I'd imagine that's where you'll see a real-world benefit.


#16

I must say you are way ahead of me on the type-traits.  I read it. Understood it.  Forgot it.  Need to read it again!

I guess the formula for register optimisation is sizeof(x) <= 8 bytes (Windows) / 16 bytes (UNIX), plus the absence of a user-defined copy constructor.  Can you check programmatically for the second?  I'm not sure how.

If you've got a graphics scenario which is heavy on the rectangles ... I guess they all are.  It's a very simple modification to Rectangle, and a quick check with the disassembler, to implement.  Just swap out the Point and Rectangle copy constructors for = default.  I can't see why that wouldn't work.
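
Sketched on a cut-down Point (not the actual JUCE source):

template <typename ValueType>
struct Point
{
    Point() noexcept : x(), y() {}
    Point (ValueType initialX, ValueType initialY) noexcept : x (initialX), y (initialY) {}
    Point (const Point&) = default;              // was a hand-written member-by-member copy
    Point& operator= (const Point&) = default;   // likewise
    ValueType x, y;
};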

PS. cnut < const + auto macro?


#17

I guess the formula for register optimisation is sizeof(x) <= 8 bytes (Windows) / 16 bytes (UNIX), plus the absence of a user-defined copy constructor.  Can you check programmatically for the second?  I'm not sure how.

Yes, although I wouldn't test for the sizeof: it's too compiler-dependent and likely to change. If you had a Rectangle<double> you don't want to break your build. Perhaps this?

static_assert (std::is_trivially_copyable<S>::value, "Non-trivially copyable");

Although that isn't available on all compilers (it's missing from some versions of Clang, and GCC 4.9, for example).

It's worth noting that you don't actually need the =default copy constructor, you can just omit it to get the default, providing it hasn't been disabled by the presence of a destructor. Perhaps this:

static_assert (std::is_trivially_destructible<S>::value, "Non-trivially destructible");

So it looks like we're almost there! I'll try the JUCE graphics demo and see if it makes any difference.

 

PS. cnut < const + auto macro?

Interesting idea, although (and this goes for some of my earlier suggestions too) we should be avoiding macros like the plague these days. When modules are introduced with C++17 they won't be valid. And I'm hoping modules are going to be the single biggest improvement to C++ since lambdas and initialiser lists. Think 10x faster compile times and IDEs that can actually jump to symbols before a solar eclipse!


#18

I'll try the JUCE graphics demo and see if it makes any difference.

And the results are... No perceivable difference.

Moral of the story: Don't optimise


#19

Interesting. Probably depends what you are doing.   I refuse to believe there's no difference.  Let me time something really boring and analytical with no relevance to the real world and see how it does. 

I'll get back to you.


#20

Interesting idea, although (and this goes for some of my earlier suggestions too) we should be avoiding macros like the plague these days. When modules are introduced with C++17 they won't be valid. And I'm hoping modules are going to be the single biggest improvement to C++ since lambdas and initialiser lists. Think 10x faster compile times and IDEs that can actually jump to symbols before a solar eclipse!

I've been ignoring all temptation to look at C++14 and C++17.  Having built lots of product against C++11 and had trouble with backwards compatibility ... once burned ;-)