I’m writing this on the off chance somebody knows what’s going on. I recently upgraded JUCE from a version on May 2019 to the latest master as of today. Purely from that update, I’m seeing increase of CPU usage in iOS from 9% to 19%! The only changes to the code were a massive amount of std::atomic* and a handful of std::unique_ptr.
Well this is quite shocking. I decided I would have to endure the horrific task of bisecting the JUCE repo in order to find the point at which the cpu increase occurred. Thankfully, I had a hunch to try commit 70ddcd which is the commit: APVTS: Use atomic floats for current parameter states
I can confirm that commit eb780b (the one right before 70ddcd) has 9% CPU in iOS and commit 70ddcd (including atomic* and unique_ptr changes needed) has 19% CPU! So my bisection task has been graciously terminated.
Is there a chance that all these atomic* are resulting in a CPU increase? I really have no idea.
Anyway, I’m considering using the latest JUCE revision but reverting (or trying at least) the changes made in commit 70ddcd to see if I can have a working solution with the CPU performance I used to have. Can anyone at JUCE think of a reason why not to do this? I really need to use the latest JUCE revision as for some reason the old May2019 revision no longer works for android (null pointer exception that I can’t seem to fix).
yes, dereferencing the atomic pointer and using the stored value is calling load() internally. So you want to take all of those out of your audio loops so they’re only called once per block.
@eyalamir@modosc I really appreciate your responses here. So it looks like this is a known problem with atomic. I just successfully reverted the atomic changes from commit 70ddcd on the latest revision of JUCE and, indeed, the CPU performance is back to normal. So I have something to fall back on in case ->load(std::memory_order_relaxed) doesn’t give me good enough performance.
So I take from what you’ve said is that I should reference an atomic as little as possible in my code!
Just to make it more obvious for your way of working I’ve modified eyalamir’s code:
If you were to look at the assembler generated for this code you would see that it will take the value from the std::atomic and place it in a register for use within the loop. The version which uses the *gain directly inside the loop would have to load from memory every iteration of the loop, which is pointless for a value that will not change (I added const to the example to make it obvious to the compiler that it cannot change), and will be orders of magnitude slower than using a register.
Great advice. I learned a long time ago to keep inner loops as simple as possible. I’m quite old school, so when it comes to code in inner loops, as a rule of thumb I assume that the compiler is dumb and won’t spot what look like to me like obvious optimisations. The case above is a very good example of this. Get the value once outside the loop; problem never exists in the first place. HTH! Pete
Its good practice to make a local copy each used parameter value in the beginning of the processBlock routine, among the atomic performance problem, it theoretically can change while running the processBlock method.
There shouldn’t be a performance difference unless the optimiser was doing something previously you weren’t intending…
The main thing with std::memory_order_relaxed is that it might not actually be working as you expect as it means reads and writes can be re-ordered so you might get old values. (This isn’t usually a problem with visuals but probably not what you want on the audio thread).
If you want to get better performance than sequentially consistent but your code to be correct you probably want acquire/release ordering on load/stores.
I’ve never actually profiled them directly as the use case and hence gains will be different but it will be interesting to see your results if you do go down this route.
I recommend not to use std::memory_order_relaxed in audio threads, in UI you’re probably ok but that’s less CPU critical anyway so I often don’t specify memory ordering at (which defaults to std::memory_order_seq_cst, “sequential consistency”).
Ok cool, so it looks like I’m implementing in the style of *gain across the board at the start of every process block and at the start of the timer callback in my editor. I’ll report back with the findings of cpu improvements.
If I still have an unacceptable CPU hit I’m going to roll back to original pointers by reverting JUCE commit 70ddcd. I can’t afford to have any CPU increase on iOS or older devices will not keep up (Spacecraft is at it’s limit on many older devices so I don’t have any room for increase!).
Thanks to all that have contributed to this very helpful thread!
Moving them to the start of process block is the change I’m about to make. Currently, the dereferences are scattered all throughout my code, inside audio loops, and they are legion! This must be a best practice I missed along the way, oops.