Massive CPU increase on latest version of JUCE

I’m writing this on the off chance somebody knows what’s going on. I recently upgraded JUCE from a May 2019 version to the latest master as of today. From that update alone, I’m seeing an increase in CPU usage on iOS from 9% to 19%! The only changes to the code were a massive number of std::atomic* and a handful of std::unique_ptr.

Does anybody have a clue what’s going on?

1 Like

UPDATE!!

Well this is quite shocking. I decided I would have to endure the horrific task of bisecting the JUCE repo in order to find the point at which the CPU increase occurred. Thankfully, I had a hunch to try commit 70ddcd, which is: APVTS: Use atomic floats for current parameter states

I can confirm that commit eb780b (the one right before 70ddcd) has 9% CPU on iOS and commit 70ddcd (including the required atomic* and unique_ptr changes) has 19% CPU! So my bisection task has been graciously terminated.

Is there a chance that all these atomic* are resulting in a CPU increase? I really have no idea.

Anyway, I’m considering using the latest JUCE revision but reverting (or trying to, at least) the changes made in commit 70ddcd, to see if I can have a working solution with the CPU performance I used to have. Can anyone at JUCE think of a reason not to do this? I really need to use the latest JUCE revision, as for some reason the old May 2019 revision no longer works on Android (a null pointer exception that I can’t seem to fix).

The atomics were a great addition that solved a bunch of problems with the old system.

But! There’s actually a good chance they’re adding major overhead if you’re loading them constantly inside hot loops.

If you have code like:

for (int sample = 0; sample < numSamples; ++sample)
    buffer[sample] *= gain->load();

The compiler won’t be able to optimize that loop: each atomic load is required to actually read from memory, so it can’t be hoisted out or vectorized automatically.

Modifying the code to:

auto localGain = gain->load();

for (int sample = 0; sample < numSamples; ++sample)
    buffer[sample] *= localGain;

Should allow the compiler to bring back all these optimizations for the now-clean loop.

2 Likes

i found the atomic performance terrible unless i used

gain->load(std::memory_order_relaxed)

this (mostly) removed the performance hit i saw when i integrated the initial atomic changes to APVTS.
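As a sketch of what that looks like in context (assuming `gain` is the `std::atomic<float>*` handed back by APVTS, here faked with a local atomic so the example is self-contained):

```cpp
#include <atomic>

std::atomic<float> gainState { 1.0f };  // stand-in for the APVTS-owned atomic
std::atomic<float>* gain = &gainState;  // e.g. a cached raw-parameter pointer

void process (float* buffer, int numSamples)
{
    // A relaxed load skips the ordering guarantees of the default
    // seq_cst load, which can be noticeably cheaper on some targets.
    const auto localGain = gain->load (std::memory_order_relaxed);

    for (int sample = 0; sample < numSamples; ++sample)
        buffer[sample] *= localGain;
}
```

Note the caveat discussed further down the thread: relaxed ordering means reads and writes can be reordered, so make sure that’s acceptable before using it.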

1 Like

Thanks for your response. I don’t use ->load() anywhere in my code, I just reference with *. Does that make a difference regarding your suggestion?

yes, i was dereferencing too and profiling led
me to the atomic call. using the options above fixed it for me.

yes, dereferencing the atomic pointer and using the stored value is calling load() internally. So you want to take all of those out of your audio loops so they’re only called once per block.

1 Like

@eyalamir @modosc I really appreciate your responses here. So it looks like this is a known problem with atomic. I just successfully reverted the atomic changes from commit 70ddcd on the latest revision of JUCE and, indeed, the CPU performance is back to normal. So I have something to fall back on in case ->load(std::memory_order_relaxed) doesn’t give me good enough performance.

So what I take from what you’ve said is that I should dereference an atomic as little as possible in my code!

Thanks again guys.

Just to make it more obvious for your way of working, I’ve modified eyalamir’s code:

If you were to look at the assembler generated for this code, you would see that it takes the value from the std::atomic and places it in a register for use within the loop. The version which uses *gain directly inside the loop has to load from memory on every iteration, which is pointless for a value that will not change (I added const to the example to make it obvious to the compiler that it cannot change), and will be orders of magnitude slower than using a register.

You can see in this noddy example that this is true.
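The linked example isn’t reproduced in the thread, but as a hedged sketch, the const-qualified version presumably looked something like this (all names hypothetical):

```cpp
#include <atomic>

std::atomic<float> gainState { 0.5f };  // hypothetical parameter storage
std::atomic<float>* gain = &gainState;  // e.g. a cached raw-parameter pointer

void applyGain (float* buffer, int numSamples)
{
    // One atomic load per block; const makes it obvious to both the
    // reader and the compiler that the value cannot change, so it can
    // sit in a register for the whole loop.
    const auto localGain = gain->load();

    for (int sample = 0; sample < numSamples; ++sample)
        buffer[sample] *= localGain;
}
```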

4 Likes

Great advice. I learned a long time ago to keep inner loops as simple as possible. I’m quite old school, so when it comes to code in inner loops, as a rule of thumb I assume that the compiler is dumb and won’t spot what look to me like obvious optimisations. The case above is a very good example of this. Get the value once outside the loop; the problem never exists in the first place. HTH! Pete

1 Like

It’s good practice to make a local copy of each used parameter value at the beginning of the processBlock routine. Besides the atomic performance problem, a parameter can theoretically change while processBlock is still running.
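A sketch of that pattern with more than one parameter (the atomics here are hypothetical stand-ins for pointers you might cache from `getRawParameterValue()` in the constructor):

```cpp
#include <atomic>

// Hypothetical parameter atomics standing in for APVTS-owned state.
std::atomic<float> gainParam { 1.0f };
std::atomic<float> mixParam  { 0.5f };

void processBlock (float* buffer, int numSamples)
{
    // Snapshot every parameter once, up front: one atomic load each,
    // and a consistent value for the whole block even if the UI
    // thread writes to them mid-callback.
    const float gain = gainParam.load();
    const float mix  = mixParam.load();

    for (int sample = 0; sample < numSamples; ++sample)
    {
        const float dry = buffer[sample];
        const float wet = dry * gain;
        buffer[sample]  = dry + mix * (wet - dry); // simple dry/wet blend
    }
}
```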

6 Likes

Do you know if there’s a performance difference between using *gain vs. gain->load(std::memory_order_relaxed) as @modosc suggested?

Either way I have a massive task ahead of me to get this done :smiley:

There shouldn’t be a performance difference unless the optimiser was doing something previously you weren’t intending…

The main thing with std::memory_order_relaxed is that it might not actually be working as you expect as it means reads and writes can be re-ordered so you might get old values. (This isn’t usually a problem with visuals but probably not what you want on the audio thread).

If you want better performance than sequentially consistent ordering but still want your code to be correct, you probably want acquire/release ordering on loads/stores.

sorry Dave I have no idea what acquire/release ordering on load/stores means :smiley:

It’s Mark by the way, I use a weird email for JUCE account :wink:

I can tell by your picture :wink:

Acquire/release are like semi-relaxed atomics.
Basically you have to supply the memory ordering when you do load/stores.
E.g.

std::atomic<float> atomicVariable { 0.0f };
atomicVariable.store (1.0f, std::memory_order_release);

float localVal = atomicVariable.load (std::memory_order_acquire);

I’ve never actually profiled them directly as the use case and hence gains will be different but it will be interesting to see your results if you do go down this route.

gotcha, thanks. But it sounded from your previous post that you were recommending not to use std::memory_order_release no?

I recommend not using std::memory_order_relaxed on audio threads. In the UI you’re probably OK, but that’s less CPU-critical anyway, so there I often don’t specify a memory ordering at all (which defaults to std::memory_order_seq_cst, “sequential consistency”).

Ok cool, so it looks like I’m taking local copies of *gain across the board at the start of every process block, and at the start of the timer callback in my editor. I’ll report back with the findings of any CPU improvements.

If I still have an unacceptable CPU hit, I’m going to roll back to the original pointers by reverting JUCE commit 70ddcd. I can’t afford any CPU increase on iOS or older devices will not keep up (Spacecraft is at its limit on many older devices, so I don’t have any room for increase!).

Thanks to all that have contributed to this very helpful thread!

How many values are you reading?
If you’re only reading them once at the start of processBlock, a doubling of CPU use is a bit odd…

1 Like

Moving them to the start of process block is the change I’m about to make. Currently, the dereferences are scattered all throughout my code, inside audio loops, and they are legion! This must be a best practice I missed along the way, oops.