Aren't the built-in AudioBuffer operations (such as applyGain(...)) a little inefficient?

I’m learning about using AudioProcessorParameter classes (such as AudioParameterFloat) in order to have the GUI elements of my application feed into my AudioProcessor. In the related tutorial (Tutorial: Adding plug-in parameters), it has me pass my gain parameter into AudioBuffer::applyGain.

But isn’t that bad programming style? applyGain applies itself to every sample in the buffer, meaning it iterates through the whole buffer just to do the multiplication. And in case it’s not already clear: we are already looping through every sample in order to do our DSP.

Isn’t it better to just throw a sample * gain right into the processBlock loop, rather than doing our DSP and then doing another round of it via AudioBuffer::applyGain (*gain)?

Or am I misunderstanding how applyGain works?

If it is indeed better to multiply right inside the processBlock loop, then calling get() on the parameter would be the preferred way to read its current value, right? (sample * gain->get())
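Something like this is what I have in mind (just a sketch; I’m assuming gain is the tutorial’s juce::AudioParameterFloat* member):

```cpp
void processBlock (juce::AudioBuffer<float>& buffer, juce::MidiBuffer&) override
{
    const float currentGain = gain->get(); // read the parameter once per block

    for (int channel = 0; channel < buffer.getNumChannels(); ++channel)
    {
        float* samples = buffer.getWritePointer (channel);

        for (int i = 0; i < buffer.getNumSamples(); ++i)
        {
            float sample = samples[i];
            // ... whatever other DSP we do on `sample` ...
            samples[i] = sample * currentGain; // gain applied in the same loop
        }
    }
}
```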

Cheers

It’s very, very close to the same amount of work, unless the compiler can figure out some really smart way to optimize the combined “your DSP” and gain change code. (I would personally probably do the gain change calculation directly in my own code, because that also allows smoothing the gain change amount per sample.)
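For example, something like this (just a sketch; `samples` is the channel’s write pointer and `smoothedGain` a float member of the processor):

```cpp
// Simple one-pole smoothing of the gain, done per sample in the same loop
// as the rest of the DSP. The coefficient is only an assumption here; tune
// it for your sample rate and the response time you want.
const float targetGain = gain->get();
const float coeff = 0.001f;

for (int i = 0; i < buffer.getNumSamples(); ++i)
{
    smoothedGain += coeff * (targetGain - smoothedGain);

    float sample = samples[i];
    // ... the rest of the DSP on `sample` ...
    samples[i] = sample * smoothedGain;
}
```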

When you say very, very close, do you mean only as close as the cost of looping through the buffer again, or is something special happening? Perhaps with FloatVectorOperations::multiply (which I know is some kind of SIMD-optimized operation)?

Even if it were super close, isn’t the principle of fast code and nothing unnecessary on the audio thread the winning factor here? Also, on that topic, there is an if statement inside applyGain(), and I’m told conditional statements on the audio thread should be avoided whenever possible.

Re-running a loop over the samples is just incrementing the loop counter and doing the end condition check. The actual DSP and the memory accesses involved are likely to be much more expensive in comparison. (Maybe not in the case of a simple gain change implemented by just a multiplication, but in practice you will also need to smooth out the gain parameter changes, and even the simplest smoother will involve a lot more calculations.)

Anyway, you shouldn’t really be worrying about these kinds of low level details at this point. You will just get stuck and won’t get anything interesting done.


Here’s the source for applyGain. There’s some additional logic to check the gain value and to use vector multiplications.
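Paraphrased, the per-channel logic is roughly this (a sketch, not the verbatim JUCE source):

```cpp
// Rough shape of AudioBuffer::applyGain for one channel: do nothing for
// unity gain, clear the buffer for zero gain, otherwise use the SIMD multiply.
void applyGainToChannel (float* data, int numSamples, float gain)
{
    if (gain != 1.0f && numSamples > 0)
    {
        if (gain == 0.0f)
            juce::FloatVectorOperations::clear (data, numSamples);
        else
            juce::FloatVectorOperations::multiply (data, gain, numSamples);
    }
}
```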

And in case it’s not already clear: we are already looping through every sample in order to do our DSP

The looping part is pretty fast. But to be sure, you should benchmark DSP + gain in the same loop vs. DSP by itself followed by applyGain. Unless your DSP is already vectorized (which it won’t be by default), the second loop inside applyGain is probably faster, since it is vectorized.
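A quick way to measure it would be something like this (a sketch using std::chrono; doDsp and doDspPlusGain stand in for your two versions):

```cpp
#include <chrono>

// Time one processing variant over many iterations.
template <typename Process>
double timeInMs (Process&& process, int iterations)
{
    const auto start = std::chrono::high_resolution_clock::now();

    for (int i = 0; i < iterations; ++i)
        process();

    const auto end = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double, std::milli> (end - start).count();
}

// e.g.
// auto separate = timeInMs ([&] { doDsp (buffer); buffer.applyGain (gain); }, 1000);
// auto combined = timeInMs ([&] { doDspPlusGain (buffer, gain); }, 1000);
```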

Out of curiosity, I wrote a test program that filters the audio with a lowpass filter and changes the gain.

Here are some results I got on my Windows Intel i7 system, 64-bit, compiled with Visual Studio 2017:

Separate loops took 963.784 milliseconds
Separate loop and AudioBuffer::applyGain took 963.844 milliseconds
Combined processing took 900.255 milliseconds

The test data used was 30 minutes of stereo audio. As you can see, all the results are pretty close, but as you suspected, in this particular case it can pay off to do the gain change in the same loop as the filtering. The difference is very small, though. Interestingly, JUCE’s vectorization in AudioBuffer::applyGain either doesn’t kick in or matters very little in the end, compared to the cost of the low-pass filtering.

The test program code (if someone finds a mistake, please do point it out :slight_smile: )
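In outline, the variants measured above look like this (a sketch only, not the full program; `filters` holds one juce::IIRFilter per channel, set up with IIRCoefficients::makeLowPass, and each variant processed a fresh copy of the audio):

```cpp
float* data = buffer.getWritePointer (channel);
const int numSamples = buffer.getNumSamples();

// Variant 1: separate loops - filter first, then a second manual gain loop
for (int i = 0; i < numSamples; ++i)
    data[i] = filters[channel].processSingleSampleRaw (data[i]);
for (int i = 0; i < numSamples; ++i)
    data[i] *= gain;

// Variant 2: per-sample filter loop, then AudioBuffer::applyGain on the whole buffer
for (int i = 0; i < numSamples; ++i)
    data[i] = filters[channel].processSingleSampleRaw (data[i]);
buffer.applyGain (gain);

// Variant 3: combined - filter and gain multiply in the same loop
for (int i = 0; i < numSamples; ++i)
    data[i] = filters[channel].processSingleSampleRaw (data[i]) * gain;
```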

edit: I added a 4th processing method, using the filters’ processSamples and then AudioBuffer::applyGain. This came in fastest so far, at around 640 milliseconds.

Ah… puny humans attempting to understand and predict how well a CPU will run some code!

Optimisers and CPUs are just so ridiculously complex now that the best that anyone can do is to come up with a few variations that probably perform differently, and then measure it.

But even when you physically measure your code’s performance, if you then change any of these:

  • the CPU
  • the compiler
  • the compiler settings
  • the amount of cache
  • the amount of memory
  • the OS scheduler
  • the other processes running at the same time
  • the code that ran just before your function was called
  • the parameters your function takes

…then the whole thing might behave differently.

Writing code that’s fast on all platforms is ridiculously hard… (until everyone is writing in SOUL and that problem goes away :slight_smile: )

And beginners should always remember the 3 rules of optimisation: http://wiki.c2.com/?RulesOfOptimization


Wow, I inspired the almost mythical man himself to reply! (Not sure if that’s necessarily a good thing! :sweat_smile:)

Here’s a microbenchmark with the same idea. I tried with explicit SIMD but got about the same results. A little contrived to be sure, but the concept is reliable everywhere I’ve used it.

@jules it’s not crazy to reason that splitting into a non-vectorized/vectorized head/tail loop is going to be faster than putting something that can be vectorized into a loop that isn’t. It’s not black magic…
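For reference, the split looks something like this (an illustrative sketch, not JUCE’s actual code):

```cpp
#include <cstdint>

// Scalar "head" until the pointer is 16-byte aligned, a vectorisable "body",
// then a scalar "tail" for whatever is left over.
void applyGainSplit (float* data, int num, float gain)
{
    int i = 0;

    // head: unaligned scalar samples
    while (i < num && (reinterpret_cast<std::uintptr_t> (data + i) & 15) != 0)
        data[i++] *= gain;

    // body: 4 floats per step, standing in for a SIMD multiply
    for (; i + 4 <= num; i += 4)
    {
        data[i]     *= gain;
        data[i + 1] *= gain;
        data[i + 2] *= gain;
        data[i + 3] *= gain;
    }

    // tail: remaining scalar samples
    for (; i < num; ++i)
        data[i] *= gain;
}
```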

Complexity is the square root of all evil. – /usr/bin/fortune


No, not crazy at all!

However…

a) If you write something like that in a simple way, the compiler actually stands a good chance of being able to generate the vectorised version for you.

b) I can imagine lots of situations where doing two passes of vectorised operations on a buffer would be slower than a single non-vectorised pass because:

  • efficient use of cache-slots can easily outweigh CPU cycles as the bottleneck (but this will depend on lots of factors that you don’t know such as the actual cache size and line layout)
  • there are other ways to vectorise than just doing a simple operation to a sequential buffer - e.g. if you do an inline per-sample multiplication, the compiler will often be able to merge/pipeline it with other operations and make it essentially free (see the sketch below).
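For instance, in a loop like this (a sketch with a first-order filter; b0, b1, a1 and state are the filter’s coefficients and state), the gain multiply can often overlap with the filter arithmetic:

```cpp
// First-order filter with the gain merged into the same iteration. The
// final multiply can be pipelined with the filter maths, so it ends up
// close to free compared with a second pass over the buffer.
for (int i = 0; i < num; ++i)
{
    const float y = b0 * in[i] + state;  // filter output
    state = b1 * in[i] - a1 * y;         // update the filter state
    out[i] = y * gain;                   // gain merged into this iteration
}
```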

My point is really just that unless your day job is working on the LLVM optimiser, don’t trust your intuition, even where something seems “obvious”!

My point was that making it easier for the compiler to auto-vectorize (or using explicitly vectorized loops, like your own applyGain) will be faster than doing it inside a non-vectorized loop. Cache doesn’t really come into play; the data is in the same place regardless of how you loop through it.

I do think you have far too much faith in the compiler for audio DSP… half my day job is writing DSP that’s better optimized than what LLVM produces. The optimizer isn’t magic. Hand-rolled SIMD is pretty commonplace.

Sometimes the compiler does magic. Sometimes it’s utterly dumb!

And sure, hand-rolled SIMD is commonplace, but I think people often assume that SIMD is the holy grail of making something go fast, when in fact it’s just one of many options you need to consider.

The OP’s question was about “good style” in terms of performance, and all I’ve been trying to get across in my answers here is that the whole issue is so complicated that it’s probably not worth worrying about how optimally you think the code will run until you’ve got a real product with a real, measurable performance problem that you can try to improve empirically. Avoid dumb stuff, obviously, but micro-optimising too early is usually a waste of effort.


Similar to, but not exactly, what Donald Knuth said 45 years ago!

Until today I’d only ever come across the bolded section, and in context it has a slightly different meaning from how I’d understood it before:

“Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: **premature optimization is the root of all evil.** Yet we should not pass up our opportunities in that critical 3%.”
