Hi there! I’m building my first audio plugin and have been learning a ton over the past few months about JUCE and C++. I’m currently at the performance-optimization stage of development for my algorithms, a few of which are not time-invariant, and I’m wondering how to optimize these processes.
For time-invariant processes, my best strategy has been to use FloatVectorOperations (or the built-in methods that use these) on each channel pointer of the input block. However, for time-dependent processes, you can’t do this. For example, in one part of the plugin I built a simple compressor that follows the envelope of the signal and applies gain-reduction accordingly. How does one optimize a system like this?
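To make it concrete, the per-sample loop in my compressor looks roughly like this (heavily simplified; all names and coefficient values are placeholders, not my actual plugin code):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Minimal downward compressor sketch: the envelope depends on its own
// previous value, so each sample depends on the one before it.
struct SimpleCompressor
{
    float envelope      = 0.0f;   // state carried across samples
    float attackCoeff   = 0.9f;   // placeholder smoothing coefficients
    float releaseCoeff  = 0.99f;
    float threshold     = 0.5f;
    float ratio         = 4.0f;

    void process (float* samples, int numSamples)
    {
        for (int i = 0; i < numSamples; ++i)
        {
            float level = std::abs (samples[i]);

            // One-pole envelope follower: time-dependent recursion.
            float coeff = (level > envelope) ? attackCoeff : releaseCoeff;
            envelope = coeff * envelope + (1.0f - coeff) * level;

            // Gain computer: reduce the part of the level above threshold.
            float gain = 1.0f;
            if (envelope > threshold)
            {
                float target = threshold + (envelope - threshold) / ratio;
                gain = target / envelope;
            }

            samples[i] *= gain;
        }
    }
};
```

Because `envelope` feeds back from one sample to the next, I can’t just hand the whole channel to a vector operation.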
One idea I had was to use FloatVectorOperations over channels rather than over samples. I.e., I’d loop through the buffer samples and create a std::vector with two elements, each one being the sample for a given channel at that time. For each step in my algorithm, I’d use FloatVectorOperations on these vectors. Something like this:
for (int i = 0; i < numSamples; ++i)
{
    std::vector<float> frame { left[i], right[i] };   // one sample per channel
    FloatVectorOperations::addWithMultiply (frame.data(), otherVector.data(), gain, numChannels);
}
I found that this didn’t provide a performance improvement, though I suspect that the overhead of creating all of those vectors is eating whatever gain there was.
Is this a strategy that people use? If not, why not? Are there any other strategies for performance-optimizing a time-dependent process like a compressor?
To make informed decisions about possible performance optimisation approaches, you should first learn a bit about why accelerator functions like juce::FloatVectorOperations, Intel IPP or functions from the Apple Accelerate framework can make a performance gain possible at all. By the way, that applies to all kinds of performance optimisation.
One of the key optimisations these functions use is SIMD operations under the hood. SIMD (single instruction, multiple data) operations are special CPU instructions where the same operation is executed on 4 or more float values that are adjacent in memory, while only taking roughly the execution time of the equivalent scalar instruction that would have computed the same operation on a single value. So a speedup of approximately 4x. Still, this comes at the cost of loading the values into dedicated CPU registers, plus the overhead of calling a function and a bit more. For large enough buffers this overhead is small compared to the speedup of the actual calculation, but for a 2-element vector that will probably not be the case.
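To illustrate the overhead point, here is a plain-C++ stand-in for an accelerated routine (the real JUCE call would be something like juce::FloatVectorOperations::multiply; the function names and the 2-channel layout here are just assumptions for the sketch):

```cpp
#include <cassert>
#include <vector>

// Stand-in for an accelerated whole-buffer routine. The fixed per-call cost
// (function call, loading values into SIMD registers) is paid exactly once,
// and the long contiguous loop is easy for SIMD to chew through.
void multiply (float* dest, const float* src, int num)
{
    for (int i = 0; i < num; ++i)
        dest[i] *= src[i];
}

// The anti-pattern from the question: one tiny call per sample frame.
// Same arithmetic, but the fixed per-call overhead is now paid numFrames
// times, on runs too short (2 floats) for SIMD to pay off.
void multiplyPerFrame (float* dest, const float* src, int numFrames, int numChannels)
{
    for (int f = 0; f < numFrames; ++f)
        multiply (dest + f * numChannels, src + f * numChannels, numChannels);
}
```

Both functions compute identical results; only the call granularity differs, and that granularity is where the time goes.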
Still, you can use SIMD acceleration in cases like that, but you have to use SIMD manually. Well, semi-manually. JUCE comes with the juce::dsp::SIMDRegister wrapper that lets you access the architecture-specific SIMD functionality in a nice cross-platform way. The idea is that you load your two channel samples into a single register and then use that register in your processing code. As ARM NEON and Intel SSE have SIMD registers that can hold up to 4 floats, this straightforward approach works for plug-ins with up to 4 I/O channels. This tutorial is probably a good introduction to that.
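A sketch of the idea, kept self-contained here: the std::array below is only a stand-in for juce::dsp::SIMDRegister<float> (which overloads the arithmetic operators, so real code reads much the same); the function and variable names are placeholders, not JUCE API:

```cpp
#include <array>
#include <cassert>

constexpr int kNumChannels = 2;
using Frame = std::array<float, kNumChannels>;   // stand-in for SIMDRegister<float>

// One-pole smoothing where both channels advance in lock-step: the
// recursion over time stays sample-by-sample (unavoidable), but each
// step updates all channels at once. With SIMDRegister, the inner
// channel loop below collapses into a single vector instruction.
void smoothAcrossChannels (Frame* frames, int numFrames, Frame& envelope, float coeff)
{
    for (int i = 0; i < numFrames; ++i)
        for (int c = 0; c < kNumChannels; ++c)
            envelope[c] = coeff * envelope[c]
                        + (1.0f - coeff) * frames[i][c];
}
```

The point is that the serial dependency only exists along the time axis; across channels the work is independent, which is exactly what a SIMD register exploits.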
Last but not least, before starting with optimisations like this, there is one big optimisation that you should do, which is: never create new std::vector instances in your processing callback. std::vector usually stores its data on the heap, using heap memory allocation (new etc.). You probably already know that you should never call these in performance-critical code, since they can vary massively in their return time. Creating a std::vector does the same, only hidden (implementation-specific small-vector optimisations aside for now). This does not mean you should not use or access std::vector instances in the processing callback; only that creating and resizing should always happen during prepare, and these vectors should be members of your classes that are re-used over and over without being re-created.
Another thing to consider when tuning code is that you should only tune a segment of the code that is known to you to be a performance bottleneck. This means the starting point for optimisation is always profiling. After you have done some optimisation, profile again and see whether the impact of the functions you modified has been lowered in the profiling result.
And of course, only judge performance on release builds.