DSP Convolver performance

Hi guys,

I’m still running on 6.0.8, so I don’t know if this has already been fixed. Basically, I’m currently using the dsp::Convolution class for short IRs (less than 200 ms long), using non-uniform partitioned convolution with a head size of 1024. Performance is pretty much acceptable, but if I want to use this class for something heavier, the performance becomes unacceptable.

I tested dsp::Convolution::NonUniform against other convolvers, either commercial or free.

Mac Pro 2013 (Xeon-based)
Sample Rate: 44100
Block Size: 256
IR: 24 seconds, 48kHz, Stereo
Ableton Live 11. CPU Meter set on Current

While the others show CPU usage around 4-5%, the JUCE one sits around 60%.

Adjusting the head size seems to improve performance a little, but if I change the block size to a smaller value, I start getting glitches.

The implementation I have is pretty straightforward:

  • a private member:
dsp::Convolution convolver{ juce::dsp::Convolution::NonUniform { 1024 } };

I initialize the convolver as expected, calling prepare() with a ProcessSpec.

Loading and processing, as suggested in another thread:

void processContext (dsp::ProcessContextReplacing<float> context) noexcept
{
    ScopedNoDenormals noDenormals;

    // Load a new IR if there's one available. Note that this doesn't lock or allocate!
    bufferTransfer.get ([this] (myThreadedBuffer& buf)
    {
        convolver.loadImpulseResponse (std::move (buf.buffer),
                                       buf.sampleRate,
                                       dsp::Convolution::Stereo::yes,
                                       dsp::Convolution::Trim::no,
                                       dsp::Convolution::Normalise::yes);
    });

    convolver.process (context);
}

bufferTransfer and myThreadedBuffer are thread-safe classes to pass IR buffers to the convolver.
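For readers who don't have those classes, here's a minimal standalone sketch of the pattern (all names hypothetical, standard C++ only, no JUCE dependency): the message thread sets a new buffer under a lock, while the audio thread only ever try-locks, so it can never block or allocate.

```cpp
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in for the poster's myThreadedBuffer.
struct IRBuffer
{
    std::vector<float> buffer;   // IR samples
    double sampleRate = 0.0;
};

// Hypothetical stand-in for the poster's bufferTransfer.
class BufferTransfer
{
public:
    // Called from the message thread: may block briefly, which is fine there.
    void set (IRBuffer&& newBuffer)
    {
        const std::lock_guard<std::mutex> lock (mutex);
        pending = std::move (newBuffer);
        newDataAvailable = true;
    }

    // Called from the audio thread: try-lock only, so it never blocks.
    // If the message thread happens to hold the lock, the new IR is
    // simply picked up on a later block.
    template <typename Fn>
    void get (Fn&& callback)
    {
        const std::unique_lock<std::mutex> lock (mutex, std::try_to_lock);

        if (lock.owns_lock() && newDataAvailable)
        {
            callback (pending);
            newDataAvailable = false;
        }
    }

private:
    std::mutex mutex;
    IRBuffer pending;
    bool newDataAvailable = false;
};
```

If I remember correctly, JUCE's own convolution tutorial uses the same shape with a SpinLock and ScopedTryLockType; the important property is that the audio-thread side never waits.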

Is there something I’m missing, or is the dsp::Convolution class’s performance really that bad?

Thanks,
Luca

We have our own in-house convolution class, so I can’t comment on the JUCE one, but those numbers look a bit off. Just to make sure: are you testing a release build? The JUCE dsp classes are known to show heavy performance differences between debug and release builds.

This is built in release, and with Live’s CPU meter set to Current. If set to Average, the CPU of the JUCE convolution measures around 16%, which is still high compared to the other plugins.

  1. Make sure that you’re using the VDSP FFT. It should be the default on macOS, but if you’re using the fallback implementation, the performance goes bananas.
  2. Make sure that your reference plugins do not offload the tail calculation to a background thread. Most plugins do this so the CPU % you see in Ableton is only showing the head calculation performance.

I’m using a 3rd-party convolution library in HISE (mostly because I added it before the JUCE convolution was available, but I don’t think there is a big performance difference, as most of the time is spent in the FFT anyway).

However this library has the ability to use a background thread for the tail calculation and if this is enabled (aka cheating), the performance is pretty much like the “commercial” ones.


My first guess would also be that the JUCE fallback FFT engine is being used here. The vDSP FFT will be used by default on macOS, but since it’s an Apple framework it’s obviously not available on Windows and Linux. You have to manually link against an optimised FFT library (I think FFTW and Intel IPP are the current choices) on those platforms in order to get decent performance from anything that relies on the JUCE FFT.

The dynamic choice of the best available FFT implementation at runtime might not be that obvious if you haven’t looked at the implementation and it’s one of the reasons why we use our own classes when it comes to FFT related stuff.

In any case, when facing performance issues I wouldn’t use something like a DAW CPU meter but would always run a profiler to get real detailed insight where most of the time is spent in the code.
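Short of a full profiler session, a rough per-block number is easy to get with std::chrono around whatever callback you want to measure (generic sketch, no JUCE dependency; processBlock and measureCpuPercent are hypothetical stand-ins, not real API):

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the real audio callback being measured.
inline void processBlock (std::vector<float>& block)
{
    for (auto& sample : block)
        sample *= 0.5f;
}

// Time spent per block, expressed as a percentage of the real-time budget
// (the budget is blockSize / sampleRate seconds per callback).
inline double measureCpuPercent (int blockSize, double sampleRate, int numBlocks)
{
    std::vector<float> block ((std::size_t) blockSize, 1.0f);

    const auto start = std::chrono::steady_clock::now();

    for (int i = 0; i < numBlocks; ++i)
        processBlock (block);

    const auto end = std::chrono::steady_clock::now();
    const double elapsed = std::chrono::duration<double> (end - start).count();

    const double budgetPerBlock = blockSize / sampleRate;   // ~5.8 ms at 256 / 44.1 kHz
    const double spentPerBlock  = elapsed / numBlocks;

    return 100.0 * spentPerBlock / budgetPerBlock;
}
```

A DAW meter folds in everything the host does; a number like this isolates just your own processing, though a real profiler will still tell you where inside the call the time goes.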

24 seconds is quite long; if everything is calculated on the audio thread with an FFT size of 1024, that could well explain the high load.
A proper realtime convolution should process only a small fraction on the audio thread, and the rest in larger chunks on a separate thread with quite relaxed timing requirements.
This is how my convolution library works.
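To put rough numbers on that, here is a back-of-the-envelope sketch (assuming uniform 1024-sample partitions for simplicity; the real non-uniform scheme uses progressively larger tail partitions, which is exactly what reduces this cost):

```cpp
// Number of fixed-size partitions needed to cover an IR (per channel).
constexpr long partitionCount (double irSeconds, double sampleRate, int partitionSize)
{
    const long irSamples = (long) (irSeconds * sampleRate);
    return (irSamples + partitionSize - 1) / partitionSize;
}

// partitionCount (24.0, 48000.0, 1024) gives 1125 partitions per channel,
// versus about 10 for a 200 ms IR: over 100x more per-block spectrum
// multiply-accumulate work if every partition is handled on the audio thread.
```

Every output block needs one complex multiply-accumulate pass over each partition's stored spectrum, per channel, so the per-block cost scales directly with that partition count unless the tail is offloaded.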


That’s how it’s supposed to work… I expect the tail to be processed in the background by a thread or a pool of threads.

I already know that FFT library from HiFi-LoFi, but I’d like to keep it as a last resort, since it would mean rewriting the implementation for the products we already have on sale.

I guess building on Mac should enable vDSP for JUCE’s FFT, but at this point I’m not so sure. I’ll try to profile the plugin and see where the bottlenecks are.

Here are the profiling results:

  1. Debug [screenshot]

  2. Release [screenshot]

I confirm that the CPU usage is high even though vDSP is being used.

@reuk I would like to know whether this is the expected performance of dsp::Convolution, or whether there’s something I’m doing wrong. If this is the expected performance of the class, I’ll have to switch to a 3rd-party one, since it’s not usable for anything more than very small IRs.

Looks OK; with many blocks, the complex multiplication usually burns far more cycles than the FFT itself.
Maybe it could be made faster by using vDSP_zvmul for the complex multiplication. Also, make sure your buffers are aligned to 32-byte boundaries.
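For reference, here is what that hot spot looks like in scalar form on split-complex data, i.e. separate real/imaginary arrays (portable sketch; on macOS, Accelerate's vDSP_zvmul performs the multiply part of this loop in vectorised form, and the alignment advice above is what lets those vector loads run at full speed):

```cpp
#include <cstddef>

// Scalar spectrum multiply-accumulate on split-complex data:
// out += a * b, elementwise over numBins complex values.
void complexMultiplyAccumulate (const float* aRe, const float* aIm,
                                const float* bRe, const float* bIm,
                                float* outRe, float* outIm,
                                std::size_t numBins)
{
    for (std::size_t i = 0; i < numBins; ++i)
    {
        // (a + bi)(c + di) = (ac - bd) + (ad + bc)i
        outRe[i] += aRe[i] * bRe[i] - aIm[i] * bIm[i];
        outIm[i] += aRe[i] * bIm[i] + aIm[i] * bRe[i];
    }
}
```

In a partitioned convolver this loop runs once per partition per block, which is why it usually dominates over the FFTs once the partition count gets large.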

Thanks @stenzel, but this shouldn’t be something I (as an end user of JUCE) should have to do; I expect the DSP module to handle that for me. In the meantime, I’m implementing the HiFi-LoFi class. Let’s see how it goes.