DSP Convolver performance

Hi guys,

I’m still running on 6.0.8, so I don’t know if this has already been fixed. Basically, I’m currently using the dsp::Convolution class for short IRs (less than 200 ms long), using non-uniform partitioned convolution with a head size of 1024. Performance is pretty much acceptable, but if I want to use this class for something heavier, the performance becomes unacceptable.

I tested dsp::Convolution::NonUniform against other convolvers, either commercial or free.

Mac Pro 2013 (Xeon-based)
Sample Rate: 44100
Block Size: 256
IR: 24 seconds, 48kHz, Stereo
Ableton Live 11. CPU Meter set on Current

While the others show CPU usage around 4-5%, the JUCE one sits around 60%.

Adjusting the head size seems to improve performance a little, but if I change the block size to a smaller value, I start getting glitches.

The implementation I have is pretty straightforward:

  • a private member:
dsp::Convolution convolver{ juce::dsp::Convolution::NonUniform { 1024 } };

I initialize the convolver as expected, calling prepare() with a ProcessSpec.

Loading and processing, as suggested in another thread:

void processContext (dsp::ProcessContextReplacing<float> context) noexcept
{
    ScopedNoDenormals noDenormals;

    // Load a new IR if there's one available. Note that this doesn't lock or allocate!
    bufferTransfer.get ([this] (myThreadedBuffer& buf)
    {
        convolver.loadImpulseResponse (std::move (buf.buffer),
                                       buf.sampleRate,
                                       dsp::Convolution::Stereo::yes,
                                       dsp::Convolution::Trim::no,
                                       dsp::Convolution::Normalise::yes);
    });

    convolver.process (context);
}

bufferTransfer and myThreadedBuffer are thread-safe classes to pass IR buffers to the convolver.
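For readers who don't have those classes, here's a minimal standalone sketch of the pattern (all names hypothetical, standard C++ only, no JUCE dependency): the message thread sets a new buffer under a lock, while the audio thread only ever try-locks, so it can never block or allocate.

```cpp
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in for the poster's myThreadedBuffer.
struct IRBuffer
{
    std::vector<float> buffer;   // IR samples
    double sampleRate = 0.0;
};

// Hypothetical stand-in for the poster's bufferTransfer.
class BufferTransfer
{
public:
    // Called from the message thread: may block briefly, which is fine there.
    void set (IRBuffer&& newBuffer)
    {
        const std::lock_guard<std::mutex> lock (mutex);
        pending = std::move (newBuffer);
        newDataAvailable = true;
    }

    // Called from the audio thread: try-lock only, so it never blocks.
    // If the message thread happens to hold the lock, the new IR is
    // simply picked up on a later block.
    template <typename Fn>
    void get (Fn&& callback)
    {
        const std::unique_lock<std::mutex> lock (mutex, std::try_to_lock);

        if (lock.owns_lock() && newDataAvailable)
        {
            callback (pending);
            newDataAvailable = false;
        }
    }

private:
    std::mutex mutex;
    IRBuffer pending;
    bool newDataAvailable = false;
};
```

If I remember correctly, JUCE's own convolution tutorial uses the same shape with a SpinLock and ScopedTryLockType; the important property is that the audio-thread side never waits.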

Is there something I’m missing, or is the dsp::Convolution class’s performance really that bad?

Thanks,
Luca

We have our own in-house convolution class, so I can’t comment on the JUCE one, but those numbers look a bit off. Just to make sure: are you testing a release build? The JUCE dsp classes are known to show heavy performance differences between debug and release builds.

This is built in release, and with Live’s CPU meter set to Current. If set to Average, the CPU of the JUCE convolution measures around 16%, which is still high compared to the other plugins.

  1. Make sure that you’re using the VDSP FFT. It should be the default on macOS, but if you’re using the fallback implementation, the performance goes bananas.
  2. Make sure that your reference plugins do not offload the tail calculation to a background thread. Most plugins do this so the CPU % you see in Ableton is only showing the head calculation performance.

I’m using a 3rd-party convolution library in HISE (mostly because I added it before the JUCE convolution was available, but I don’t think there is a big performance difference, as most of the time is spent in the FFT anyway).

However this library has the ability to use a background thread for the tail calculation and if this is enabled (aka cheating), the performance is pretty much like the “commercial” ones.


My first guess would also be that the JUCE fallback FFT engine is being used here. The vDSP FFT will be used by default on macOS, but since it’s an Apple framework it’s obviously not available on Windows and Linux. You have to manually link against an optimised FFT library (I think FFTW and Intel IPP are the current choices) on those platforms in order to get decent performance from anything that relies on the JUCE FFT.

The dynamic choice of the best available FFT implementation at runtime might not be that obvious if you haven’t looked at the implementation and it’s one of the reasons why we use our own classes when it comes to FFT related stuff.

In any case, when facing performance issues I wouldn’t use something like a DAW CPU meter but would always run a profiler to get real detailed insight where most of the time is spent in the code.
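Short of a full profiler session, a rough per-block number is easy to get with std::chrono around whatever callback you want to measure (generic sketch, no JUCE dependency; processBlock and measureCpuPercent are hypothetical stand-ins, not real API):

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the real audio callback being measured.
inline void processBlock (std::vector<float>& block)
{
    for (auto& sample : block)
        sample *= 0.5f;
}

// Time spent per block, expressed as a percentage of the real-time budget
// (the budget is blockSize / sampleRate seconds per callback).
inline double measureCpuPercent (int blockSize, double sampleRate, int numBlocks)
{
    std::vector<float> block ((std::size_t) blockSize, 1.0f);

    const auto start = std::chrono::steady_clock::now();

    for (int i = 0; i < numBlocks; ++i)
        processBlock (block);

    const auto end = std::chrono::steady_clock::now();
    const double elapsed = std::chrono::duration<double> (end - start).count();

    const double budgetPerBlock = blockSize / sampleRate;   // ~5.8 ms at 256 / 44.1 kHz
    const double spentPerBlock  = elapsed / numBlocks;

    return 100.0 * spentPerBlock / budgetPerBlock;
}
```

A DAW meter folds in everything the host does; a number like this isolates just your own processing, though a real profiler will still tell you where inside the call the time goes.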

24 seconds is quite long; if everything is calculated on the audio thread with an FFT size of 1024, that could well explain the high load.
A proper realtime convolution should process only a small fraction on the audio thread, and the rest in larger chunks on a separate thread with quite relaxed timing requirements.
This is how my convolution library works.
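To put rough numbers on that, here is a back-of-the-envelope sketch (assuming uniform 1024-sample partitions for simplicity; the real non-uniform scheme uses progressively larger tail partitions, which is exactly what reduces this cost):

```cpp
// Number of fixed-size partitions needed to cover an IR (per channel).
constexpr long partitionCount (double irSeconds, double sampleRate, int partitionSize)
{
    const long irSamples = (long) (irSeconds * sampleRate);
    return (irSamples + partitionSize - 1) / partitionSize;
}

// partitionCount (24.0, 48000.0, 1024) gives 1125 partitions per channel,
// versus about 10 for a 200 ms IR: over 100x more per-block spectrum
// multiply-accumulate work if every partition is handled on the audio thread.
```

Every output block needs one complex multiply-accumulate pass over each partition's stored spectrum, per channel, so the per-block cost scales directly with that partition count unless the tail is offloaded.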


That’s how it’s supposed to work… I expect the tail to be processed in the background by a thread or a pool of threads.

I already know that FFT library from HiFi-LoFi, but I’d like to keep it as a last resort, since it would mean rewriting the implementation for the products we already have on sale.

I guess building on Mac should enable vDSP for JUCE’s FFT, but at this point I’m not so sure. I’ll try to profile the plugin and see where the bottlenecks are.

Here are the profiling results:

  1. Debug [screenshot]

  2. Release [screenshot]

I confirm that the CPU usage is high even though vDSP is being used.

@reuk I would like to know whether this is the expected performance of dsp::Convolution, or whether there’s something I’m doing wrong. If this is the expected performance of the class, I’ll have to switch to a 3rd-party one, since it’s not usable for anything more than very small IRs.

Looks OK; with many blocks, the complex multiplication usually burns far more cycles than the FFT itself.
Maybe it could be made faster by using vDSP_zvmul for the complex multiplication. Also, make sure your buffers are aligned to 32-byte boundaries.
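For reference, here is what that hot spot looks like in scalar form on split-complex data, i.e. separate real/imaginary arrays (portable sketch; on macOS, Accelerate's vDSP_zvmul performs the multiply part of this loop in vectorised form, and the alignment advice above is what lets those vector loads run at full speed):

```cpp
#include <cstddef>

// Scalar spectrum multiply-accumulate on split-complex data:
// out += a * b, elementwise over numBins complex values.
void complexMultiplyAccumulate (const float* aRe, const float* aIm,
                                const float* bRe, const float* bIm,
                                float* outRe, float* outIm,
                                std::size_t numBins)
{
    for (std::size_t i = 0; i < numBins; ++i)
    {
        // (a + bi)(c + di) = (ac - bd) + (ad + bc)i
        outRe[i] += aRe[i] * bRe[i] - aIm[i] * bIm[i];
        outIm[i] += aRe[i] * bIm[i] + aIm[i] * bRe[i];
    }
}
```

In a partitioned convolver this loop runs once per partition per block, which is why it usually dominates over the FFTs once the partition count gets large.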

Thanks @stenzel, but this shouldn’t be something I (as an end user of JUCE) should have to do; I expect the DSP module to handle that for me. In the meantime, I’m implementing the HiFi-LoFi class. Let’s see how it goes.