Speed up code for upsampled processing in Plugin

Hey there,

I’m trying to implement a plugin using the rt-wdf library by Maximilian Rest to create an analog-sounding distortion (and because I’m a little intrigued by the method). Processing samples with the high-level functions provided by the library is pretty straightforward and works in real time for the example Fender tonestack circuit on my machine, but, as one would expect, it introduces aliasing. I’m therefore using the Oversampling class from the juce::dsp module to oversample the audio buffer before processing the samples with the wave digital filter (an approach also previously described and implemented in C++/JUCE by @maxprod).

Unfortunately, with the oversampling included, the plugin seems to lose its real-time capability. Even with 2x or 4x oversampling, the plugin causes the audio playback in Reaper to stutter, stopping every half second or so to catch up, and the CPU load skyrockets to about 75%.

Here is the code from the prepareToPlay and processBlock methods, OVERSAMPLING_FACTOR being a macro defined in the header file and oversampler defined as:

juce::dsp::Oversampling<float>* oversampler = new juce::dsp::Oversampling<float> (getTotalNumInputChannels(), OVERSAMPLING_FACTOR, juce::dsp::Oversampling<float>::FilterType::filterHalfBandFIREquiripple, true, false);

also in the header file.

void RtwdfPluginAudioProcessor::prepareToPlay (double sampleRate, int samplesPerBlock)
{
    juce::dsp::ProcessSpec spec;
    spec.sampleRate = sampleRate;
    spec.maximumBlockSize = samplesPerBlock;
    spec.numChannels = getTotalNumInputChannels();
    oversampler->reset ();
    oversampler->initProcessing(spec.maximumBlockSize);


    thisWdfTree = new wdfTonestackTree();
    thisWdfTree->initTree();
    thisWdfTree->setSamplerate(OVERSAMPLING_FACTOR * this->getSampleRate());
    thisWdfTree->adaptTree();
}
void RtwdfPluginAudioProcessor::processBlock (juce::AudioBuffer<float>& buffer, juce::MidiBuffer& midiMessages)
{
    juce::ScopedNoDenormals noDenormals;
    auto totalNumInputChannels  = getTotalNumInputChannels();
    auto totalNumOutputChannels = getTotalNumOutputChannels();

    float bass = apvts.getParameter("BASS")->getValue();
    float mid = apvts.getParameter("MID")->getValue();
    float treble = apvts.getParameter("TREBLE")->getValue();

    thisWdfTree->setParam(0, bass);
    thisWdfTree->setParam(1, mid);
    thisWdfTree->setParam(2, treble);

    for (auto i = totalNumInputChannels; i < totalNumOutputChannels; ++i)
        buffer.clear (i, 0, buffer.getNumSamples());

    auto audioBlock = juce::dsp::AudioBlock<float>(buffer);
    auto context = juce::dsp::ProcessContextReplacing<float>(audioBlock);
    //oversampling:
    auto oversamplingAudioBlock = oversampler->processSamplesUp(context.getInputBlock());

    for (int channel = 0; channel < totalNumInputChannels; ++channel)
    {
        auto* channelPtr = oversamplingAudioBlock.getChannelPointer(channel);
        
        for (int sample = 0; sample < oversamplingAudioBlock.getNumSamples(); sample++)
        {
            thisWdfTree->setInputValue(*(channelPtr+sample)); //access AudioBlock-data via pointer 
            thisWdfTree->cycleWave();
            *(channelPtr+sample) = thisWdfTree->getOutputValue(); //same access as two lines above but in reverse
        }
    }
    oversampler->processSamplesDown(context.getOutputBlock());
}

I’m not sure what I could do to speed up the whole thing, or whether there is something wrong with the processing itself that causes the audio to stutter. Maybe the switch between AudioBlock-based and single-sample processing is also introducing errors? I’m not yet ready to accept that the WDF method itself is too computationally expensive, since it has been implemented before and was said to be real-time capable.

If anyone could help me by suggesting ways to speed this up, or by pointing out any mistakes in the code in this regard, I would appreciate it.

Thanks in advance, people!

Looking at your code, I see that you are not using smart pointers – which you should do for safety reasons – and that you are fetching the parameters via their string identifiers from the apvts in every callback instead of storing the pointers to the underlying atomics once. The first one isn’t a performance issue, and the second one is unrelated to your oversampling, so take them as side notes :wink:
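Something like this would get rid of the per-block string lookups (untested sketch, the member names are made up, the parameter IDs are taken from your snippet):

std::atomic<float>* bassParam   = nullptr;   // hypothetical members in the processor header
std::atomic<float>* midParam    = nullptr;
std::atomic<float>* trebleParam = nullptr;

// in the constructor, after the apvts has been created: one string lookup, ever
bassParam   = apvts.getRawParameterValue ("BASS");
midParam    = apvts.getRawParameterValue ("MID");
trebleParam = apvts.getRawParameterValue ("TREBLE");

// in processBlock: lock-free atomic reads, no string lookups
thisWdfTree->setParam (0, bassParam->load());
thisWdfTree->setParam (1, midParam->load());
thisWdfTree->setParam (2, trebleParam->load());

One caveat: getRawParameterValue() gives you the parameter’s actual (de-normalised) value, while getValue() returns the 0-to-1 normalised one, so you may have to adjust what you pass to setParam accordingly.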

Regarding your main problem: are you sure that you are testing a release build? Never judge performance by debug builds. And if you still see this huge performance impact with a release build, the obvious next step is to profile your code. That will reveal where most of the time is spent and what to optimise first. I don’t know the library you are using, so I can’t say what performance you could usually expect from it, but maybe profiling will reveal hotspots that you can tweak in the library code yourself :slight_smile:

@PluginPenguin thanks! Profiling sounds like a good idea. Do you have any recommendations on what software to use for JUCE plugins? I’ve never done any profiling before, so I’m pretty clueless at this point.

Also thanks for the side notes, it’s always good to hear some best practices etc.!

I’m on Linux (Arch/Manjaro), by the way.

Good advice from PP there, but you’ll also want to avoid doing memory allocations in processBlock and look for opportunities to vectorise your code (though it looks like the WDF tree implementation only allows per-sample processing, unfortunately).

If you haven’t seen it before, then have a good read of this: Ross Bencina » Real-time audio programming 101: time waits for nothing
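To make both points concrete (the allocations and PP’s smart-pointer note), one option is to own the oversampler and the tree as std::unique_ptr members and create everything up front, so the audio thread never allocates. Untested sketch, reusing the names from your snippet:

// in the header, instead of the raw pointer initialised with new:
std::unique_ptr<juce::dsp::Oversampling<float>> oversampler;
std::unique_ptr<wdfTonestackTree> thisWdfTree;

void RtwdfPluginAudioProcessor::prepareToPlay (double sampleRate, int samplesPerBlock)
{
    // all allocations happen here, before processBlock starts being called
    oversampler = std::make_unique<juce::dsp::Oversampling<float>> (
        getTotalNumInputChannels(), OVERSAMPLING_FACTOR,
        juce::dsp::Oversampling<float>::FilterType::filterHalfBandFIREquiripple, true, false);
    oversampler->initProcessing ((size_t) samplesPerBlock);
    oversampler->reset();

    thisWdfTree = std::make_unique<wdfTonestackTree>();
    thisWdfTree->initTree();
    thisWdfTree->setSamplerate (OVERSAMPLING_FACTOR * sampleRate);
    thisWdfTree->adaptTree();
}

The body of processBlock can stay as it is, since both objects are still accessed through the arrow operator.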

It’s been a while since I profiled on Linux. I used Intel VTune back then, but I’m not sure whether it works on Arch – I successfully used it on Ubuntu or CentOS, I don’t remember exactly. And it’s x86_64 only, of course, but I guess you are not working on an ARM machine?

Yes, it’s an ARM machine … also VTune is available from the Arch user repository, so I’ll try to use it.

Did they add ARM support to VTune in the meantime?

That’s my understanding too, otherwise I would be happy to use the AudioBlock as intended and get around the sample-by-sample stuff.
Thanks for the further reading! Will look into this.

Should have proof-read that reply :wink: it ISN’T an ARM machine, it’s x86_64. Sorry for the confusion!

Ah I see, thanks for clearing that up :wink:

Hello @copypastecat!

I read about the WDF library by M. Rest and it looks really interesting. However, if I read correctly, it makes use of Armadillo, which is a very useful library for vector and matrix operations in C++ but is expensive in terms of performance (probably @PluginPenguin was also asking about the performance of the external libraries). It is very useful for theoretical studies of your algorithms before the real-time implementation. IIRC, the Eigen library is similar but less expensive than Armadillo.

Nevertheless, I suggest you go into the WDF implementation and try to implement it directly in C++, without the external libraries. IMO that’s not a problem, in particular if you’re not working with multi-port scattering junctions.
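Just to give an idea of the scale of a hand-rolled version: a small series/one-port tree only needs a handful of scalar operations per sample. Here is a rough, untested sketch of an RC lowpass as a WDF (ideal voltage source at the root, series adaptor below it, trapezoidal/bilinear capacitor model). The class and variable names are my own and have nothing to do with the rt-wdf API:

#include <cmath>

class WdfRcLowpass
{
public:
    void prepare (double sampleRate, double resistance, double capacitance)
    {
        Rr  = resistance;
        Rc  = 1.0 / (2.0 * sampleRate * capacitance); // capacitor port resistance (trapezoidal rule)
        Rup = Rr + Rc;                                // adapted port resistance of the series adaptor
        capState = 0.0;                               // capacitor memory: previous incident wave
    }

    double processSample (double vin)
    {
        // 1) waves travelling up the tree (leaves -> root)
        const double bC  = capState;      // capacitor reflects last sample's incident wave
        const double bR  = 0.0;           // matched resistor reflects nothing
        const double bUp = -(bC + bR);    // series adaptor, adapted port towards the root

        // 2) root: ideal voltage source, b = 2*Vs - a
        const double aDown = 2.0 * vin - bUp;

        // 3) waves travelling back down through the series adaptor
        const double waveSum = bC + bR + aDown;
        const double toCap   = bC - (Rc / Rup) * waveSum; // wave sent into the capacitor
        capState = toCap;                                 // becomes next sample's reflection

        // 4) output: capacitor voltage = (incident + reflected) / 2,
        //    sign flipped to undo the series-junction port orientation (port voltages sum to zero)
        return -0.5 * (toCap + bC);
    }

private:
    double Rr = 1.0, Rc = 1.0, Rup = 2.0;
    double capState = 0.0;
};

Multi-port scattering junctions and nonlinear roots are where the matrix operations come in; adaptors and one-port elements like these are just a few multiplies and adds per sample.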

Could I also ask where the nonlinear processing takes place? IIANM, the Fender tonestack circuit is a purely linear filtering process.

Hi @karota,

you’re right, the tonestack is linear, so the WDF implementation should also be linear (I don’t see any reason why the WDFs shouldn’t preserve linearity, although I’m not exactly an expert…). The distortion I’m hearing may therefore already be a performance issue, even without the oversampling. But since the goal is to use the method for distortion with tube circuits, I included the oversampling stage right from the start without thinking about that, and just jumped to conclusions when I heard the distortion (with no oversampling) in Reaper.

I will try to drop the use of Armadillo, which at least for the tonestack shouldn’t be a problem, and see if it helps. Thanks!


Hey @copypastecat!

Yeah, it’s nice to hear from someone who studies WDFs.
Before going towards multi-port nonlinearities (such as tubes, BJTs…), I suggest you get a clear understanding of DFLs, adaptation, and series, parallel and multi-port junctions. These concepts are well explained in the literature (papers by Werner, D’Angelo and Bernardini, for instance).
After that, try to implement systems with one-port nonlinear elements first, such as diodes, since in that case you may not need iterative solvers and you can also apply an anti-aliasing technique to reduce the computational cost (papers by Parker and Albertini explain the Antiderivative Anti-Aliasing mechanism; some implementations can be found here: GitHub - jatinchowdhury18/ADAA: Experiments with Antiderivative Antialiasing).
At that point I think you can deal with systems containing multiple and/or multi-port nonlinear elements, which in most cases come with a high computational cost.
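To make the ADAA idea concrete: for a memoryless nonlinearity f, the first-order method replaces f(x[n]) by (F(x[n]) - F(x[n-1])) / (x[n] - x[n-1]), where F is an antiderivative of f, falling back to f at the midpoint when the two samples are nearly equal. A minimal, untested sketch for f = tanh (my own naming, not tied to any particular library):

#include <cmath>

// First-order antiderivative antialiasing for y = tanh(x),
// using the antiderivative F(x) = log(cosh(x)).
class TanhADAA1
{
public:
    float processSample (float x)
    {
        const float diff = x - xPrev;
        float y;

        if (std::abs (diff) < 1.0e-5f)
            y = std::tanh (0.5f * (x + xPrev));               // ill-conditioned division: use the midpoint
        else
            y = (antideriv (x) - antideriv (xPrev)) / diff;   // (F(x[n]) - F(x[n-1])) / (x[n] - x[n-1])

        xPrev = x;
        return y;
    }

private:
    static float antideriv (float x) { return std::log (std::cosh (x)); }

    float xPrev = 0.0f;
};

(For very large inputs you would want a numerically safer antiderivative, since cosh overflows quickly in single precision, but for audio-range signals this is fine.)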

Hope all is clear!


Alright, quick update if anyone is interested or reads this sometime later with a similar problem:

  1. @PluginPenguin was right: because Reaper was scanning an old folder, I was indeed testing a debug build :grimacing:. Switching to the release build solved the problem with CPU load and oversampling; the plugin now runs smoothly, even at 16x oversampling. Out of interest, I also changed the library’s LAG engine to Eigen/Dense (which, as @karota already pointed out, is supposed to be a lot faster, see also this blogpost). I listed the resulting CPU loads in Reaper for both Armadillo and Eigen below. Since they barely differ between the two libraries and seem manageable in general, I assume the linear algebra operations aren’t a huge factor in the overall performance.

  2. HOWEVER: the strange distortion is still there. It seems to be especially strong at very low frequencies, and it is a lot heavier on the right channel than on the left. Looking at the plugin’s output for a sine sweep confirms that. I also took the time to quickly measure the plugin’s frequency response (using Reaper’s white-noise generator and Octave’s FFT/PSD capabilities) and compare it to the frequency response LTspice produces for the real circuit. The plugin seems to have a pretty strong resonance at its maximum around 50 Hz. Some plots can be found below (sorry for the poor axis scaling/labelling, but you get the gist…).

If anyone has any ideas on what might cause that low-end distortion, please let me know. The difference between the right and left channel leads me to believe that it might be leftover chunks of data somewhere in the memory the WDF library uses to compute the output that distort the signal. I’ll look into that and maybe post here again if I find out what causes the distortion or manage to fix it.

Performance of the plugin with the two LAG libraries:

Oversampling factor | CPU load (Armadillo) | CPU load (Eigen/Dense)
1x                  | 1.3%                 | 1.35%
2x                  | 2.75%                | 2.5%
4x                  | 5.2%                 | 5.1%
8x                  | 10%                  | 9.5%
16x                 | 11.8%                | 11.5%

The WDF implementation’s output for a pure sine sweep (top: left channel, bottom: right channel):

The WDF implementation’s frequency response (all dials at 100%):

The physical circuit’s frequency response (all dials at 100%):

tonestack-ltspice-fr-cens.pdf (75.3 KB)

Hi @copypastecat,

first of all, you need to allocate memory for each filter’s state separately for each processing channel! As far as the WDF tree is concerned, I suppose you need to allocate one WDF tree per channel!

Read here: #1 most common programming mistake that we see on the forum
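Based on the snippet you posted earlier, that could look roughly like this (sketch only; and remember that setParam then has to be called on every tree):

// header: one adapted tree per channel instead of a single shared one
std::vector<std::unique_ptr<wdfTonestackTree>> wdfTrees;

// prepareToPlay: build the trees (oversampler setup as before)
wdfTrees.clear();
for (int ch = 0; ch < getTotalNumInputChannels(); ++ch)
{
    auto tree = std::make_unique<wdfTonestackTree>();
    tree->initTree();
    tree->setSamplerate (OVERSAMPLING_FACTOR * sampleRate);
    tree->adaptTree();
    wdfTrees.push_back (std::move (tree));
}

// processBlock, inside the channel loop:
auto& tree = *wdfTrees[(size_t) channel];
for (size_t sample = 0; sample < oversamplingAudioBlock.getNumSamples(); ++sample)
{
    tree.setInputValue (channelPtr[sample]);
    tree.cycleWave();
    channelPtr[sample] = tree.getOutputValue();
}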


Yep, you’re right. Using a separate WDF tree per channel solved the problem. No more distortion of any kind. Thanks!
