AudioBuffer<double/float> performance question (Stack/Heap)

Hello colleagues,

Sorry if this question has already been asked (if so, let me know); I just couldn't find it. As far as I understood from old forum threads and from the code, AudioBuffer allocates memory on the heap (which is many times slower).

Is this approach used in all modules of your synthesizer/plugin? Or is AudioBuffer something better used only in initialization code, while your modules use their own buffers like double[bufferSize][channelSize] or std::array<double, n>?

I decided to rewrite my code to use AudioBuffer, but my observations show the following (4x oversampling x 4 oscillators x 16 unison x 16 voices + 2 heavy Moog filters):

  1. AudioBuffer(2 ch, 256 x 4) - CPU load: 183% of a single core (and no audible sound)
  2. double[256 x 4][2 ch] - CPU load: 37-38% (sounds good)
  3. AudioBuffer(2 ch, 256 x 4), but with 2 unison voices - CPU load: 84%

As you can see, the difference is huge, so I'd prefer a C array/std::array. But maybe I'm wrong and not using AudioBuffer correctly? I tried addSample(), setSample(), getSample(), and getArrayOfWritePointers(), but got the same results.

Test devices:
iPad 12.9-inch (A12X)
MacBook M1 (10-15% less load)

Compiler flags:
-ffast-math
-Ofast
-flto

Thank you.

The reason AudioBuffer lives on the heap is that, even though the number of channels doesn't change during the lifetime of the plugin, the maximum number of samples per block or the sample rate can change. If you use std::array, you have to find a size that works for all those combinations of setups, and it would therefore have to be really huge. All samples in an AudioBuffer are right next to each other in memory, so it's probably already as good as it gets.

@Mrugalla Thanks for the reply. Yes, I understand the issue with large sizes; the app would crash if I tried to use, say, 4096 samples. But the good news is that the performance stays the same even with 2 samples (with std::array or a C array).

So I will probably stay with something like 8 samples (std::array<std::array<double, 2>, 8> or double[8][2]) and just loop the necessary number of times (for example, 256/8 = 32 iterations). If the incoming buffer has fewer than 8 samples, I just leave the remaining cells untouched.
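
Roughly like this (a minimal sketch of the chunking idea; the function and names are placeholders, not my real code):

#include <algorithm>
#include <array>

constexpr int CHUNK_SIZE = 8;

// channelData points at 2 channels of numSamples each
void processChunked(double* const* channelData, int numSamples) {
    std::array<std::array<double, 2>, CHUNK_SIZE> chunk{};

    for (int start = 0; start < numSamples; start += CHUNK_SIZE) {
        // the last chunk may be shorter; the remaining cells stay untouched
        const int n = std::min(CHUNK_SIZE, numSamples - start);

        for (int i = 0; i < n; ++i) {
            chunk[i][0] = 0.0; // ... per-sample DSP goes here ...
            chunk[i][1] = 0.0;
        }

        for (int i = 0; i < n; ++i) {
            channelData[0][start + i] += chunk[i][0];
            channelData[1][start + i] += chunk[i][1];
        }
    }
}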

And there's no chance of getting the same performance with AudioBuffer<>, is that right? I was just interested in hearing other people's experiences, since I only recently started learning the JUCE framework and synthesis in general.

While using heap memory vs. stack memory has a (very minor!) additional cost of one extra pointer dereference, you pay that cost only once when iterating over the buffer, and there's no way it would make your code 5 times slower.

You probably made some mistake in how you work with the buffers. If you can post some of your code here, I'm sure people will be happy to help you fix it. :)

Re-reading the OP made me realize something: are you allocating the AudioBuffers during processBlock()?

What you should be doing is preallocating those buffers during prepareToPlay() (with the correct buffer size, number of channels, etc.), where heap allocation is fine, and only reading/writing them during processBlock(). For example:
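
A minimal sketch of that pattern (the class and member names here are hypothetical):

#include <juce_audio_basics/juce_audio_basics.h>

class OscModule {
public:
    void prepareToPlay(int samplesPerBlockExpected, double /*sampleRate*/) {
        // Heap allocation is fine here: we are not on the audio thread yet.
        preBuffer.setSize(2, samplesPerBlockExpected);
    }

    void processBlock(juce::AudioBuffer<double>& buffer) {
        // No allocation here: only clear/read/write the preallocated buffer.
        preBuffer.clear(0, buffer.getNumSamples());
        // ... DSP that reads/writes preBuffer and adds into buffer ...
    }

private:
    juce::AudioBuffer<double> preBuffer;
};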

It is impossible to bound how long a heap allocation takes: it can be super fast, but it can also take very long. It is non-deterministic, and that's why it is a no-go on a realtime thread.

Just reiterating the difference of heap and stack memory:

  • Heap memory is accessible from all threads of a process, which is why the memory management needs to synchronise. This synchronisation can take an indefinite amount of time if a low-priority thread was unlucky enough to be put to sleep while holding a lock for allocation.
  • Stack memory is thread-exclusive, so no synchronisation is needed. Additionally, allocations are last-in, first-out (LIFO), so no fragmentation is possible.

References: Realtime 101 by @dave96 and @fr810:

@eyalamir I'm allocating the buffer only in prepareToPlay(), not on the realtime thread.

Here is my code with c array:

// 256 * 4 oversampling with 2 channels
typedef double SynthBuffer[constants::BUFFER_SIZE_OVERSAMPLING][constants::CHANNELS_COUNT];
...
inline void processBlock(OscState &oscState, SynthBuffer &buffer, int length) {

            auto maxFreq = sampleRate_ * 0.5;

            SynthBuffer preBuffer;
            for (int i = 0; i < length; ++i) {
                preBuffer[i][0] = 0.0;
                preBuffer[i][1] = 0.0;
            }

            for (int i = 0; i < length; ++i) {

                // Some ramp and matrix tasks
                processRamp();
                updateState(oscState);

                // Get frequency
                auto freq = oscState.freq;
                applyFineAndSemi(freq);
                freq = tools::clamp(freq, constants::MIN_FREQ, maxFreq);

                // Determine octave for wavetable
                auto octave = tools::getOctaveByFreq(freq);
                uint32_t sampleOffset = 3 * constants::SAMPLE_LENGTH; // morphIndex * sampleCount;

                // Get wave from Wavetable
                auto *sample = oscWaveShape.sample_[octave] + sampleOffset;

                for (int u = 0; u < numOfUnison_; ++u) {
                    auto pos = oscState.samplePosition[u];
                    double sound = sample[uint32_t(pos)];
                    preBuffer[i][0] += sound * unisonPanMultipliers_[u][0]; // make our unison stereo
                    preBuffer[i][1] += sound * unisonPanMultipliers_[u][1];
                }

                // Move sample positions in a separate loop: +20-30% performance
                for (int u = 0; u < numOfUnison_; ++u) {
                    auto &pos = oscState.samplePosition[u];
                    pos = pos + (freq * freqMultipliers_[u]); // * unison detune
                    if (pos >= constants::SAMPLE_LENGTH_DOUBLE) {
                        pos = pos - constants::SAMPLE_LENGTH_DOUBLE;
                    }
                }
            }

            // apply gain
            auto gain = oscState.gain;
            for (int i = 0; i < length; ++i) {
                buffer[i][0] += preBuffer[i][0] * gain;
                buffer[i][1] += preBuffer[i][1] * gain;
            }

        }

Code with AudioBuffer:

void prepareToPlay(int samplesPerBlockExpected, double sampleRate) {
            sampleRate_ = sampleRate;
            delete preBuffer_;
            preBuffer_ = new AudioBuffer<double>(constants::CHANNELS_COUNT, samplesPerBlockExpected);
        }

        inline void processBlock(OscState &oscState, AudioBuffer<double> &buffer) {

            auto maxFreq = sampleRate_ * 0.5;
            int length = buffer.getNumSamples();

            preBuffer_->clear();

            for (int i = 0; i < length; ++i) {

                // Some ramp and matrix tasks
                processRamp();
                updateState(oscState);

                // Get frequency
                auto freq = oscState.freq;
                applyFineAndSemi(freq);
                freq = tools::clamp(freq, constants::MIN_FREQ, maxFreq);

                // Determine octave for wavetable
                auto octave = tools::getOctaveByFreq(freq);
                uint32_t sampleOffset = 3 * constants::SAMPLE_LENGTH; // morphIndex * sampleCount;

                // Get wave from Wavetable
                auto *sample = oscWaveShape.sample_[octave] + sampleOffset;

                for (int u = 0; u < numOfUnison_; ++u) {
                    auto pos = oscState.samplePosition[u];
                    double sound = sample[uint32_t(pos)];
                    preBuffer_->addSample(0, i, sound * unisonPanMultipliers_[u][0]);  // make our unison stereo
                    preBuffer_->addSample(1, i, sound * unisonPanMultipliers_[u][1]);
                }

                // Move sample positions in a separate loop: +20-30% performance
                for (int u = 0; u < numOfUnison_; ++u) {
                    auto &pos = oscState.samplePosition[u];
                    pos = pos + (freq * freqMultipliers_[u]); // * unison detune
                    if (pos >= constants::SAMPLE_LENGTH_DOUBLE) {
                        pos = pos - constants::SAMPLE_LENGTH_DOUBLE;
                    }
                }
            }

            // apply gain
            auto gain = oscState.gain;
            for (int i = 0; i < length; ++i) {
                buffer.addSample(0, i, preBuffer_->getSample(0, i) * gain);
                buffer.addSample(1, i, preBuffer_->getSample(1, i) * gain);
            }

        }


    private:
         ...
        AudioBuffer<double> *preBuffer_{};

I also tried working with getArrayOfWritePointers() / getArrayOfReadPointers(), but got the same result.

An important note: if I don't use preBuffer but add samples directly to the incoming buffer, the CPU load is 120% even with a C array.

My wavetable is 10 x 128 x 1024 in size, i.e. 10 octaves, each octave containing 128 waves of 1024 samples each. P.S. Even if I use a simple double[1024], the performance is the same, so the issue is definitely not the wavetable.

Without analyzing the insides of your performance problems too much, a couple of things:

  1. You don’t need to store AudioBuffers as pointers. AudioBuffers themselves handle the memory of what’s inside them. Allocating them with a pointer adds yet another heap indirection which isn’t needed.
//As a class member:
AudioBuffer<double> preBuffer;

//In prepareToPlay:
preBuffer.setSize(numChannels, numSamples);
  2. When you call clear(), you're actually clearing the maximum size of the preallocated array every time, which is slower than clearing only the portion about to be used in the current process block, so this should be faster:
preBuffer.clear(0, buffer.getNumSamples());
  3. Calling addSample() for every sample is very slow, which I assume is the main cause of your performance problems. Internally it sets an atomic variable, so it can't be optimized by the compiler.

Instead, use addFrom(), which does a vectorised operation.
Generally speaking, whenever you can do an operation over an entire buffer/array, it's much faster than doing per-sample operations and bouncing between channels.

Check out FloatVectorOperations.
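
For example, the per-sample gain loop at the end of your processBlock() could become one vectorised call per channel (just a sketch, assuming preBuffer_ holds the summed unison signal and gain is a plain double):

const int numSamples = buffer.getNumSamples();
const auto gain = oscState.gain;

// one addFrom() per channel replaces 2 * numSamples addSample() calls
for (int ch = 0; ch < buffer.getNumChannels(); ++ch)
    buffer.addFrom(ch, 0, *preBuffer_, ch, 0, numSamples, gain);

// or, equivalently, on the raw channel pointers:
for (int ch = 0; ch < buffer.getNumChannels(); ++ch)
    FloatVectorOperations::addWithMultiply(buffer.getWritePointer(ch),
                                           preBuffer_->getReadPointer(ch),
                                           gain, numSamples);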

@eyalamir Strange, for me there is no difference. I applied the following things:

//As a class member:
AudioBuffer<double> preBuffer;

//In prepareToPlay:
preBuffer.setSize(numChannels, numSamples);

// add from
buffer.addFrom();

Even when I use FloatVectorOperations::add(…, length), the CPU is more loaded, by 7-10%. So a basic for (auto &b : buffer) or for (int i = 0; …) is faster for me. Maybe it's an ARM quirk, not sure.

Regarding AudioBuffer: the load is still more than 100% CPU, even with addFrom() or getArrayOfWritePointers().

Interesting.

It probably requires more investigation into what exactly is different about your memory access pattern between the two versions. I'm willing to bet there's a hidden detail in your exact usage that matters far more than some extra pointer dereference.

You are comparing those performance issues in release builds, right?

@eyalamir Yes, in release mode. Perhaps the issue is that I'm using a 1024 (256 x 4) sample length. As far as I know, Apple's Accelerate framework can handle 16 samples at a time.

A similar issue was discussed in this thread - No performance improvement with FloatVectorOperations

The size of the buffer is fine. FVO knows how to handle different buffer sizes and divide them correctly into smaller lengths that fit in the registers.

FVO should still be fast, even though sometimes regular for loops have advantages over manual vectorization, as the compiler can see through more than one call, which it can't with manual FVO.

But anyway, you can also use getArrayOfWritePointers(), which should give an almost identical result to std::array/C arrays. If it doesn't, I'd take a close look at what you're doing with each AudioBuffer access, to make sure it's correct and not doing wasteful operations.
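
A minimal sketch of what I mean (someSample and the gains are placeholders); after the one getArrayOfWritePointers() call it's plain pointer indexing, the same as a C array:

auto** data = buffer.getArrayOfWritePointers();

for (int i = 0; i < buffer.getNumSamples(); ++i) {
    data[0][i] += someSample * leftGain;  // no per-sample JUCE calls,
    data[1][i] += someSample * rightGain; // no atomics, nothing hidden
}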

I dare to say that after the optimiser has had its go, this is all micro-optimisation you are doing here.

But what is actually bad is managing that AudioBuffer through a raw pointer: prepareToPlay can be called several times, and every call has to remember to delete the previous object before creating a new one, or you leak.

Like Eyal said: it is not necessary, and in your case even harmful; not for performance, but for code correctness.

@Daniel So far, prepareToPlay has been called just one time:

prepareToPlay: samples: 256, rate: 48000.000000 <- 1 time
2021-08-12 04:05:05.138151+0300 Synth[1475:223820] Metal GPU Frame Capture Enabled
2021-08-12 04:05:05.141200+0300 Synth[1475:223820] Metal API Validation Enabled

@eyalamir
Yeah, I tried getArrayOfWritePointers() before, but as mentioned in previous posts, still more than 100% CPU.

But I changed the code slightly: I added a float[2] accumulator before the loop over the 16 unison voices, and then write the summed unison back to the AudioBuffer. This way I get 41-42% CPU, which is close to the version with a C array or std::array.

inline void processBlock(OscState &oscState, AudioBuffer<float> &buffer) {

            int length = buffer.getNumSamples();

            float positions[numOfUnison_];
            for (int u = 0; u < numOfUnison_; ++u) {
                positions[u] = oscState.samplePosition[u];
            }

            auto *b = buffer.getArrayOfWritePointers();

            for (int i = 0; i < length; ++i) {

                // Some ramp and matrix tasks
                processRamp();
                updateState(oscState);

                auto freq = oscState.freq;
                applyFineAndSemi(freq);
                freq = tools::clamp(freq, constants::MIN_FREQ, maxFreq_);

                // Determine octave for wavetable
                auto octave = tools::getOctaveByFreq(freq);
                uint32_t sampleOffset = 3 * constants::SAMPLE_LENGTH; // testing with morphIndex = 3

                // Get wave from Wavetable
                auto *sample = oscWaveShape.sample_[octave] + sampleOffset;

                float oscSum[constants::CHANNELS_COUNT] = {}; // zero-initialise: we accumulate with += below
                for (int u = 0; u < numOfUnison_; ++u) {
                    float sound = sample[uint32_t(positions[u])];
                    oscSum[0] += sound * unisonPanMultipliers_[u][0];
                    oscSum[1] += sound * unisonPanMultipliers_[u][1];
                }

                b[0][i] = oscSum[0] * oscState.gain;
                b[1][i] = oscSum[1] * oscState.gain;

                // move sample position ++
                for (int u = 0; u < numOfUnison_; ++u) {
                    positions[u] += (freq * unisonFreqMultipliers_[u]);
                }

                for (auto &pos : positions) {
                    if (pos >= constants::SAMPLE_LENGTH_DOUBLE) {
                        pos = pos - constants::SAMPLE_LENGTH_DOUBLE;
                    }
                }

            }

            // bounce back the osc position's
            for (int u = 0; u < numOfUnison_; ++u) {
                oscState.samplePosition[u] = positions[u];
            }

        }

Not sure why the FloatVectorOperations don't improve things; they only slow them down. I will continue to investigate, and maybe try running this on Intel soon.

It is also possible that optimizers are smart enough these days that the compiler already generates vectorized code when I use a simple for or range-based for loop.

Thanks for the help!

It seems like you’re doing a lot of read/write operations and calculations per sample.

I think you should try to reverse that and instead always do simple operations over whole buffers. That way you get much more mileage out of your CPU cache, and you will also see big advantages from manual FVO.

That means you were lucky. The API specifically mentions that the host is allowed to call it as often as it sees fit, and that calls don't need to come in pairs with releaseResources().

Just as a heads up: it's your code ;)

@eyalamir Yes, it turns out there are a lot of write/read operations. But I don't yet understand how to restructure it to make it better.

The main task of this code is to get 16 unison oscillator voices with stereo spread and detune.

I store 16 oscillator positions for each voice and advance them sample by sample in a loop.

Also, as you can see, the stereo and detune unison arrangement is applied per sample, so I cannot simply sum the 16 samples without applying the detune/stereo.

I also tried to have something like this:

  1. buffer with positions
  2. buffer with 16 unisons
  3. buffer with stereo positioning for each 16 unisons
  4. Apply FloatVector… for all the buffers

But with that I got about 86-92% CPU.

Also for example I tried to replace this code:

float positions[numOfUnison_];
for (int u = 0; u < numOfUnison_; ++u) {
    positions[u] = oscState.samplePosition[u];
}

with this:

float positions[numOfUnison_];
FloatVectorOperations::copy(positions, oscState.samplePosition, numOfUnison_);

The CPU load increased from 41-42% up to 47-48%.

In this instance, using FVO probably won't help you, as these aren't big buffers. FVO helps with actual audio buffers, where you might do the exact same operation over 512 samples, 1024 samples, etc.

To make buffer (audio) code really efficient, you need to restructure your algorithms so they look more like:

//Pseudo code:
for (int channel = 0; channel < numChannels; ++channel)
{
    auto channelData = buffer.getWritePointer(channel);
    auto numSamples = buffer.getNumSamples();

    processOSC (channelData, numSamples);
    processUnison (channelData, numSamples);
    processFilter (channelData, numSamples);
}

Structuring your code to work this way takes a bit of mental effort, but you will probably see major improvements afterwards.
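
For instance, one stage written buffer-at-a-time rather than sample-at-a-time might look like this (a sketch; the gain stage just stands in for one of your own DSP stages):

// One pass over contiguous memory per channel: cache-friendly, and easy
// for the compiler (or manual FVO) to vectorise.
void processGain(float* channelData, int numSamples, float gain) {
    FloatVectorOperations::multiply(channelData, gain, numSamples);
}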

One example of prepareToPlay being called again is when the user changes the maximum buffer size or the sample rate, as I mentioned before. But it can also happen at other times I'm not aware of yet.

@eyalamir My structure is similar to your pseudo code. The only bottleneck is processOSC, which actually generates the output sound of the unison oscillators.

The CPU is mostly devoured by the task of advancing the "osc position" (that is, the current position within the 1024-sample wave, which runs in a circle; knowing the position, I can read the amplitude back from the wavetable). The position is simply 0-1023, for each voice, for each osc, for each unison voice.

So I'm advancing about 1,048,576 float values per block (4 x 16 x 4 x 16 x 256):

4 oversampling
16 voices/notes
4 osc per voice
16 unison per osc
256 buffer size

Without this I cannot fill the oscillator buffer, since the sound must constantly move forward and also be spread across the left and right channels.

It seems that, with your help, I solved the problem with AudioBuffer. The difference is only 1-2% CPU compared to a C array/std::array, which suits me now. I just won't use it in tight inner loops.

I will continue to puzzle over how to speed this up :)
It seems that moving float oscPosition[unisonCount] out of the voice class and into the synth class as one big float oscPosition[1048576] array gains me another 5-10%. But I really don't like this style, and it will be difficult to maintain.
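
For reference, the idea looks roughly like this (a structure-of-arrays sketch with hypothetical names; one flat, contiguous array lets a single tight loop advance every position):

struct OscPositions {
    static constexpr int VOICES = 16, OSCS = 4, UNISON = 16;
    static constexpr int TOTAL = VOICES * OSCS * UNISON; // 1024 positions

    float position[TOTAL] = {};
    float increment[TOTAL] = {}; // per-unison detuned frequency steps

    void advanceAll(float sampleLength) {
        for (int i = 0; i < TOTAL; ++i) {
            position[i] += increment[i];
            if (position[i] >= sampleLength)
                position[i] -= sampleLength;
        }
    }
};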

@Mrugalla
I'm using buffer_.setSize(n, n); won't that be enough? Or do you mean that the memory can become fragmented and I'd better allocate one large chunk of memory?

Thank you!
