No performance improvement with FloatVectorOperations

This nicely illustrates an issue that made me stop using FloatVectorOperations at all.

In the second example, using FloatVectorOperations leads to the currentSample buffer being read and written twice, while the first example only reads and writes it once. Clang is pretty good at vectorizing relatively simple loops like this one. This means there really is no point in using FloatVectorOperations. As far as I know, no compiler can combine multiple loops into one to achieve the same result as the first example.

The operations provided by FloatVectorOperations mostly perform very little computation per sample. This means speed is limited by reading and writing the data, and the same memory locations are read and written many times when chaining multiple FloatVectorOperations calls - severely limiting the performance gain achievable with this one-operation-per-call approach.
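
To make that concrete, here is a minimal sketch (hypothetical buffer and parameter names, assuming the usual JUCE headers) of the two approaches. The fused loop touches each sample once; the FloatVectorOperations version walks the same memory twice:

// Fused loop: each sample is read once and written once.
void processFused (float* data, int num, float gain, float offset)
{
    for (int i = 0; i < num; ++i)
        data[i] = data[i] * gain + offset;
}

// Same result with FloatVectorOperations: the buffer is read and
// written once per call, so the memory traffic doubles.
void processWithFVO (float* data, int num, float gain, float offset)
{
    juce::FloatVectorOperations::multiply (data, gain, num);
    juce::FloatVectorOperations::add (data, offset, num);
}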


Here’s how I do it, simplified for this demo.
I have a custom data class that is always accessible as vector AND float.
This always gives the fastest code.

//  make a vector array sized 1024
juce::dsp::SIMDRegister<float> v[1024];
//  now make a float pointer to the same memory
float* f = reinterpret_cast<float*> (v);

// example of adding a constant, using floats
int numOfFloats = (int) juce::dsp::SIMDRegister<float>::SIMDNumElements * 1024;
for (int ff = 0; ff < numOfFloats; ++ff)
    f[ff] += 3.14f;

// same example, now using vectors
auto valV = juce::dsp::SIMDRegister<float>::expand (3.14f);
for (int vv = 0; vv < 1024; ++vv)
    v[vv] += valV;

The cute thing is that this avoids a lot of fumbling: you can always access everything as a float (the f array) or as a vector (the v array).
You might need to use alignas() to align the data to the vector size.
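
For example, a minimal sketch (my illustration, not from the post above) of how alignas() can put a plain float array on a SIMD-friendly boundary, so the same reinterpret_cast trick works in the other direction:

// Align a raw float buffer to the register size so it can also be
// viewed as an array of SIMDRegister<float>:
alignas (sizeof (juce::dsp::SIMDRegister<float>)) float samples[1024] = {};

auto* vecs = reinterpret_cast<juce::dsp::SIMDRegister<float>*> (samples);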

Thanks for the response. I would not say so :blush:, I played around a little without digging into the exact % of load. But with a plain for loop I get 4-6 more voices before I start to hear crackling, while with FloatVectorOperations it already starts to crackle. A benchmark shows the same percentage when I measure the start/end time difference (both in realtime audio and in a simple run from main()).

Yes, sorry, I mixed up AVX with Apple’s SIMD. In any case, everything you mentioned above makes sense. But then the question arises: where should FloatVectorOperations be used? It looks like nowhere?

Well, I just tested it. And here are the results :slight_smile: :

juce::dsp::SIMDRegister:

static constexpr size_t size = 1024 * 1024;
auto valV = dsp::SIMDRegister<float>::expand(3.14F);
auto *array = new juce::dsp::SIMDRegister<float>[size]{};

auto start_time = std::chrono::high_resolution_clock::now();

for (size_t i = 0; i < size; i++) {
    array[i] += valV;
}

auto end_time = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time);
std::cout << "Elapsed time: " << duration.count() << " microseconds\n";

delete[] array;

Elapsed time: 1705 microseconds
Elapsed time: 663 microseconds
Elapsed time: 631 microseconds
Elapsed time: 763 microseconds
Elapsed time: 682 microseconds
Elapsed time: 700 microseconds
Elapsed time: 522 microseconds
Elapsed time: 540 microseconds
Elapsed time: 608 microseconds

Simple float loop

static constexpr size_t size = 1024 * 1024;
auto valV = 3.14F;
auto *array = new float [size]{};
  
auto start_time = std::chrono::high_resolution_clock::now();

for (size_t i = 0; i < size; i++) {
    array[i] += valV;
}

auto end_time = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time);
std::cout << "Elapsed time: " << duration.count() << " microseconds\n";

delete[] array;

Elapsed time: 120 microseconds
Elapsed time: 118 microseconds
Elapsed time: 124 microseconds
Elapsed time: 119 microseconds
Elapsed time: 134 microseconds
Elapsed time: 128 microseconds
Elapsed time: 89 microseconds
Elapsed time: 121 microseconds
Elapsed time: 124 microseconds

So for me it’s on average about 5x slower. Or I misunderstood something and the SIMD code above is wrong.

Almost there!
But you compare 1024 * 1024 vectors against 1024 * 1024 floats,
and that is not a fair comparison.
Look at my example: the number of floats should be
numberOfVectors * dsp::SIMDRegister<float>::SIMDNumElements

Also note: compiler optimisation might give similar results. But with SIMD and a plan, you have a grip on the optimisation: it will not break so easily. I use a lot of non-branching code with SIMD masks.
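
As an illustration of branch-free SIMD code (a minimal sketch of mine, not taken from the post above): SIMDRegister’s min/max let you clamp a whole register without any per-element if, and the comparison functions (equal, greaterThan, …) produce masks for similar selection tricks.

using Vec = juce::dsp::SIMDRegister<float>;

// Clamp every element of x into [lo, hi] without branching:
Vec clampVec (Vec x, Vec lo, Vec hi)
{
    return Vec::min (Vec::max (x, lo), hi);
}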

And: the reinterpret_cast<float*> bit is important for me in a practical sense: you can just drop down to non-vectorised code and back without any data shuffling.

@PaulDriessen Ahh, you are right!!! I forgot that each register already holds 4 values.

So the result is now very close to the basic “for loop”.

Elapsed time: 122 microseconds
Elapsed time: 133 microseconds
Elapsed time: 399 microseconds
Elapsed time: 211 microseconds
Elapsed time: 127 microseconds
Elapsed time: 143 microseconds
Elapsed time: 134 microseconds
Elapsed time: 172 microseconds
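
For reference, the corrected comparison presumably looks like this (my sketch of the fix: same total number of floats, so only size / SIMDNumElements registers):

static constexpr size_t numFloats = 1024 * 1024;
static constexpr size_t numVecs = numFloats / juce::dsp::SIMDRegister<float>::SIMDNumElements;

auto valV = juce::dsp::SIMDRegister<float>::expand (3.14F);
auto *array = new juce::dsp::SIMDRegister<float>[numVecs]{};

for (size_t i = 0; i < numVecs; i++) {
    array[i] += valV;
}

delete[] array;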

(Update)
JFYI, I have the following flags:
-ffast-math
-Ofast
-flto
-stdlib=libc++
-funroll-loops
-ftree-vectorize
-fvectorize


But the OP’s issue is indeed noticeable here too! I did not expect that.
The optimiser does an amazing job: it is very hard to write code that surpasses it.
I’m actually glad I have this data class with built-in handling methods: I can easily check them for speed. I suspect the optimiser is often faster than hand-written SIMD.
My suspicion is that at optimisation level O3 loops are unrolled, and that this doesn’t happen to the same degree with the SIMD stuff. Just a hunch.
Maybe someone else knows more about it?


As someone who did performance optimization consulting for 10+ years and has been extensively benchmarking IPP/vDSP functions over the last months (I was literally doing it when reading this thread), I was ready to fly into this thread guns blazing, 100% sure that mistakes had been made. vDSP will always be faster!

So let’s test it with a real benchmark library, with cache warming, guards against the compiler optimizing away the result, a decent number of iterations, etc.! And actually use the correct vDSP call for this use case!

So I tried it out with 512 samples in a plain old std::vector.

Uh oh…

benchmark name                       samples       iterations    estimated
                                     mean          low mean      high mean
                                     std dev       low std dev   high std dev
-------------------------------------------------------------------------------
loop                                           100           212     2.2684 ms 
                                        106.375 ns     105.85 ns    107.404 ns 
                                        3.55755 ns    1.91206 ns    5.38678 ns 
                                                                               
FloatVectorOperations copy then add            100            53     2.2843 ms 
                                        440.438 ns    439.298 ns    444.786 ns 
                                        10.0575 ns    2.32347 ns    23.3693 ns 
                                                                               
vDSP_vsmsma                                    100            44     2.2924 ms 
                                        542.155 ns    536.908 ns    556.661 ns 
                                        41.1198 ns    17.1323 ns    87.2612 ns
Code: Catch2 benchmarks for 512 samples
SECTION ("nerd sniped")
    {
        std::vector<float> A;
        std::vector<float> B;
        std::vector<float> result;
        A.resize (512);
        B.resize (512);
        result.resize (512);
        float alpha = 3.5f;
        float beta = 1.2f;

        BENCHMARK ("loop")
        {
            for (size_t i = 0; i < result.size(); ++i)
            {
                result[i] = A[i] * alpha + B[i] * beta;
            }
            return result;
        };

        BENCHMARK ("FloatVectorOperations copy then add")
        {
            juce::FloatVectorOperations::copyWithMultiply (result.data(), A.data(), alpha, 512);
            juce::FloatVectorOperations::addWithMultiply (result.data(), B.data(), beta, 512);
            return result;
        };

        BENCHMARK ("vDSP_vsmsma")
        {
            vDSP_vsmsma (A.data(), 1, &alpha, B.data(), 1, &beta, result.data(), 1, 512);
            return result;
        };
    }

At 512 samples, not only is the raw loop faster, but the vDSP-specific call is the worst performer. :sob:


What about smaller sample blocks? Here’s 64 samples:

benchmark name                       samples       iterations    estimated
                                     mean          low mean      high mean
                                     std dev       low std dev   high std dev
-------------------------------------------------------------------------------
loop                                           100           637     2.2295 ms 
                                        34.8611 ns    34.8395 ns    34.8957 ns 
                                         0.1366 ns  0.0943064 ns   0.246368 ns 
                                                                               
FloatVectorOperations copy then add            100           529     2.2218 ms 
                                        43.3149 ns    42.8108 ns    44.8642 ns 
                                        3.93341 ns   0.665487 ns    8.53929 ns 
                                                                               
vDSP_vsmsma                                    100           626     2.2536 ms 
                                        35.9621 ns    35.9349 ns    36.0452 ns 
                                       0.221448 ns  0.0873736 ns   0.490156 ns                                                                

Interesting! At 64 samples, vDSP_vsmsma and the raw loop are now about equal.

BUT WAIT THERE’S MORE!!

Right! I was using std::vector above, with no attention to alignment.

Let’s check things out with AudioBlock, with its default alignment of sizeof (SIMDRegister<NumericType>):

benchmark name                       samples       iterations    estimated
                                     mean          low mean      high mean
                                     std dev       low std dev   high std dev
-------------------------------------------------------------------------------
loop                                           100          2006     2.2066 ms 
                                        11.2967 ns    11.2277 ns    11.6072 ns 
                                       0.638054 ns  0.0714107 ns    1.51036 ns 
                                                                               
FloatVectorOperations copy then add            100          1209     2.1762 ms 
                                        18.8336 ns    18.5334 ns    19.3757 ns 
                                        2.00456 ns    1.31133 ns    3.17008 ns 
                                                                               
FloatVectorOperations add then add             100          1097      2.194 ms 
                                        21.4337 ns    21.0371 ns    22.3376 ns 
                                        2.87391 ns    1.53503 ns    5.18361 ns 
                                                                               
vDSP_vsmsma                                    100          1966     2.1626 ms 
                                        10.7294 ns    10.7214 ns     10.753 ns 
                                      0.0649882 ns  0.0280217 ns   0.141415 ns                                                         
Code: AudioBlock, 64 samples
SECTION ("nerd sniped")
    {
        // use audio blocks to ensure alignment
        juce::HeapBlock<char> aData;
        juce::dsp::AudioBlock<float> a = { aData, 1, 64 };

        juce::HeapBlock<char> bData;
        juce::dsp::AudioBlock<float> b = { bData, 1, 64 };

        juce::HeapBlock<char> resultData;
        juce::dsp::AudioBlock<float> result = { resultData, 1, 64 };
        float alpha = 3.5f;
        float beta = 1.2f;

        BENCHMARK ("loop")
        {
            for (int i = 0; i < (int) result.getNumSamples(); ++i)
            {
                result.setSample(0, i, alpha * a.getSample(0, i) + beta * b.getSample(0, i));
            }
            return result.getChannelPointer(0);
        };

        BENCHMARK ("FloatVectorOperations copy then add")
        {
            juce::FloatVectorOperations::copyWithMultiply (result.getChannelPointer(0), a.getChannelPointer(0), alpha, 64);
            juce::FloatVectorOperations::addWithMultiply (result.getChannelPointer(0), b.getChannelPointer(0), beta, 64);
            return result.getChannelPointer(0);
        };

        BENCHMARK ("vDSP_vsmsma")
        {
            vDSP_vsmsma (a.getChannelPointer(0), 1, &alpha, b.getChannelPointer(0), 1, &beta, result.getChannelPointer(0), 1, 64);
            return result.getChannelPointer(0);
        };
    }

Vindicated! :smiling_face_with_three_hearts::smiling_face_with_three_hearts::smiling_face_with_three_hearts: (Just barely!)

With properly aligned memory, FloatVectorOperations is over 2x faster than the raw loop. With the vDSP call specific to the need, it’s just about 5x faster than the aligned raw loop and 10x faster than the unaligned raw loop.

Edit: whoops, got greedy there. There was a convenient typo: the AudioBlocks were sized 512 while the vectorised calls only processed 64 samples. At 64 samples, the “right” vDSP function is only slightly faster than the raw loop. But note the raw loop is almost 3x faster when aligned!

What about AudioBlock’s default alignment with 512 samples?

benchmark name                       samples       iterations    estimated
                                     mean          low mean      high mean
                                     std dev       low std dev   high std dev
-------------------------------------------------------------------------------
loop                                           100           459     2.2491 ms 
                                        49.9297 ns     48.795 ns    53.1631 ns 
                                        8.86625 ns    3.14064 ns    18.6949 ns 
                                                                               
FloatVectorOperations copy then add            100            32     2.3136 ms 
                                        719.249 ns    713.638 ns    737.375 ns 
                                        45.3276 ns    10.7257 ns    98.9282 ns 
                                                                               
vDSP_vsmsma                                    100            62     2.2754 ms 
                                        356.814 ns     355.94 ns    358.407 ns 
                                        5.84885 ns    3.14928 ns    9.64479 ns 

Yikes, looks like the raw loop wins.

So the lesson learned: alignment matters. And for simple loops, it’s better to benchmark vectorized versions before making assumptions.

I’ve been learning a lot of these lessons over and over again lately (be careful making assumptions about performance, be careful about making too many generalizations, triple-check all the numbers). For example, I’m not 100% convinced that returning the std::vector/raw pointer is truly enough to prevent the compiler from optimizing some of the code away, but I’ve tried a few alternatives and got the same consistent result. It’s why I prefer tools like perfetto that can measure real-time performance in-app.
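
One common pattern for this (not something Catch2 ships, and GCC/Clang-specific; a sketch of what other benchmark libraries do internally) is an empty asm block that forces the compiler to consider the value used:

// Tells the optimizer that `value` is observed, so the computation
// producing it cannot be removed (GCC/Clang only):
template <typename T>
inline void doNotOptimizeAway (const T& value)
{
    asm volatile ("" : : "r,m" (value) : "memory");
}

// e.g. inside a BENCHMARK body: doNotOptimizeAway (result.data());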


OK, so the pattern looks something like this:
- alignment matters
- compiler optimisation is hard to beat
Astounding, I must say.
The last hardcore vector function I built uses Taylor series to compute the sin and cos in one run from a polar input array. That actually beat normal code, mostly because it combined the two functions into one Taylor evaluation. And alignment was crucial there too, of course.
But I have to double-check some more stuff, that is clear!
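
Roughly the idea, as a scalar sketch (my reconstruction; the real version was vectorised, and a few Taylor terms are only accurate near zero, so real code needs a range-reduction step first):

// Compute sin and cos in one pass, sharing the powers of x between
// both series (Horner form of the truncated Taylor expansions).
void sinCosTaylor (const float* x, float* s, float* c, int num)
{
    for (int i = 0; i < num; ++i)
    {
        const float x1 = x[i];
        const float x2 = x1 * x1; // shared between sin and cos

        // sin x ~ x - x^3/6 + x^5/120 - x^7/5040
        s[i] = x1 * (1.0f - x2 / 6.0f * (1.0f - x2 / 20.0f * (1.0f - x2 / 42.0f)));

        // cos x ~ 1 - x^2/2 + x^4/24 - x^6/720
        c[i] = 1.0f - x2 / 2.0f * (1.0f - x2 / 12.0f * (1.0f - x2 / 30.0f));
    }
}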

There was a SIMD talk I saw (I’ll have to dig it up) which talked about avoiding vector sizes that are exact powers of 2 for best performance.

I tried 511 items here instead of 512 and it seems to put vDSP_vsmsma back on top on my machine:

benchmark name                       samples       iterations    estimated
                                     mean          low mean      high mean
                                     std dev       low std dev   high std dev
-------------------------------------------------------------------------------
loop                                           100           401     2.2456 ms 
                                        56.2942 ns    56.2568 ns    56.3472 ns 
                                       0.225151 ns   0.172133 ns   0.331674 ns 
                                                                               
FloatVectorOperations copy then add            100           279     2.2599 ms 
                                         83.288 ns    81.6942 ns    85.7759 ns 
                                        10.0145 ns     7.1822 ns    14.0543 ns 
                                                                               
vDSP_vsmsma                                    100           472     2.2656 ms 
                                         47.617 ns     47.361 ns    48.3056 ns 
                                        1.96311 ns   0.832855 ns    3.89036 ns 
                                                                            
Code: AudioBlock, 511 samples
    SECTION ("nerd sniped")
    {
        // use audio blocks to ensure alignment
        juce::HeapBlock<char> aData;
        juce::dsp::AudioBlock<float> a = { aData, 1, 511 };

        juce::HeapBlock<char> bData;
        juce::dsp::AudioBlock<float> b = { bData, 1, 511 };

        juce::HeapBlock<char> resultData;
        juce::dsp::AudioBlock<float> result = { resultData, 1, 511 };
        float alpha = 3.5f;
        float beta = 1.2f;

        BENCHMARK ("loop")
        {
            for (int i = 0; i < (int) result.getNumSamples(); ++i)
            {
                result.setSample(0, i, alpha * a.getSample(0, i) + beta * b.getSample(0, i));
            }
            return result.getChannelPointer(0);
        };

        BENCHMARK ("FloatVectorOperations copy then add")
        {
            juce::FloatVectorOperations::copyWithMultiply (result.getChannelPointer(0), a.getChannelPointer(0), alpha, result.getNumSamples());
            juce::FloatVectorOperations::addWithMultiply (result.getChannelPointer(0), b.getChannelPointer(0), beta, result.getNumSamples());
            return result.getChannelPointer(0);
        };


        BENCHMARK ("vDSP_vsmsma")
        {
            vDSP_vsmsma (a.getChannelPointer(0), 1, &alpha, b.getChannelPointer(0), 1, &beta, result.getChannelPointer(0), 1, result.getNumSamples());
            return result.getChannelPointer(0);
        };
    }

Was it the vectorisation talk from ADC2018?
Were both the power-of-2 thing and the alignment thing to do with optimal caching?

Ah, it wasn’t a SIMD talk, but a CPU cache talk. The bit I was remembering was about cache associativity, which had to do with stride, so I’m unsure if it’s related (one would have to look under the hood at what vDSP is actually doing). (@ 51:15) CPU Cache Effects - Sergey Slotin - Meeting C++ 2022 - YouTube

@PaulDriessen I have to say thank you!!! I spent some time rewriting my osc and oscState classes with juce::dsp::SIMDRegister and, by some amazing miracle, the load dropped by half :slight_smile:

I also created my own AudioBuffer which uses juce::dsp::SIMDRegister. So now I have at most 31% CPU load on an Apple A12X with 16 voices x 4 osc x 16 unison x 4x oversampling + 2 Moog filters on each voice, a smooth wavetable with an a-b crossfade + FM modulation. That’s amazing :) Before it was 58% CPU load. So now I can run 32 voices at 57% CPU.

BTW, without filters it’s now just 21% CPU load.

Just for comparison, here is the load of some commercial synth with the same setup :stuck_out_tongue_closed_eyes:

This makes me think that SIMD is a worthwhile thing! But unfortunately I was not able to achieve the same with FloatVectorOperations: the load only grew, by 1.5-2x.

One very important thing: always use compile-time-constant bounds in for loops, so the compiler can successfully unroll them. It doesn’t unroll them if you pass a non-constexpr value, and CPU load increases dramatically.

For example, let’s imagine we work with unison:


const int numOfUnison = osc.getCurrentNumOfUnison();

for (int u = 0; u < numOfUnison; ++u) {
   // BAD! The compiler does not know the value of numOfUnison at compile time
   // (it is not a constant), so the loop will not be unrolled.
}

constexpr int kNumOfUnison = 16;

for (int u = 0; u < kNumOfUnison; ++u) {
   // Good: the compiler knows the value of kNumOfUnison, so the loop can be unrolled.
}

But if you still need a variable numOfUnison value, then you can try the following. It worked fine for me, with a very small cost:

for (int u = 0; u < 16; ++u) {
   if (u >= numOfUnison) { continue; }
   // build unison sound...
}

Thanks vtarasuk!
The unrolling is what bothers me: in my code the compiler most likely doesn’t know the loop count yet.
Have you compared your SIMD code against plain -O3-optimised code?

I haven’t tested it without -O3. I only did runs with -Ofast and LTO. With an optimized loop it was about 57%, and with SIMDRegister about 31%.

But I did one trick: instead of a float phase I now use a uint32_t, and I just increment it without an overflow check (i.e. without if (phase >= 1024) { phase -= 1024; }). So the phase basically runs from 0 to 4294967295 and always goes in circles. When reading a sample I just need to shift this value right by 22 bits (>> 22) to get a value in the range 0-1023.
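
A minimal sketch of that trick (illustrative names; the table size 1024 = 2^10, so the top 10 bits of the 32-bit phase index the table and overflow wraps the phase for free):

#include <cstdint>

struct PhaseAccumulator
{
    uint32_t phase = 0;     // full 32-bit phase; wraps on overflow for free
    uint32_t increment = 0; // per-sample increment

    void setFrequency (float freqHz, float sampleRate)
    {
        // map one full cycle onto the whole 2^32 range
        increment = (uint32_t) ((double) freqHz / sampleRate * 4294967296.0);
    }

    uint32_t nextIndex()
    {
        const uint32_t index = phase >> 22; // 32 - 10 bits -> 0..1023
        phase += increment;                 // no overflow check needed
        return index;
    }
};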

So basically, what I did:

  1. Originally I used my custom AudioBuffer, so I replaced T *array with a SIMD array.
  2. The add and clear operations now use SIMD.
  3. Things like phase[NumberOfUnison] and incrementFactor[NumberOfUnison] are SIMD arrays now.
  4. Summation of the sound from the OSCs into a voice buffer. That, I think, gave the biggest boost, up to 12-15% CPU.
  5. I now call separate methods for 16, 12, 8 and 4 unison voices (see the dispatch sketch after the code below). Otherwise, if I use a non-constant size in the for loop for numOfUnison, the CPU load just jumps by 15-20% and all the optimization is killed. So always try to use a static length in your loops.
using T = float;
using SIMDFloatType = juce::dsp::SIMDRegister<T>;
using SIMDUIntType = juce::dsp::SIMDRegister<uint32_t>;
static constexpr int MaxUIntUnisonSIMDElements = utils::ceil(T(Setup::MaxNumOfUnison) / SIMDUIntType::SIMDNumElements);
static constexpr int MaxFloatUnisonSIMDElements = utils::ceil(T(Setup::MaxNumOfUnison) / SIMDFloatType::SIMDNumElements);

struct OscillatorState {
  int numberOfUnison{16};
  T frequency{220.0};
  SIMDUIntType incrementFactorSIMD[MaxUIntUnisonSIMDElements]{};
  SIMDUIntType phaseSIMD[MaxUIntUnisonSIMDElements]{};
  SIMDFloatType gainSIMD[Setup::NumOfChannels][MaxFloatUnisonSIMDElements]{};
  WaveformModulatorState modulatorState{};
} JUCE_PACKED;
static inline void bounceSoundForUnison16 (T* buffer[], const OscillatorState& oscillatorState, const int i, const SIMDFloatType sound[]) {
#pragma unroll
    for (int ch = 0; ch < Setup::NumOfChannels; ++ch) {
        buffer[ch][i] += (sound[0] * oscillatorState.gainSIMD[ch][0]).sum()
                       + (sound[1] * oscillatorState.gainSIMD[ch][1]).sum()
                       + (sound[2] * oscillatorState.gainSIMD[ch][2]).sum()
                       + (sound[3] * oscillatorState.gainSIMD[ch][3]).sum();
    }
}

static inline void bounceSoundForUnison12 (T* buffer[], const OscillatorState& oscillatorState, const int i, const SIMDFloatType sound[]) {
#pragma unroll
    for (int ch = 0; ch < Setup::NumOfChannels; ++ch) {
        buffer[ch][i] += (sound[0] * oscillatorState.gainSIMD[ch][0]).sum()
                       + (sound[1] * oscillatorState.gainSIMD[ch][1]).sum()
                       + (sound[2] * oscillatorState.gainSIMD[ch][2]).sum();
    }
}

static inline void bounceSoundForUnison8 (T* buffer[], const OscillatorState& oscillatorState, const int i, const SIMDFloatType sound[]) {
#pragma unroll
    for (int ch = 0; ch < Setup::NumOfChannels; ++ch) {
        buffer[ch][i] += (sound[0] * oscillatorState.gainSIMD[ch][0]).sum()
                       + (sound[1] * oscillatorState.gainSIMD[ch][1]).sum();
    }
}

static inline void bounceSoundForUnison4 (T* buffer[], const OscillatorState& oscillatorState, const int i, const SIMDFloatType sound[]) {
#pragma unroll
    for (int ch = 0; ch < Setup::NumOfChannels; ++ch) {
        buffer[ch][i] += (sound[0] * oscillatorState.gainSIMD[ch][0]).sum();
    }
}
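
A sketch of how those fixed-size variants might be dispatched (assuming the functions and types from the snippets above), so each inner loop keeps its compile-time trip count:

static inline void bounceSound (T* buffer[], const OscillatorState& oscillatorState, const int i, const SIMDFloatType sound[]) {
    // one branch per call; every callee has a constant loop length
    switch (oscillatorState.numberOfUnison) {
        case 16: bounceSoundForUnison16 (buffer, oscillatorState, i, sound); break;
        case 12: bounceSoundForUnison12 (buffer, oscillatorState, i, sound); break;
        case 8:  bounceSoundForUnison8  (buffer, oscillatorState, i, sound); break;
        default: bounceSoundForUnison4  (buffer, oscillatorState, i, sound); break;
    }
}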

The most heavily loaded code for me is now this one (it stays unchanged and doesn’t use SIMD; I haven’t figured out what to do here yet):

const auto *phases = reinterpret_cast<uint32_t *>(oscillatorState.phaseSIMD);
std::transform(phases, phases + NumberOfUnison, output, [sampleA, sampleB, alpha, beta](uint32_t idx) {
    const uint32_t pos = idx >> 22;
    return sampleA[pos] * alpha + sampleB[pos] * beta;
});

Parts of the code that are significantly faster than a plain for loop:

// Increment the phase array using std::transform.
std::transform (oscillatorState.phaseSIMD, oscillatorState.phaseSIMD + UnisonUIntSIMDCount,
                oscillatorState.incrementFactorSIMD, oscillatorState.phaseSIMD, std::plus<>());

BTW, one strange thing: if I change this from std::transform to a plain for (int i = 0; ...) loop, it becomes slower )) I don’t know what the compiler is doing here.