Using SIMDRegister is slower than regular multiplications ... Any way to perhaps make it useful?

I am trying to make the math within a for loop cheaper. I must run 150 iterations of the loop per sample, and there are about 14 multiplications in it.

This seems like a good fit for 3 rounds of SIMD. However, when I try it, it slows things down rather than speeding them up.

I have tried this in three ways:

METHOD 1

Declare alignas arrays outside of the loop and reuse them as follows:

alignas (16) float vRawA[] = { 0, 0, 0, 0 };
alignas (16) float vRawB[] = { 0, 0, 0, 0 };

for (int i = 0; i < n; i++) {

    vRawA[0] = val1;
    vRawA[1] = val2;
    vRawA[2] = val3;
    vRawA[3] = val4;

    vRawB[0] = val5;
    vRawB[1] = val6;
    vRawB[2] = val7;
    vRawB[3] = val8;

    auto v1A = juce::dsp::SIMDRegister<float>::fromRawArray(vRawA);
    auto v1B = juce::dsp::SIMDRegister<float>::fromRawArray(vRawB);
    auto v1Out = v1A * v1B;

    double val9 = v1Out.get(0);
    double val10 = v1Out.get(1);
    double val11 = v1Out.get(2);
    double val12 = v1Out.get(3);
}

METHOD 2:

Use fromNative instead:

for (int i = 0; i < n; i++) {

    auto v1A = juce::dsp::SIMDRegister<float>::fromNative({
        (float)val1,
        (float)val2,
        (float)val3,
        (float)val4
    });

    auto v1B = juce::dsp::SIMDRegister<float>::fromNative({
        (float)val5,
        (float)val6,
        (float)val7,
        (float)val8
    });

    auto v1Out = v1A * v1B;

    double val9 = v1Out.get(0);
    double val10 = v1Out.get(1);
    double val11 = v1Out.get(2);
    double val12 = v1Out.get(3);
}

METHOD 3:

Create the SIMDRegister objects once outside the loop and set the floats into them:

alignas (16) float vRawA[] = { 0.0f, 0, 0, 0 };
alignas (16) float vRawB[] = { 0.0f, 0, 0, 0 };
auto v1A = juce::dsp::SIMDRegister<float>::fromRawArray(vRawA);
auto v1B = juce::dsp::SIMDRegister<float>::fromRawArray(vRawB);

for (int i = 0; i < n; i++) {

    v1A.set(0, val1);
    v1A.set(1, val2);
    v1A.set(2, val3);
    v1A.set(3, val4);

    v1B.set(0, val5);
    v1B.set(1, val6);
    v1B.set(2, val7);
    v1B.set(3, val8);

    auto v1Out = v1A * v1B;

    double val9 = v1Out.get(0);
    double val10 = v1Out.get(1);
    double val11 = v1Out.get(2);
    double val12 = v1Out.get(3);
}

All three of these, in the real-world application (three sets of SIMD multiplications in each loop iteration, then summing the results), are slower than not using SIMD at all. This is on an Intel CPU on Windows.

My only guesses as to why this isn't truly "optimizing" anything are:

  • I am casting to float for the SIMD operations since my values are doubles - maybe all the casting is what's killing me?
  • Maybe the get and set functions, or constructing the registers, are what is so inefficient?
  • The cost of shuffling all this data around exceeds the benefit of the SIMD (memory limitations?)
  • The compiler was already applying some type of SIMD automatically, i.e. autovectorization already makes the plain loop faster without explicit SIMD.

This is surprisingly disappointing. This seems like the perfect place for SIMD (lots of multiplications in one place), and yet so far I see no benefit, only harm.

I also have read that SIMD is sometimes slower because it "exposes memory bottlenecks." I am not sure what is meant by that. Is it the problem of all the shuffling of data between these various places? Maybe I am already riding a bottleneck there and this is just exposing it.

Re: Autovectorization, I see other people finding this with loops too here:

And it is explained here:

Basically the compiler already optimizes for loops in certain ways, and the manual SIMD operations are probably getting in the way of that. Perhaps that is the most likely explanation.

Maybe I simply can't optimize any further, then.

Thanks for any thoughts.

When there is recursive stuff involved, it's harder to get a good performance boost.

Still, you need to keep your state in registers as much as possible and only write it back to member variables when the loop is finished.
So you load the previous state into registers before the loop, do everything in registers inside the loop (shifting within them like a ring buffer), and when you exit the for loop you write the state back to the member variables.
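For example, a minimal sketch of that idea (the OnePole struct and its z1 member are hypothetical, just to illustrate the pattern, not code from this thread):

struct OnePole
{
    float z1 = 0.0f;   // previous output sample: the recursive state

    void process (float* samples, int numSamples)
    {
        float state = z1;   // load the previous state into a local once, before the loop

        for (int i = 0; i < numSamples; ++i)
        {
            // everything inside the loop works on locals, which the
            // compiler can keep in registers
            state = samples[i] + 0.5f * state;
            samples[i] = state;
        }

        z1 = state;   // write the state back to the member only after the loop
    }
};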

The trick is not to try to be too clever about it. Here's an example of how to get things done efficiently using SIMD:

#include <array>
#include <cstddef>

using Vec4 = std::array<float, 4>;

template<typename T, std::size_t N>
inline const std::array<T, N>& operator += (std::array<T, N>& result, const std::array<T, N>& v)
{
    // std::size_t for the index and the non-type parameter, so this also
    // deduces correctly against std::array on MSVC (where size_t is not
    // unsigned long)
    for (std::size_t i = 0; i < N; ++i)
        result[i] += v[i];

    return result;
}


Vec4 Add(Vec4 a, Vec4 b)
{
    a += b;

    return a;
}

And here is what Clang compiles it into:

Add(std::array<float, 4ul>, std::array<float, 4ul>):               # @Add(std::array<float, 4ul>, std::array<float, 4ul>)
        vaddps  xmm0, xmm2, xmm0
        vaddps  xmm1, xmm3, xmm1
        ret

You can try this example for yourself and modify it on Godbolt:
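For the multiplications from the question, the same pattern could look roughly like this (a sketch only: the operator* overload is analogous to the operator+= above, and val1…val8 are assumed to be the variables from the original loop):

#include <array>
#include <cstddef>

using Vec4 = std::array<float, 4>;   // same alias as in the example above

// Element-wise multiply, written so the compiler can vectorise it just like
// the operator+= example.
template<typename T, std::size_t N>
inline std::array<T, N> operator* (const std::array<T, N>& a, const std::array<T, N>& b)
{
    std::array<T, N> result {};

    for (std::size_t i = 0; i < N; ++i)
        result[i] = a[i] * b[i];

    return result;
}

// Usage inside the loop from the question (val1..val8 as in the original post):
//     Vec4 a { (float) val1, (float) val2, (float) val3, (float) val4 };
//     Vec4 b { (float) val5, (float) val6, (float) val7, (float) val8 };
//     Vec4 out = a * b;   // an optimised build should turn this into a packed multiply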


When it comes to performance tuning and choosing the best way to solve a problem, the first thing is always to try to understand the problem as well as possible. In your case the problem is that you want your algorithm to perform better if possible. You seem to assume that a SIMD-based multiplication is the key to the desired speedup. This implies that you assume that the compiler does not already vectorise your code anyway, and that a significant amount of time is spent performing those multiplications.

My usual workflow would be:

  • Find out where the most time is spent in the code
  • Maybe inspect the assembly that the compiler generates for that code already
  • Think of possible approaches to refactor the problematic code
  • Test the refactored code with the same tools as before and identify which approach helped most
  • Inspect the assembly that the compiler generates for the new code to verify that my changes actually lead to what I expected

Having done this a few times, you start to identify some best-practice patterns or anti-patterns that apply most of the time, but when it comes to slightly more complex code there is often a need to inspect the very specific piece of code in its context to come to a conclusion about the best way to speed it up.

So it's a bit difficult to give good advice about the best approach without the wider context of your code and some proper analysis.

So maybe a few words on tools for this analysis. The starting point should be a symbolicated release build (never base performance analysis on debug builds) of the real-world application or plugin, run with a time profiler attached. On macOS the best choice is Apple Instruments; on Windows the profiler bundled with Visual Studio works well, and Intel VTune has also proved a good choice on Windows and x86-based Linux. Run the app or plugin under real-world circumstances and look for hot stack traces on the relevant threads.
If you don't see the function with these multiplications popping up there, you might want to reconsider how much of an issue your multiplications even are.

Chances are that your compiler was already clever enough to optimise the code in unexpected ways. A good way to find out is to analyse the relevant functions in compiler explorer. This can be done online via godbolt.org. If your code relies heavily on third-party libraries not available on godbolt.org, you can also host compiler explorer locally on your build machine and let it access all your local dependencies. With compiler explorer you can then see how the code translates to assembly and where the compiler already does auto-vectorization.
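To give a rough idea of what to look for (a generic illustration, not code from this thread): even a plain scalar loop like the one below is usually auto-vectorised by Clang, GCC and MSVC in optimised builds, which you can confirm in compiler explorer by spotting packed instructions such as mulps/vmulps.

// A plain scalar multiply loop; with optimisations enabled, the compiler
// will typically vectorise this on its own, without any explicit SIMD code.
void multiplyBuffers (const float* a, const float* b, float* out, int n)
{
    for (int i = 0; i < n; ++i)
        out[i] = a[i] * b[i];
}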

If you can identify that the compiler does not auto-vectorise a function that shows up heavily in the profiling results, it might be a good idea to roll out manual vectorisation. Go on and try out various approaches, see how they translate to assembly in compiler explorer and, even more importantly, how the profiling results change.

I see that this answer is more generic than what you might have expected. In general, looking at your code I'm somewhat surprised that the compiler didn't optimise a lot away from your test implementations, so I wonder if you really tested a release build? The JUCE SIMDRegister classes are known to perform extremely badly in debug builds, usually a lot slower than straightforward scalar implementations, while gaining an exceptional performance increase in release builds. Also, if you showed a bit more context of your actual function, we could give you some more ideas on how to approach optimisation for that specific bit of code.


Yes, the JUCE SIMD class in debug builds is painfully slow. Like oftentimes way, way slower than not using SIMD at all.

One possible way around this, if it's suitable, is to keep a non-SIMD version of your method(s) as a reference and #if that in for debug builds. I'd call such code the 'reference' code, because often, once you're done with SIMD, things look quite different and it's handy to have the original idea to refer back to.

You can either use that for debug builds or even use it for unit testing too.
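A minimal sketch of what that could look like (the function names are placeholders, and the SIMD branch assumes the data pointer is suitably aligned for SIMDRegister::fromRawArray):

#include <juce_dsp/juce_dsp.h>   // or your project's JuceHeader.h

// Scalar 'reference' implementation: easy to read, debug and unit-test against.
static void multiplyReference (float* data, float gain, int numSamples)
{
    for (int i = 0; i < numSamples; ++i)
        data[i] *= gain;
}

void multiplyBuffer (float* data, float gain, int numSamples)
{
   #if JUCE_DEBUG
    // SIMDRegister is very slow in debug builds, so use the reference code there.
    multiplyReference (data, gain, numSamples);
   #else
    using Reg = juce::dsp::SIMDRegister<float>;
    const int width = (int) Reg::SIMDNumElements;
    const auto vGain = Reg::expand (gain);

    int i = 0;

    for (; i + width <= numSamples; i += width)
    {
        auto v = Reg::fromRawArray (data + i);    // assumes 'data' is SIMD-aligned
        (v * vGain).copyToRawArray (data + i);
    }

    // Any leftover samples are handled by the scalar code.
    for (; i < numSamples; ++i)
        data[i] *= gain;
   #endif
}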