I am trying to make the math inside a for loop cheaper. I must run 150 iterations of the loop per sample, and there are about 14 multiplications in each iteration.
This seems like a good fit for three rounds of four-wide SIMD. However, when I try it, things slow down rather than speed up.
I have tried three approaches:
METHOD 1
Declare alignas arrays outside the loop and reuse them, as follows:
alignas(16) float vRawA[] = { 0, 0, 0, 0 };
alignas(16) float vRawB[] = { 0, 0, 0, 0 };

for (int i = 0; i < n; i++) {
    vRawA[0] = val1;
    vRawA[1] = val2;
    vRawA[2] = val3;
    vRawA[3] = val4;
    vRawB[0] = val5;
    vRawB[1] = val6;
    vRawB[2] = val7;
    vRawB[3] = val8;

    auto v1A = juce::dsp::SIMDRegister<float>::fromRawArray(vRawA);
    auto v1B = juce::dsp::SIMDRegister<float>::fromRawArray(vRawB);
    auto v1Out = v1A * v1B;

    double val9  = v1Out.get(0);
    double val10 = v1Out.get(1);
    double val11 = v1Out.get(2);
    double val12 = v1Out.get(3);
}
METHOD 2
Use fromNative instead:
for (int i = 0; i < n; i++) {
    auto v1A = juce::dsp::SIMDRegister<float>::fromNative({
        (float) val1,
        (float) val2,
        (float) val3,
        (float) val4
    });
    auto v1B = juce::dsp::SIMDRegister<float>::fromNative({
        (float) val5,
        (float) val6,
        (float) val7,
        (float) val8
    });
    auto v1Out = v1A * v1B;

    double val9  = v1Out.get(0);
    double val10 = v1Out.get(1);
    double val11 = v1Out.get(2);
    double val12 = v1Out.get(3);
}
METHOD 3
Keep the SIMDRegister objects across iterations and set the floats into them each time:
alignas(16) float vRawA[] = { 0.0f, 0, 0, 0 };
alignas(16) float vRawB[] = { 0.0f, 0, 0, 0 };
auto v1A = juce::dsp::SIMDRegister<float>::fromRawArray(vRawA);
auto v1B = juce::dsp::SIMDRegister<float>::fromRawArray(vRawB);

for (int i = 0; i < n; i++) {
    v1A.set(0, val1);
    v1A.set(1, val2);
    v1A.set(2, val3);
    v1A.set(3, val4);
    v1B.set(0, val5);
    v1B.set(1, val6);
    v1B.set(2, val7);
    v1B.set(3, val8);

    auto v1Out = v1A * v1B;

    double val9  = v1Out.get(0);
    double val10 = v1Out.get(1);
    double val11 = v1Out.get(2);
    double val12 = v1Out.get(3);
}
In the real-world application (three sets of SIMD multiplications per loop iteration, then summing the results), all three of these are slower than using no SIMD at all. This is on an Intel CPU under Windows.
My only guesses as to why this isn't actually "optimizing" anything are:
- I am casting to float for the SIMD operations since my values are in double - maybe all the casting is what's killing me?
- Maybe the get and set functions, or constructing the register objects, are what is so inefficient?
- The cost of shuffling all this data around exceeds the benefit of the SIMD (memory limitations?).
- The compiler was already applying some type of SIMD automatically, perhaps?
- Autovectorization already makes the plain loop faster than my manual SIMD.
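To make the "shuffling" and "casting" guesses concrete, here is a hypothetical sketch using plain arrays in place of SIMDRegister (so it stands alone without JUCE; the function names and data layout are mine, not from the code above). It tallies the scalar traffic wrapped around each 4-wide multiply, and contrasts it with a contiguous, single-precision layout where that traffic disappears:

```cpp
#include <cstddef>

// Pattern A - mirrors the lane-by-lane methods above. Per iteration:
// 8 double->float casts + 8 scalar stores to fill the lanes, one 4-wide
// multiply, then 4 scalar loads + 4 float->double casts to read it back.
// The single vector multiply is buried under roughly two dozen scalar ops.
void lanewise(const double* src, double* dst)
{
    alignas(16) float a[4], b[4], out[4];
    for (int lane = 0; lane < 4; ++lane) {
        a[lane] = static_cast<float>(src[lane]);      // cast + store per lane
        b[lane] = static_cast<float>(src[lane + 4]);  // cast + store per lane
    }
    for (int lane = 0; lane < 4; ++lane)
        out[lane] = a[lane] * b[lane];                // the actual work
    for (int lane = 0; lane < 4; ++lane)
        dst[lane] = static_cast<double>(out[lane]);   // load + cast per lane
}

// Pattern B - operands kept contiguous and in one precision, so loads,
// multiplies, and stores can all be vector-width and the casts vanish.
void contiguous(const double* a, const double* b, double* dst, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = a[i] * b[i];  // unit stride, trivially vectorizable
}
```

If the surrounding code can be restructured so the 14 operands land in contiguous buffers (and stay in one precision), the per-lane inserts, extracts, and casts - the likely bottleneck - go away.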
This is surprisingly disappointing. This seems like the perfect place for SIMD (lots of multiplications in one place), and yet so far there is no benefit, only harm.
I have also read that SIMD is sometimes slower because it "exposes memory bottlenecks." I am not sure what is meant by that. Is it the problem of all the shuffling of data between these various places? Maybe I am already riding a memory bottleneck and the SIMD is just exposing it.
Re: autovectorization, I see other people reporting the same thing with loops here:
And it is explained here:
Basically, the compiler already optimizes loops in specific ways, and the manual SIMD operations are probably getting in the way of that. Perhaps that is the most likely explanation.
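For what it's worth, here is a hypothetical sketch (my names, not from the code above) of the shape the compiler sees in the "no SIMD" version: independent multiplies at unit stride, which modern compilers vectorize on their own. You can check whether that is happening with MSVC's /Qvec-report:2 output or GCC's -fopt-info-vec; if the plain loop is already getting SSE/AVX instructions, hand-rolled registers can only add overhead.

```cpp
#include <cstddef>

// A straight-line loop over independent multiplies: nothing for the
// auto-vectorizer to fight - no lane inserts, extracts, or casts.
void plainMultiply(const double* a, const double* b, double* out, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a[i] * b[i];  // each iteration independent, unit stride
}
```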
Maybe I simply can't optimize any further, then.
Thanks for any thoughts.