My experience has also been that you need at least a black belt in SIMD tuning to get stuff like linear interpolation faster than its naive implementation.
The thing with linear interpolation is that it can have any stride factor, resulting in unpredictable memory access, and that is the real bottleneck. The raw maths is so trivial that it doesn't matter whether it is SIMDed or not.
But I also started using this class recently, and the best performance increases came from replacing repetitive calls to FloatVectorOperations functions with tight loops that operate on the hot data SIMD-style:
FloatVectorOperations::multiply(data, 2.0f, numSamples);
FloatVectorOperations::add(data, otherData, numSamples);
FloatVectorOperations::multiply(data, -1.0f, numSamples);
// becomes something like this
using SSEType = dsp::SIMDRegister<float>;

// process SIMDNumElements samples per iteration (a scalar loop has to
// mop up the remainder if numSamples isn't a multiple of it)
int numLoops = numSamples / (int) SSEType::SIMDNumElements;

while (--numLoops >= 0)
{
    auto a = SSEType::fromRawArray (data);       // data must be SIMD-aligned
    auto b = SSEType::fromRawArray (otherData);

    // the three FloatVectorOperations calls above, fused into one pass:
    a = (a * SSEType::expand (2.0f) + b) * SSEType::expand (-1.0f);
    a.copyToRawArray (data);

    data += SSEType::SIMDNumElements;
    otherData += SSEType::SIMDNumElements;
}
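Here is a plain-C++ stand-in for the same idea (no JUCE, names are my own), mainly to show the scalar tail loop you need when numSamples isn't a multiple of the register width; the constant W = 4 just stands in for SIMDNumElements on a 128-bit float register:

```cpp
// Portable sketch of the fused loop: data = ((data * 2) + other) * -1,
// processed in chunks of W to mimic a SIMD register, with a scalar tail
// loop for the leftover samples.
void fusedMultiplyAddNegate (float* data, const float* otherData, int numSamples)
{
    constexpr int W = 4;   // stand-in for SIMDRegister<float>::SIMDNumElements
    int i = 0;

    for (; i + W <= numSamples; i += W)    // "SIMD" body
        for (int j = 0; j < W; ++j)        // a compiler may auto-vectorise this
            data[i + j] = (data[i + j] * 2.0f + otherData[i + j]) * -1.0f;

    for (; i < numSamples; ++i)            // scalar tail for the remainder
        data[i] = (data[i] * 2.0f + otherData[i]) * -1.0f;
}
```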
Another thing I noticed regarding the SIMDRegister class is that it requires the AVX2 compiler flag in order to use the AVX register size. This is a pretty tight requirement, since there are a lot of CPUs that don't have AVX2 (the machine I'm typing this on, for example). Plain AVX should be the lowest common denominator by now - I think all CPUs since 2011 support it, and who uses anything older for serious audio work?
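For anyone who wants to check what a translation unit was actually built for: compilers expose predefined macros for this (GCC and Clang define them for -mavx / -mavx2, MSVC for /arch:AVX and /arch:AVX2). A tiny sketch:

```cpp
// Reports the highest x86 AVX-family extension this file was compiled for,
// via the predefined __AVX__ / __AVX2__ macros (GCC, Clang and MSVC all
// define these when the corresponding -m / /arch flag is set).
constexpr const char* compiledSimdLevel()
{
   #if defined (__AVX2__)
    return "AVX2";
   #elif defined (__AVX__)
    return "AVX";
   #else
    return "SSE or below";
   #endif
}
```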
Is something missing from the AVX instruction set, or is there another reason for this decision?