I’ve set up the oscillator to process all the unison voices (not synth voices) in the same tick, so I thought about calculating all 8 unison voices with SIMD and then summing only the active ones to the output. I’ve profiled it, so I know for sure that the linear interpolation is by far the most CPU-consuming part (54% of the time), followed by the wavetable accesses. But I’m not getting any benefit from SIMD + release build compared to the release build alone.
Original:
Tick:
for (int n=0; n<unisonVoices; n++)
{
//... phase and wavetable access managing stuff
tmpOut[n] = x0[n] + fraccional[n] * (x1[n] - x0[n]);
}
// ... more code summing the tmpOut to the output
SIMD:
Variables:
alignas (16) float fraccional4[4];
alignas (16) float fraccional8[4];
alignas (16) float x04[4];
alignas (16) float x08[4];
alignas (16) float x14[4];
alignas (16) float x18[4];
alignas (16) float tmpOut4[4];
alignas (16) float tmpOut8[4];
Tick:
SIMDRegister<float> simd_x04 = dsp::SIMDRegister<float>::fromRawArray(x04);
SIMDRegister<float> simd_x08 = dsp::SIMDRegister<float>::fromRawArray(x08);
SIMDRegister<float> simd_x14 = dsp::SIMDRegister<float>::fromRawArray(x14);
SIMDRegister<float> simd_x18 = dsp::SIMDRegister<float>::fromRawArray(x18);
SIMDRegister<float> simd_fraccional4 = dsp::SIMDRegister<float>::fromRawArray(fraccional4);
SIMDRegister<float> simd_fraccional8 = dsp::SIMDRegister<float>::fromRawArray(fraccional8);
SIMDRegister<float> simd_tmpOut4 = simd_x04 + simd_fraccional4 * (simd_x14 - simd_x04);
SIMDRegister<float> simd_tmpOut8 = simd_x08 + simd_fraccional8 * (simd_x18 - simd_x08);
simd_tmpOut4.copyToRawArray(tmpOut4);
simd_tmpOut8.copyToRawArray(tmpOut8);
// ... more code summing the tmpOut4 and tmpOut8 to the output
Also tried with multiplyAdd but the result is the same.
Profiling results:
Original tick code (Release build): 57.5% of time. 3 most consuming instructions: movl (13.79%), movl (5.56%), movss (4.73%). The rest (maths included) are below 3%.
SIMD tick code (Release build): 54.4% of time. 3 most consuming instructions: movl (8.07%), movaps (6.27%), movq (4.08%). The rest (maths included) are 2% and below.
I don’t know a lot about assembly, but it seems like the data-movement instructions are the most expensive operations. Is memory access the bottleneck here? Or is the compiler already auto-vectorizing well enough that the maths are no longer the highest cost?