My SIMD code barely improves performance

I’ve set up the oscillator to compute all the unison voices (not synth voices) in the same tick, so I thought about calculating all 8 unison voices with SIMD and then just summing the active ones to the output. I’ve profiled it, so I know for sure that the linear interpolation is by far the most CPU-consuming part (54% of time), followed by accessing the wavetables. But I’m not getting any benefit from the SIMD version in a release build compared with the plain release build.

Original:
Tick:

    for (int n = 0; n < unisonVoices; n++)
    {
        // ... phase and wavetable access managing stuff
        tmpOut[n] = x0[n] + fraccional[n] * (x1[n] - x0[n]);
    }
    // ... more code summing the tmpOut to the output

SIMD:
Variables:

    alignas (16) float fraccional4[4];
    alignas (16) float fraccional8[4];
    alignas (16) float x04[4];
    alignas (16) float x08[4];
    alignas (16) float x14[4];
    alignas (16) float x18[4];
    alignas (16) float tmpOut4[4];
    alignas (16) float tmpOut8[4];

Tick:

    SIMDRegister<float> simd_x04 = dsp::SIMDRegister<float>::fromRawArray(x04);
    SIMDRegister<float> simd_x08 = dsp::SIMDRegister<float>::fromRawArray(x08);
    SIMDRegister<float> simd_x14 = dsp::SIMDRegister<float>::fromRawArray(x14);
    SIMDRegister<float> simd_x18 = dsp::SIMDRegister<float>::fromRawArray(x18);
        
    SIMDRegister<float> simd_tmpOut4 = simd_x04 + simd_fraccional4 * (simd_x14-simd_x04);
    SIMDRegister<float> simd_tmpOut8 = simd_x08 + simd_fraccional8 * (simd_x18-simd_x08);

    simd_tmpOut4.copyToRawArray(tmpOut4);
    simd_tmpOut8.copyToRawArray(tmpOut8);

    // ... more code summing the tmpOut4 and tmpOut8 to the output

Also tried with multiplyAdd but the result is the same.

Profiling results:
Original tick code (Release build): 57.5% of time. 3 most consuming instructions: movl (13.79%), movl (5.56%), movss (4.73%). The rest (maths included) are below 3%.
SIMD tick code (Release build): 54.4% of time. 3 most consuming instructions: movl (8.07%), movaps (6.27%), movq (4.08%). The rest (maths included) are 2% and below.

I don’t know a lot about assembly, but it seems that the move instructions are the most expensive operations. Is memory access the bottleneck here? Or is the compiler already vectorizing well enough that the maths is no longer the dominant cost?

Moving registers should be cheap.

Maybe the compiler had already optimised it to SIMD? Check the disassembly of the non-SIMD version for SIMD instructions.
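As a rough way to test that idea, the interpolation loop can be pulled out into a small stand-alone function and compiled on its own. The sketch below is not the poster's actual code: `lerpVoices` and its parameter names are made up for illustration, and it assumes the inputs already sit in contiguous float arrays. With GCC, `-O3 -fopt-info-vec` reports which loops were vectorized; with Clang, `-Rpass=loop-vectorize` does the same. In the disassembly, packed instructions such as `mulps` and `addps` (rather than the scalar `mulss` and `addss`) would suggest the compiler vectorized it.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-alone version of the interpolation loop from the
// original post. The arrays are contiguous and no iteration depends on
// another, which is the kind of loop optimizing compilers can usually
// vectorize without hand-written SIMD.
void lerpVoices (const float* x0, const float* x1, const float* frac,
                 float* out, std::size_t numVoices)
{
    assert (x0 != nullptr && x1 != nullptr && frac != nullptr && out != nullptr);

    for (std::size_t n = 0; n < numVoices; ++n)
        out[n] = x0[n] + frac[n] * (x1[n] - x0[n]);
}
```

If this already compiles to packed instructions, hand-written SIMD for the same arithmetic is unlikely to change much, and the remaining cost is probably in getting the data into those arrays.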

https://www.agner.org/optimize/ generally useful.

Thanks for the resources.
It seems it’s using SIMD (SSE) instructions like mulps, movaps, etc., but those have a really low impact on CPU.

I’m afraid the problem is cache locality: since it’s an FM synth, I can’t render a whole buffer of samples at once. I can only render 1 tick of each oscillator at a time so that it can modulate the other oscillators (or can I?). When I rendered the whole buffer at a time for a normal subtractive synth, performance improved a ton, and I guess SIMD would help much more there than it does with only 8 operations in 2 SIMD registers per tick.

Have you tried rendering each oscillator one block at a time, and referencing the stored output for your FM work? I’ve had quite a bit of success storing LFO outputs and referring to them later.
The cache doesn’t get hit much and it simplifies a lot of the synthesis process.
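A minimal sketch of that block-at-a-time idea, assuming a simple feed-forward pair where one operator phase-modulates another. Everything here (`Operator`, `renderBlock`, `renderFmPair`, `modIndex`, the scratch buffer) is hypothetical naming for illustration, and `std::sin` stands in for the wavetable lookup to keep the example short:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical minimal sine operator; names are illustrative only.
struct Operator
{
    float phase = 0.0f;     // normalised phase in [0, 1)
    float phaseInc = 0.0f;  // frequency / sampleRate

    // Renders a whole block. modIn may be nullptr (no modulation); otherwise
    // it holds the already-rendered output of the modulating operator.
    void renderBlock (const float* modIn, float modIndex,
                      float* out, std::size_t numSamples)
    {
        assert (out != nullptr);
        constexpr float twoPi = 6.28318530718f;

        for (std::size_t i = 0; i < numSamples; ++i)
        {
            const float mod = (modIn != nullptr) ? modIndex * modIn[i] : 0.0f;
            out[i] = std::sin (twoPi * phase + mod);
            phase += phaseInc;
            phase -= std::floor (phase);  // wrap back into [0, 1)
        }
    }
};

// Feed-forward pair: the modulator is rendered for the whole block first,
// then the carrier reads the stored samples as its phase modulation input.
void renderFmPair (Operator& modulator, Operator& carrier, float modIndex,
                   std::vector<float>& scratch, float* out, std::size_t numSamples)
{
    scratch.resize (numSamples);  // in real-time code, preallocate instead
    modulator.renderBlock (nullptr, 0.0f, scratch.data(), numSamples);
    carrier.renderBlock (scratch.data(), modIndex, out, numSamples);
}
```

This ordering only works when the modulation routing has no feedback: an operator that modulates itself, or a loop between operators, still needs the previous sample (or has to accept a delay), so those paths would likely stay per-sample.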