My SIMD code barely improves performance

johngalt91 · December 3, 2019, 4:54pm

I’ve set up the oscillator to process all the unison voices (not synth voices) in the same tick, so I thought about calculating all 8 unison voices with SIMD and then just summing the active ones to the output. I’ve profiled it, so I know for sure that the linear interpolation is by far the most CPU-consuming part (54% of the time), followed by accessing the wavetables. But I’m not getting any benefit from the SIMD version over the plain version, both in release builds.

Original:
Tick:

    for (int n = 0; n < unisonVoices; n++)
    {
        // ... phase and wavetable access managing stuff
        tmpOut[n] = x0[n] + fraccional[n] * (x1[n] - x0[n]);
    }
    // ... more code summing the tmpOut to the output

SIMD:
Variables:

    alignas (16) float fraccional4[4];
    alignas (16) float fraccional8[4];
    alignas (16) float x04[4];
    alignas (16) float x08[4];
    alignas (16) float x14[4];
    alignas (16) float x18[4];
    alignas (16) float tmpOut4[4];
    alignas (16) float tmpOut8[4];

Tick:

    SIMDRegister<float> simd_x04 = dsp::SIMDRegister<float>::fromRawArray(x04);
    SIMDRegister<float> simd_x08 = dsp::SIMDRegister<float>::fromRawArray(x08);
    SIMDRegister<float> simd_x14 = dsp::SIMDRegister<float>::fromRawArray(x14);
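    SIMDRegister<float> simd_x18 = dsp::SIMDRegister<float>::fromRawArray(x18);
    SIMDRegister<float> simd_fraccional4 = dsp::SIMDRegister<float>::fromRawArray(fraccional4);
    SIMDRegister<float> simd_fraccional8 = dsp::SIMDRegister<float>::fromRawArray(fraccional8);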
        
    SIMDRegister<float> simd_tmpOut4 = simd_x04 + simd_fraccional4 * (simd_x14-simd_x04);
    SIMDRegister<float> simd_tmpOut8 = simd_x08 + simd_fraccional8 * (simd_x18-simd_x08);

    simd_tmpOut4.copyToRawArray(tmpOut4);
    simd_tmpOut8.copyToRawArray(tmpOut8);

    // ... more code summing the tmpOut4 and tmpOut8 to the output

Also tried with multiplyAdd but the result is the same.

Profiling results:
Original tick code (Release build): 57.5% of time. 3 most consuming instructions: movl (13.79%), movl (5.56%), movss (4.73%). The rest (maths included) are below 3%.
SIMD tick code (Release build): 54.4% of time. 3 most consuming instructions: movl (8.07%), movaps (6.27%), movq (4.08%). The rest (maths included) are 2% and below.

I don’t know a lot about assembly, but it seems like the move instructions are the most expensive operations. Is memory access the bottleneck here? Or is the compiler already vectorizing well enough that the maths is no longer the main cost?

jimc · December 3, 2019, 5:04pm

Moving registers should be cheap.

Maybe the compiler had already optimised it to SIMD? Check the disassembly of the non-SIMD version for SIMD instructions.
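
One way to check this, as a minimal sketch: isolate the interpolation in a small free function and look at the assembly the compiler generates for it (for example with `-O2 -S` on GCC or Clang). The name `lerp8`, the constant `kUnison` and the fixed size of 8 are assumptions made for illustration; they are not from the code posted above. If the output contains packed instructions such as `mulps` or `addps`, the scalar loop has already been vectorized.

```cpp
#include <array>
#include <cstddef>

// Hypothetical helper (not from the original post): linear interpolation
// for a fixed number of unison voices. With a fixed trip count like this,
// optimizing compilers will often emit packed SSE/AVX instructions.
constexpr std::size_t kUnison = 8;

void lerp8(const std::array<float, kUnison>& x0,
           const std::array<float, kUnison>& x1,
           const std::array<float, kUnison>& frac,
           std::array<float, kUnison>& out)
{
    for (std::size_t n = 0; n < kUnison; ++n)
        out[n] = x0[n] + frac[n] * (x1[n] - x0[n]);
}
```

If the hand-written SIMD version and this scalar loop compile to roughly the same packed instructions, that would be consistent with the two builds showing nearly identical timings.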

jimc · December 3, 2019, 5:06pm

https://www.agner.org/optimize/ generally useful.

johngalt91 · December 3, 2019, 5:23pm

Thanks for the resources.
It seems it’s using SIMD (SSE) instructions like mulps, movaps, etc., but those have a really low impact on CPU.

I’m afraid the problem is cache locality: since it’s an FM synth, I can’t render a whole buffer of samples at once. I can only render one tick of each oscillator at a time so that it can modulate the other oscillators (or can I?). When I rendered the whole buffer at a time for a normal subtractive synth, performance improved a ton, and I guess SIMD would help much more there than it does with only 8 operations in 2 SIMD registers per tick.

DaveH · December 3, 2019, 9:02pm

Have you tried rendering each oscillator one block at a time, and referencing the stored output for your FM work? I’ve had quite a bit of success storing LFO outputs and referring to them later.
The cache doesn’t get hit much and it simplifies a lot of the synthesis process.
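
A rough sketch of that idea, under stated assumptions: the modulator is rendered for the whole block into a buffer, and the carrier then reads the stored samples as a per-sample phase offset. `SineOsc`, `renderBlock`, `renderFmBlock` and `modIndex` are hypothetical names invented for this example, `std::sin` stands in for the wavetable lookup, and the routing is assumed to be feed-forward (the modulator does not depend on the carrier's output).

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical sine oscillator; a real synth would read a wavetable instead.
struct SineOsc
{
    float phase = 0.0f;      // current phase in cycles, kept in [0, 1)
    float increment = 0.0f;  // frequency / sampleRate

    // Renders a whole block. phaseMod may be nullptr; otherwise it holds
    // one phase offset (in cycles) per sample.
    void renderBlock(const float* phaseMod, float* out, std::size_t numSamples)
    {
        constexpr float twoPi = 6.283185307179586f;
        for (std::size_t i = 0; i < numSamples; ++i)
        {
            const float mod = phaseMod != nullptr ? phaseMod[i] : 0.0f;
            out[i] = std::sin(twoPi * (phase + mod));
            phase += increment;
            phase -= std::floor(phase);  // wrap back into [0, 1)
        }
    }
};

// Renders the modulator for the whole block first, then lets the carrier
// read the stored modulator samples. Assumes feed-forward routing only.
std::vector<float> renderFmBlock(SineOsc& modulator, SineOsc& carrier,
                                 float modIndex, std::size_t numSamples)
{
    std::vector<float> modBuffer(numSamples);
    std::vector<float> output(numSamples);

    modulator.renderBlock(nullptr, modBuffer.data(), numSamples);
    for (float& m : modBuffer)
        m *= modIndex;  // scale modulator output to a phase offset in cycles

    carrier.renderBlock(modBuffer.data(), output.data(), numSamples);
    return output;
}
```

In a real audio callback the buffers would be preallocated rather than created per block. Each inner loop now runs over contiguous samples of a single oscillator, which is the shape that block rendering and SIMD tend to benefit from.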
