I’m just starting to figure out where SIMD usage could help me get better performance in my codebase. While I understand the basic idea of SIMD, I’m not sure how to understand what dsp::SIMDRegister or the underlying platform-specific SIMD types really express.
Starting with a conventional code snippet:
float a, b, c;
c = a + b;
I’m pretty sure that this should lead to a series of processor instructions like this:
- Load the content of a from some RAM location into a CPU register
- Load the content of b from some RAM location into another CPU register
- Add the contents of both registers in the ALU and store the result in a third CPU register
- Store the result from this third register to the RAM location of c
Now let’s say I have arrays of four floats instead of scalars, and to start with I use something convenient like FloatVectorOperations::add to add them up using vector operations. Then I’d write:
float a[4], b[4], c[4];
FloatVectorOperations::add (c, a, b, 4);
I believe that after the compiler has performed all inlining optimizations, this should lead to a series of processor instructions like this (assuming the arrays are perfectly aligned for SIMD usage):
- Load the four floats of a from some RAM location into a CPU SIMD register
- Load the four floats of b from some RAM location into another CPU SIMD register
- Add the contents of both SIMD registers in the CPU’s SIMD unit and store the result in a third CPU SIMD register
- Store the results from this third register to the RAM location of the array c
Am I right until this point?
Assuming I am right, let’s perform the same task using dsp::SIMDRegister - at least as I understand it right now, assuming dsp::SIMDRegister<float>::SIMDNumElements equals four on my target architecture (and still assuming the arrays a, b and c are perfectly aligned for SIMD usage):
float a[4], b[4], c[4];
auto aSIMDReg = dsp::SIMDRegister<float>::fromRawArray (a);
auto bSIMDReg = dsp::SIMDRegister<float>::fromRawArray (b);
auto cSIMDReg = aSIMDReg + bSIMDReg;
cSIMDReg.copyToRawArray (c);
Am I right that the whole point of SIMDRegister is to explicitly express in code the register loading/saving that CPUs do anyway, and that there is no additional memory copy overhead involved when using functions like fromRawArray and especially copyToRawArray? Or is there additional memory copy overhead compared to the usual scalar operations that I should be aware of, which could make simple SIMD operations more “expensive” than scalar operations in some cases, even when working on vectorized data?
I hope my question makes sense.
Thanks in advance for clearing this up!
