Getting started using SIMD - basic question regarding fromRawArray / copyToRawArray


#1

I’m just starting to figure out where in my codebase SIMD usage could help me get better performance. While I understand the basic idea of SIMD, I’m not sure how to understand what dsp::SIMDRegister or the underlying platform-specific SIMD types really express.

Starting with a conventional scalar code snippet:

float a, b, c;
c = a + b;

I’m pretty sure that this should lead to a series of processor instructions like this:

  • Load the content of a from some RAM location into a CPU register
  • Load the content of b from some RAM location into another CPU register
  • Add the content of both registers in the ALU and store it in a third CPU register
  • Store the result from this third register to the RAM location of c

Now let’s say I have arrays of four floats instead of scalars, and to start with I use something convenient like FloatVectorOperations::add to add them up using vector operations. Then I’d write:

float a[4], b[4], c[4];
FloatVectorOperations::add (c, a, b, 4);

I believe that after the compiler has performed all inlining optimizations, this should lead to a series of processor instructions like this (assuming the arrays are perfectly aligned for SIMD usage; a rough intrinsics sketch follows the list):

  • Load the four floats of a from some RAM location into a CPU SIMD register
  • Load the four floats of b from some RAM location into another CPU SIMD register
  • Add the content of both SIMD registers in the CPU’s SIMD unit and store it in a third CPU SIMD register
  • Store the results from this third register to the RAM location of the array c
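For reference, my mental model of those four steps is roughly what the SSE intrinsics below express on x86. This is just my own illustration, not what JUCE actually generates, and it assumes 16-byte-aligned arrays:

#include <xmmintrin.h>  // SSE intrinsics (x86 only; purely illustrative)

void addFourFloats (const float* a, const float* b, float* c)
{
    __m128 ra = _mm_load_ps (a);      // steps 1 and 2: load four floats each of a and b
    __m128 rb = _mm_load_ps (b);      // (_mm_load_ps requires 16-byte alignment)
    __m128 rc = _mm_add_ps (ra, rb);  // step 3: one add instruction for all four lanes
    _mm_store_ps (c, rc);             // step 4: store the four results to c
}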

Am I right until this point?

Assuming I am right, let’s perform the same task using dsp::SIMDRegister - at least as I understand it right now - still assuming dsp::SIMDRegister<float>::SIMDNumElements equals four on my target architecture (and that the arrays a, b and c are perfectly aligned for SIMD usage):

float a[4], b[4], c[4];
auto aSIMDReg = dsp::SIMDRegister<float>::fromRawArray (a);
auto bSIMDReg = dsp::SIMDRegister<float>::fromRawArray (b);
auto cSIMDReg = aSIMDReg + bSIMDReg;
cSIMDReg.copyToRawArray (c);
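
For completeness, here is the same snippet in a form that should actually compile, with the alignment assumption spelled out via alignas. The 16 is just my shorthand for a 128-bit SIMD register and would need adjusting on wider architectures:

#include <JuceHeader.h>  // or however the juce_dsp module is pulled into the project
using namespace juce;

void addWithSIMDRegister()
{
    alignas (16) float a[4] = { 1.0f, 2.0f, 3.0f, 4.0f };  // make the assumed alignment explicit
    alignas (16) float b[4] = { 5.0f, 6.0f, 7.0f, 8.0f };
    alignas (16) float c[4];

    auto aSIMDReg = dsp::SIMDRegister<float>::fromRawArray (a);  // load a into a SIMD register
    auto bSIMDReg = dsp::SIMDRegister<float>::fromRawArray (b);  // load b into another one
    auto cSIMDReg = aSIMDReg + bSIMDReg;                         // one SIMD add for all four lanes
    cSIMDReg.copyToRawArray (c);                                 // store the result back to c
}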

Am I right in thinking that the whole point of SIMDRegister is to express explicitly in code the register loading/saving that CPUs do anyway, and that there is no additional memory-copy overhead involved when using functions like fromRawArray and especially copyToRawArray? Or is there some memory-copy overhead compared to ordinary scalar operations that I should be aware of, which could make simple SIMD operations more “expensive” than scalar operations in some cases, even when working on vectorized data?

I hope you get my question :wink: Thanks in advance for clearing this up


#2

The short answer is that once you let the compiler optimise things, you can no longer make simple assumptions like those. For example, the compiler may choose to leave things in a register even when it looks like you’ll be “copying” data. In cases like these you should either profile or look at the generated assembly using something like Compiler Explorer - guessing from a higher level is difficult.
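
For instance, a tiny function like the one below (purely illustrative, and assuming the pointers are suitably aligned and the juce_dsp module is available) pasted into Compiler Explorer with optimisations enabled will show you directly whether fromRawArray / copyToRawArray collapse into plain SIMD load/store instructions or whether anything extra gets generated:

// Paste into Compiler Explorer with -O2 and inspect the output.
void addFour (const float* a, const float* b, float* c)
{
    auto va = juce::dsp::SIMDRegister<float>::fromRawArray (a);
    auto vb = juce::dsp::SIMDRegister<float>::fromRawArray (b);
    (va + vb).copyToRawArray (c);
}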