I’m just starting to figure out where SIMD usage could help me get better performance in my codebase. While I understand the basic idea of SIMD, I’m not sure how to understand what dsp::SIMDRegister or the underlying platform-specific SIMD types really express.
Starting with a conventional code snippet:
float a, b, c;
c = a + b;
I’m pretty sure that this should lead to a series of processor instructions like this:
- Load the content of a from some RAM location into a CPU register
- Load the content of b from some RAM location into another CPU register
- Add the contents of both registers in the ALU and store the result in a third CPU register
- Store the result from this third register to the RAM location of c
Now let’s say I have arrays of four floats instead of scalars, and to start with I use something convenient like FloatVectorOperations::add to add them up using vector operations. Then I’d write:
float a[4], b[4], c[4];
FloatVectorOperations::add (c, a, b, 4);
I believe that after the compiler has performed all inlining optimizations, this should lead to a series of processor instructions like this (assuming the arrays are perfectly aligned for SIMD usage):
- Load the four floats of a from some RAM location into a CPU SIMD register
- Load the four floats of b from some RAM location into another CPU SIMD register
- Add the contents of both SIMD registers in the CPU’s SIMD unit and store the result in a third CPU SIMD register
- Store the results from this third register to the RAM location of the array c
Am I right until this point?
Assuming I am right, let’s perform the same task using dsp::SIMDRegister - at least as I understand it right now, assuming dsp::SIMDRegister<float>::SIMDNumElements equals four on my target architecture (and still assuming the arrays a, b and c are perfectly aligned for SIMD usage):
float a[4], b[4], c[4];
auto aSIMDReg = dsp::SIMDRegister<float>::fromRawArray (a);
auto bSIMDReg = dsp::SIMDRegister<float>::fromRawArray (b);
auto cSIMDReg = aSIMDReg + bSIMDReg;
cSIMDReg.copyToRawArray (c);
Am I right that the whole point of SIMDRegister is to explicitly express in code the register loading/saving that CPUs do anyway, and that there is no additional memory copy overhead involved when using functions like fromRawArray and especially copyToRawArray? Or is there additional memory copy overhead compared to the usual scalar operations that I should be aware of, which could make simple SIMD operations more “expensive” than scalar operations in some cases, even when working on vectorized data?
I hope my question makes sense.
Thanks in advance for clearing this up!
