SIMDRegister - How do I do the equivalent of


#1

I have got my head around Intel intrinsics, but I would like to know how to do the equivalent of some simple operations with SIMDRegister.

Take:

    __m128 v = _mm_setr_ps(0,2.2,1.3,19.9);
    __m128 p = _mm_set1_ps(2.3);
    __m128 u = _mm_add_ps(v , p);
    
    float e[4];
    _mm_store_ps(e,u);
    
    DBG(e[1]);

How would I replicate with SIMDRegister?


#2

Not the answer to your question, but I think you either need to force 16-byte alignment of e[] or use _mm_storeu_ps (unaligned store).


#3

In my code I’m pulling data in from a float array and initialising it with alignas:

alignas(32) float * samples = nullptr;

Though from reading up, the bleeding-edge approach is

samples = static_cast<float*>(std::aligned_alloc(32, numChannels * numDomains * numSamples * 4));

Which I may also start to use.

I have played around with creating a wrapper class more in line with the context I want to use it in, but if JUCE handles it out of the box then there’s no point reinventing the wheel. I just haven’t got my head around the underlying structure, and aligned or not, the above is just for a like-for-like context.

To check memory alignment I am using this assertion (note it has to test the address the pointer holds, not the address of the pointer variable itself):

jassert (reinterpret_cast<std::uintptr_t> (samples) % 32 == 0);

#4

Something like this:

using SIMDFloat = SIMDRegister<float>;

alignas (16) float vraw[] = { 0.0f, 2.2f, 1.3f, 19.9f };
 
SIMDFloat v = SIMDFloat::fromRawArray (vraw);
SIMDFloat p (2.3f);
SIMDFloat u = v + p;

alignas (16) float eraw[4];

u.copyToRawArray (eraw);

DBG (eraw[1]);

#5

Depending on which compiler you use, you can also specify a native SIMD literal with an initialiser list. This allows you to do the following:

using SIMDFloat = SIMDRegister<float>;

SIMDFloat v = SIMDFloat::fromNative ({ 0.0f, 2.2f, 1.3f, 19.9f });

Sadly, not all compilers support this.


#6

Just for the record: although people are tolerant of quite messy-looking SIMD code because so much of it looks like that, it doesn’t have to be so bad if you use a more modern style. E.g. the example above

using SIMDFloat = SIMDRegister<float>;
alignas (16) float vraw[] = { 0.0f, 2.2f, 1.3f, 19.9f };
SIMDFloat v = SIMDFloat::fromRawArray (vraw);
SIMDFloat p (2.3f);
SIMDFloat u = v + p;
alignas (16) float eraw[4];
u.copyToRawArray (eraw);
DBG (eraw[1]);

could be written as simply as this:

auto v = juce::dsp::SIMDRegister<float>::fromNative ({ 0.0f, 2.2f, 1.3f, 19.9f });
auto u = v + 2.3f;
DBG (u.get(1));

…or even as a one-liner:

DBG ((juce::dsp::SIMDRegister<float>::fromNative ({ 0.0f, 2.2f, 1.3f, 19.9f })
       + 2.3f).get(1));

(although that’s a bit too terse for even my taste…)


#7

Or even better

DBG (4.5f);

:stuck_out_tongue_winking_eye:


#8

Ha… (’twas supposed to be a mundane example)

Cheers for the examples. I was trying to draw some parallels from the XSIMD library, which makes sense to me. I assume I can use SIMDRegister with 8 floats (AVX) as well.

Thanks again


#9

Yes you can; you just need to enable AVX2 at compile time (for example, -mavx2 for clang and gcc).


#10

Is there any cost to accessing lanes with .get(x)?

I had wrapped something like that up myself, but it’s handy that it’s in there.


#11

There’s always some cost, but I assume it’ll be pretty low.


#12

It will boil down to a single machine instruction in release mode.


#13

This thread has been really illuminating for me; I was very confused about the SIMDRegister class before this, so thank you! I’d like to tack on another “How do I do X” on the thread since it seems the previous question has an answer:

SIMDRegister<float> WavetableOscillator::tick (SIMDRegister<float> phase) {
    int index = static_cast<int> (phase * kTableSize);
    int rightIndex = (index + 1) & kTableMask;
    float frac = phase * (float) kTableSize - (float) index;
    return lerp (frac, m_table[index], m_table[rightIndex]);
}

This is a contrived example, but the point stands: the line int index = static_cast<int>(phase * kTableSize) will fail to compile, because you can’t cast an __m128 to int, for example. But if I have this sort of procedure where I want to cast some float type to int, maybe for a table lookup with linear interpolation, can it still be vectorised with SIMDRegister?


#14

My head’s been in the vectorising space the last week as well, trying to implement some really heavy polyphase and linear-phase filter banks.

From what I can tell, you don’t want to mix SIMDRegister<float> and SIMDRegister<int> with operators, as it will be more costly than plain code. There are no Intel intrinsics to do this, and I don’t think you can cast one type to the other.

Is kTableSize just an int? As written, there seems to be no benefit to having phase as a SIMD object. I know that SIMDRegister has got the get overload and the [] shorthand:

int index = static_cast<int>(phase[0] * kTableSize);
// or
int index = static_cast<int>(phase.get(0) * kTableSize);

would work…

Another thing I noted as well, comparing the JUCE SIMD classes with others, is that

juce::dsp::SIMDRegister<float> f (2.0f);
auto g = f * 8.0f; // works
auto h = 8.0f * f; // doesn't

It would be good to have operators for both the left-hand and right-hand side, as it can take a little rearranging to get code into the right form, which may make it less easy to understand what a formula or algorithm does.

I’ll let Fabian or Jules answer your question, as I’m only levelled up to about a green belt in SIMD and still have quite a bit to learn.


#15

I’ve also found that you need at least a black belt in SIMD tuning to get stuff like linear interpolation faster than its naive implementation.

The thing with linear interpolation is that it can have any stride factor, resulting in unpredictable memory access, which is the bottleneck. The raw mathematics is so trivial that it doesn’t matter whether it is SIMDed or not.

But I also started using this class recently and I got the best performance increases by replacing repetitive calls to FloatVectorOperations functions with tight loops that operate on the hot data SIMD-style:

// This
FloatVectorOperations::multiply(data, 2.0f, numSamples);
FloatVectorOperations::add(data, otherData, numSamples);
FloatVectorOperations::multiply(data, -1.0f, numSamples);

// becomes something like this
using SSEType = dsp::SIMDRegister<float>;

int numLoop = (int) (numSamples / SSEType::SIMDNumElements);

while (--numLoop >= 0)
{
    auto a = SSEType::fromRawArray (data);
    auto b = SSEType::fromRawArray (otherData);
    a = (a * 2.0f + b) * SSEType::expand (-1.0f);
    a.copyToRawArray (data);
    data += SSEType::SIMDNumElements;
    otherData += SSEType::SIMDNumElements;
}

Another thing I noticed regarding the SIMDRegister class is that it requires the AVX2 compiler flag in order to use the AVX register size. This is a pretty tight requirement, since there are a lot of CPUs that don’t have AVX2 (I am typing this on a machine without it, for example), but AVX should be the lowest common denominator by now (I think all CPUs since 2011 support it, and who uses anything older for serious audio work).

Are there some things missing in the AVX instruction set, or is there another reason for this decision?

#ifdef __AVX2__
#include "native/juce_avx_SIMDNativeOps.h"
#else

#16

So kTableSize in this example is just an int. Essentially what I’m trying to do (which is how I got to this example) is process LFO modulators in parallel (i.e. each channel in a stereo pair can have its own LFO modulator). I’d like to interleave the stereo buffer and then operate on both the L/R channels via SIMDRegister. I think most of my code is set up well for that, but I have two cases like the above where I need the float type to map to an array lookup.


#17

I’m not sure you can use SIMD in this way. If you’re using a SIMD object as a LUT index, then you can only have one lookup per SIMD object. You can’t index individual rows across multiple SIMD objects and mix them into one object in a single operation.

Now that I get what you’re doing, though, I would say just bite the bullet: process each channel separately and feed each value into two new SIMD objects, which you can then lerp between, as this is going to be the most computationally expensive part anyway.

    auto index = phase * kTableSize;
    auto rightIndex = (index + 1) & kTableMask;

    alignas (16) float frac[4];
    frac[0] = phase[0] * (float) kTableSize - (float) index[0];
    ...
    frac[3] = phase[3] * (float) kTableSize - (float) index[3];

    auto fracSimd = SIMDRegister<float>::fromRawArray (frac);

    // do the same with m_table
    alignas (16) float m_table_left[4];
    alignas (16) float m_table_right[4];
    ...
    ... // fill table values
    ...
    auto m_table_left_Simd = SIMDRegister<float>::fromRawArray (m_table_left);
    auto m_table_right_Simd = SIMDRegister<float>::fromRawArray (m_table_right);

    return lerp (fracSimd, m_table_left_Simd, m_table_right_Simd);

Then profile it to make sure you’re getting the boost.


#18

Have you seen Angus’ talk from ADC2017?