SIMDRegister - How do I do the equivalent of


#1

I have got my head around Intel intrinsics, but I would like to know how to do the equivalent of some simple operations with SIMDRegister.

Take:

    __m128 v = _mm_setr_ps(0,2.2,1.3,19.9);
    __m128 p = _mm_set1_ps(2.3);
    __m128 u = _mm_add_ps(v , p);
    
    float e[4];
    _mm_store_ps(e,u);
    
    DBG(e[1]);

How would I replicate with SIMDRegister?


#2

Not the answer to your question, but I think you either need to force 16-byte alignment of e[] or use _mm_storeu_ps (unaligned store).


#3

In my code I’m pulling data in from a float array and initialising it with alignas:

alignas(32) float * samples = nullptr;

Though from reading up, the bleeding-edge approach is

samples = static_cast<float*>(std::aligned_alloc(32, numChannels * numDomains * numSamples * 4));

Which I may also start to use.

I have played around with creating a wrapper class more in line with the context I want to use it in, but if JUCE handles it out of the box then there’s no point reinventing the wheel. I just haven’t got my head around the underlying structure, and aligned or not, the above is just for a like-for-like context.

To check memory alignment I am using this assertion (note it has to test the address the pointer holds, not the address of the pointer variable itself):

jassert (reinterpret_cast<std::uintptr_t> (samples) % 32 == 0);

#4

Something like this:

using SIMDFloat = SIMDRegister<float>;

alignas (16) float vraw[] = { 0.0f, 2.2f, 1.3f, 19.9f };
 
SIMDFloat v = SIMDFloat::fromRawArray (vraw);
SIMDFloat p (2.3f);
SIMDFloat u = v + p;

alignas (16) float eraw[4];

u.copyToRawArray (eraw);

DBG (eraw[1]);

#5

Depending on which compiler you use, you can also specify a native SIMD literal with an initialiser list. This allows you to do the following:

using SIMDFloat = SIMDRegister<float>;

SIMDFloat v = SIMDFloat::fromNative ({ 0.0f, 2.2f, 1.3f, 19.9f });

Sadly, not all compilers support this.


#6

Just for the record: although people are tolerant of quite messy-looking SIMD code because so much of it looks like that, it doesn’t have to be so bad if you use a more modern style. E.g. the example above

using SIMDFloat = SIMDRegister<float>;
alignas (16) float vraw[] = { 0.0f, 2.2f, 1.3f, 19.9f };
SIMDFloat v = SIMDFloat::fromRawArray (vraw);
SIMDFloat p (2.3f);
SIMDFloat u = v + p;
alignas (16) float eraw[4];
u.copyToRawArray (eraw);
DBG (eraw[1]);

could be written as simply as this:

auto v = juce::dsp::SIMDRegister<float>::fromNative ({ 0.0f, 2.2f, 1.3f, 19.9f });
auto u = v + 2.3f;
DBG (u.get(1));

…or even as a one-liner:

DBG ((juce::dsp::SIMDRegister<float>::fromNative ({ 0.0f, 2.2f, 1.3f, 19.9f })
       + 2.3f).get(1));

(although that’s a bit too terse for even my taste…)


#7

Or even better

DBG (4.5f);

:stuck_out_tongue_winking_eye:


#8

Ha… (’twas supposed to be a mundane example)

Cheers for the examples. I was trying to draw some parallels from the XSIMD library, which makes sense to me. I assume I can use SIMDRegister with 8 floats (AVX) as well.

Thanks again


#9

Yes you can; you just need to enable AVX2 at compile time (for example, -mavx2 for clang and gcc).


#10

Is there any cost to accessing lanes with .get(x)?

I had wrapped something like that up myself, but it’s handy that it’s in there.


#11

There’s always some cost, but I assume it’ll be pretty low.


#12

It will boil down to a single machine instruction in release mode.


#13

This thread has been really illuminating for me; I was very confused about the SIMDRegister class before this, so thank you! I’d like to tack on another “How do I do X” on the thread since it seems the previous question has an answer:

SIMDRegister<float> WavetableOscillator::tick (SIMDRegister<float> phase) {
    int index = static_cast<int> (phase * kTableSize);
    int rightIndex = (index + 1) & kTableMask;
    float frac = phase * (float) kTableSize - (float) index;
    return lerp (frac, m_table[index], m_table[rightIndex]);
}

This is a contrived example, but the point stands: the line int index = static_cast<int>(phase * kTableSize) will fail to compile, because you can’t cast an __m128 to int, for example. But if I have this sort of procedure where I want to cast some float type to int, maybe for a table lookup with linear interpolation, can it still be vectorised with SIMDRegister?


#14

My head’s been in the vectorising space the last week as well, trying to implement some really heavy polyphase and linear-phase filter banks.

From what I can tell, you don’t want to mix SIMDRegister<float> and SIMDRegister<int> with operators, as it will be more costly than plain code. There are no Intel intrinsics to do this, and I don’t think you can cast one type to the other.

Is kTableSize just an int? As written, there seems to be no benefit to having phase as a SIMD object. I know that SIMDRegister has got the get overload and the [] shorthand:

int index = static_cast<int>(phase[0] * kTableSize);
// or
int index = static_cast<int>(phase.get(0) * kTableSize);

would work…

Another thing I noted as well, comparing the JUCE SIMD classes with others, is that

juce::dsp::SIMDRegister<float> f (2.0f);
auto g = f * 8.0f; // works
auto h = 8.0f * f; // doesn't

It would be good to have operators for both the left-hand and right-hand side, as it can take a little rearranging to get code into the right form, which may make it less easy to understand what a formula or algorithm does.

I’ll let Fabian or Jules answer your question, as I’m only levelled up to about a green belt in SIMD and still have quite a bit to learn.


#15

I’ve also found that you need at least a black belt in SIMD tuning to get stuff like linear interpolation faster than its naive implementation.

The thing with linear interpolation is that it can have any stride factor, resulting in unpredictable memory access, which is the bottleneck. The raw mathematics is so trivial that it doesn’t matter whether it is SIMDed or not.

But I also started using this class recently and I got the best performance increases by replacing repetitive calls to FloatVectorOperations functions with tight loops that operate on the hot data SIMD-style:

// This
FloatVectorOperations::multiply(data, 2.0f, numSamples);
FloatVectorOperations::add(data, otherData, numSamples);
FloatVectorOperations::multiply(data, -1.0f, numSamples);

// becomes something like this
using SSEType = dsp::SIMDRegister<float>;

int numLoop = (int) (numSamples / SSEType::SIMDNumElements);

while (--numLoop >= 0)
{
    auto a = SSEType::fromRawArray (data);
    auto b = SSEType::fromRawArray (otherData);
    a = (a * 2.0f + b) * SSEType::expand (-1.0f);
    a.copyToRawArray (data);
    data += SSEType::SIMDNumElements;
    otherData += SSEType::SIMDNumElements;
}

Another thing I noticed regarding the SIMDRegister class is that it requires the AVX2 compiler flag in order to use the AVX register size. This is a pretty tight requirement, since there are a lot of CPUs that don’t have AVX2 (I am typing this on a machine without it, for example), but AVX should be the lowest common denominator by now (I think all CPUs since 2011 support it, and who uses anything older for serious audio work).

Are there some things missing in the AVX instruction set, or is there another reason for this decision?

#ifdef __AVX2__
#include "native/juce_avx_SIMDNativeOps.h"
#else

#16

So kTableSize in this example is just an int. Essentially what I’m trying to do (which is how I got to this example) is process LFO modulators in parallel (i.e. each channel in a stereo pair can have its own LFO modulator). I’d like to interleave the stereo buffer and then operate on both the L/R channels via SIMDRegister. I think most of my code is set up well for that, but I have two cases like the above where I need the float type to map to an array lookup.


#17

I’m not sure you can use SIMD in this way. If you’re using a SIMD object as a LUT index, then you can only have one lookup per SIMD object. You can’t index individual rows across multiple SIMD objects and mix them into one object in a single operation.

Now that I get what you’re doing, though, I would say just bite the bullet: process each channel separately and feed each value into two new SIMD objects, which you can then lerp between, as this is going to be the most computationally expensive part anyway.

    auto index = phase * kTableSize;
    auto rightIndex = (index + 1) & kTableMask;

    alignas (16) float frac[4];
    frac[0] = phase[0] * (float) kTableSize - (float) index[0];
    ...
    frac[3] = phase[3] * (float) kTableSize - (float) index[3];

    auto fracSimd = SIMDRegister<float>::fromRawArray (frac);

    // do the same with m_table
    alignas (16) float m_table_left[4];
    alignas (16) float m_table_right[4];
    ...
    ... // fill table values
    ...
    auto m_table_left_Simd = SIMDRegister<float>::fromRawArray (m_table_left);
    auto m_table_right_Simd = SIMDRegister<float>::fromRawArray (m_table_right);

    return lerp (fracSimd, m_table_left_Simd, m_table_right_Simd);

Then profile it to make sure you’re getting the boost.


#18

Have you seen Angus’ talk from ADC2017?