Simplest way to use SIMD for basic float multiplication/addition?

I read the tutorial here but I unfortunately couldn’t get anything usable out of it.

I am simply trying to improve the performance of one of my more computationally expensive synths. It has to do an enormous number of multiplications/additions per sample, as it is doing some advanced physical modeling.

As I understand it, SIMD allows you to add/multiply up to 4 floats or doubles for the cost of one, right? Is it system-dependent how many you can do at once?

What is the absolute simplest code that might allow me to do this hypothetically with SIMD (random data here for example):

//ALREADY EXISTING DATA IN VARIABLES TO WORK WITH
float set1A = 20;
float set1B = 10;
float set1C = 2;
float set1D = 8;

float set2A = 5;
float set2B = 30;
float set2C = 80;
float set2D = 6;

//NEED TO REPLACE WITH SIMD TO GO FASTER AND ASSIGN RESULTS TO THOSE VARIABLE NAMES:
float resultA = set1A * set2A;
float resultB = set1B * set2B;
float resultC = set1C * set2C;
float resultD = set1D * set2D;

Surely there is some easy way to take advantage of SIMD to replace the multiplication operations without making life too hard?

For example, according to this, in .NET it would be as simple as putting the data into a Vector4 and multiplying with the * operator (note that Vector4.Dot would reduce everything to a single float, which is not what we want here), then reassigning your output variables from the result:

Vector4 set1 = new Vector4(set1A, set1B, set1C, set1D);
Vector4 set2 = new Vector4(set2A, set2B, set2C, set2D);
Vector4 result = set1 * set2;

float resultA = result.X;
float resultB = result.Y;
float resultC = result.Z;
float resultD = result.W;

This is obvious, simple, easy, and intuitive. Whereas I can’t make any sense of the JUCE equivalent, if one exists.

If the JUCE method is inherently cumbersome, I wonder if it is worth adopting a third-party library like this one, or is there another you would recommend?

Thanks for any help.

The JUCE equivalent of your Vector4 is juce::dsp::SIMDRegister<float>. Instead of Vector4.Dot() you simply keep using the * operator.

And yes, it’s system-dependent how many floats/doubles you can process at once. There are different instruction sets (SSE, AVX, etc. on Intel; Neon on Arm), but a SIMD abstraction like JUCE’s SIMDRegister hides that from you.
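For reference, here is roughly what that multiply boils down to on an Intel machine, written with raw SSE intrinsics instead of JUCE (a sketch assuming x86/x64; `multiply4` is just a hypothetical helper name, not an API):

```cpp
#include <xmmintrin.h> // SSE intrinsics (x86/x64 only)

// Hypothetical helper, not a JUCE API: multiplies four float pairs
// element-wise with a single SSE multiply instruction.
void multiply4 (const float* in1, const float* in2, float* out)
{
    __m128 a = _mm_loadu_ps (in1);          // load 4 floats (unaligned-safe)
    __m128 b = _mm_loadu_ps (in2);
    _mm_storeu_ps (out, _mm_mul_ps (a, b)); // 4 multiplies in one instruction
}
```

With the numbers from your example, `multiply4 (set1, set2, result)` would leave 100, 300, 160 and 48 in `result`.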


Thanks. I didn’t realize a SIMDRegister holds four floats like that. Based on your tip, I found a simple example of how to use it here:

I will try that soon.

The funny thing about all this is it requires you to:

  • create a new (aligned) array of the primitives (floats)
  • load that array into a new class to get it into the SIMD format
  • do this twice if you need to multiply two sets of data
  • multiply the SIMD types (the only step that actually saves operations)
  • create a new (aligned) array to copy the data back to, or call a get function for each element

It seems surprising to me that this is any better than just multiplying, but I guess multiplying is still much more expensive than shuffling all these primitives around.
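That shuffling overhead is also why SIMD really pays off when you keep data in registers across many operations, or process whole buffers at a time rather than four values once. A sketch of the buffer case, again with raw SSE intrinsics rather than JUCE (assuming x86/x64 and a length that is a multiple of 4; `multiplyBuffers` is a made-up name):

```cpp
#include <xmmintrin.h> // SSE intrinsics (x86/x64 only)
#include <cstddef>

// Hypothetical example, not JUCE: element-wise multiply of two buffers,
// four floats per loop iteration. Assumes n is a multiple of 4.
void multiplyBuffers (const float* in1, const float* in2,
                      float* out, std::size_t n)
{
    for (std::size_t i = 0; i < n; i += 4)
    {
        __m128 a = _mm_loadu_ps (in1 + i);
        __m128 b = _mm_loadu_ps (in2 + i);
        _mm_storeu_ps (out + i, _mm_mul_ps (a, b));
    }
}
```

Here the load/store cost is paid once per four samples, so the arithmetic saving dominates as the buffer grows.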

Also, in the example given, is the alignas (16) necessary? E.g.

alignas (16) float eraw[4];
u.copyToRawArray (eraw);

What would happen if you didn’t have the alignas? Would it just fail?

I notice Jules’ code skips all this completely. Is it likely to be less, more, or equally efficient to use:

float val0 = u.get(0);
float val1 = u.get(1);
float val2 = u.get(2);
float val3 = u.get(3);

Thanks for any further thoughts.

IIRC, if you do an aligned load on an unaligned array, then it will either crash or be slower than it needs to be, depending on the CPU that’s being used.

If memory is allocated on the heap it might already be properly aligned (again depending on the CPU) but on the stack there’s no guarantee how your local variables are aligned, hence the alignas.

Note that this same shuffling of data also happens in the .NET Vector4 class; it’s just less obvious.
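You can see the alignment guarantee directly by inspecting the address. A small check (the helper name is made up for illustration):

```cpp
#include <cstdint>

// Returns true if p sits on a 16-byte boundary, which is what
// aligned SSE loads such as _mm_load_ps require.
bool isAligned16 (const void* p)
{
    return reinterpret_cast<std::uintptr_t> (p) % 16 == 0;
}
```

`alignas (16) float raw[4];` guarantees `isAligned16 (raw)` is true, whereas a plain local float array might land on any 4-byte boundary.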


Thanks. I guess more specifically then I’m wondering about these two methods:

Option 1:

using SIMDFloat = juce::dsp::SIMDRegister<float>;
alignas (16) float vraw[] = { 0.0f, 2.2f, 1.3f, 19.9f };
SIMDFloat v = SIMDFloat::fromRawArray (vraw);
SIMDFloat p (2.3f);
SIMDFloat u = v + p;
alignas (16) float eraw[4];
u.copyToRawArray (eraw);

Option 2:

auto v = juce::dsp::SIMDRegister<float>::fromNative ({ 0.0f, 2.2f, 1.3f, 19.9f });
auto u = v + 2.3f;
DBG (u.get(1));

In #1, we align the array before loading it in, and receive the result back into an aligned array. Am I right to presume, then, that fromNative internally creates an aligned value from the non-aligned data supplied?

And does the get function read from another internally generated aligned array, or from the SIMDRegister directly?

Presumably the two are equivalent in some way. Thanks again.

fromNative ({ 0.0f, 2.2f, 1.3f, 19.9f }) directly creates a __m128 object (on Intel SSE), which is automatically aligned already.

fromRawArray() performs a load from an array in memory, which is not necessarily aligned, into a __m128 object.

It might be useful to learn how to use compiler intrinsics for SIMD directly, just so you get a sense of how this works under the hood.