Understanding SIMDRegister usage


#1

Knowing very little about SIMD, I thought I’d take a look at the new tutorials (thank you!) and see how they could help some of my code independently of the dsp module. It looks like a platform-independent wrapper, similar to FloatVectorOperations… which looks cool.

The tutorial shows an example of modifying a simple function using SIMDRegister wrappers. This:

float calculateDSPEffect (float x, float y)
{
    auto z = x + (y * 2.0f);
    return z;
}

into this:

SIMDRegister<float> calculateDSPEffect (SIMDRegister<float> x,
                                        SIMDRegister<float> y)
{
    auto z = x + (y * 2.0f);
    return z;
}

So I tried a very simple test case and got some errors:

void compute(int numSamples, float* inBuffer, float* outBuffer)
{
	for(int i = 0; i < numSamples; ++i)
	{
		dsp::SIMDRegister<float> x = inBuffer[i];
		x = x / 2.0f;     // Invalid operands to binary expression ('dsp::SIMDRegister<float>' and 'float')
		outBuffer[i] = x; // Assigning to 'float' from incompatible type 'dsp::SIMDRegister<float>'
	}
}

I looked in the SIMDRegister class and didn’t see a division operator and I also couldn’t find a way to convert it back to a float. So, even with my incredibly simple test case, I’m stumped. Anyone?


#2

It doesn’t work because a SIMDRegister object is basically an array of floats (four with SSE or NEON). So you need to fill the four lanes of your x variable, and advance your for loop by four samples at a time as well. About the division: that specific operator isn’t available, but you can replace “/ 2.0f” with “* 0.5f”.
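To make that concrete, here is a minimal sketch of the corrected compute function in plain C++, with no JUCE dependency. The Float4 struct below is only a stand-in for dsp::SIMDRegister<float> to show the shape of the loop (in the real class you would load and store the lanes with its raw-array helpers instead of the manual copies here):

```cpp
#include <cassert>

// Illustration only: Float4 mimics a 4-lane SIMD register, where one
// operation applies to all four values at once.
struct Float4
{
    float v[4];

    Float4 operator* (float s) const
    {
        return { v[0] * s, v[1] * s, v[2] * s, v[3] * s };
    }
};

void compute (int numSamples, const float* inBuffer, float* outBuffer)
{
    int i = 0;

    for (; i + 4 <= numSamples; i += 4)   // step by 4 samples, not 1
    {
        // Fill the four lanes from the input buffer
        Float4 x { inBuffer[i], inBuffer[i + 1], inBuffer[i + 2], inBuffer[i + 3] };

        x = x * 0.5f;                     // no operator/, so multiply instead

        for (int j = 0; j < 4; ++j)       // copy the lanes back out
            outBuffer[i + j] = x.v[j];
    }

    for (; i < numSamples; ++i)           // scalar tail for leftover samples
        outBuffer[i] = inBuffer[i] * 0.5f;
}
```

The scalar tail loop matters: if numSamples isn’t a multiple of four, the vector loop alone would drop the last few samples.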


#3

Thanks @IvanC ! “the SIMDRegister object is basically an array of 4 floats” really is an eye-opener for me.
Question: I was looking at the same tutorial and saw the interleaving part with the AudioBlock: is interleaving always the better strategy for these optimizations? Could a SIMD vectorization be equivalent if done on the two channels sequentially?

Asking because I really can’t get to like programming my process blocks thinking in interleaved fashion :smiley:


#4

Hello @fefanto !

For the interleaving, it depends on what kind of optimization you want to achieve. Don’t forget that there is no magic at all in the SIMDRegister class, or in SIMD vectorization in general :wink: Sometimes a process is simply impossible to “parallelize”, mostly in cases where the result for sample n+1 depends on the result for sample n. IIR filter processing is the most obvious example of such an issue.

So what is still possible there is to parallelize the multi-channel processing using SIMD, but not the process itself for a given single channel. To do that, you need to find the most efficient way to get that SIMDRegister variable filled, so you can apply your operations. For an IIRFilter, OK, you can’t parallelize one process, but you can still vectorize if you need to run it four times for every sample (think multi-channel processing of course, but also parallel filtering, for example with N*4 bandpass filters in parallel for every sample). The obvious way is to create one SIMDRegister at every for-loop iteration and then fill its content, but you’re going to do a lot of memory operations this way.
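The “vectorize across filter instances, not across time” idea can be sketched in plain C++ (no JUCE; OnePole4 is my own illustrative struct, not a library class). Four independent one-pole lowpass filters advance in lockstep, one per lane, which is exactly the per-sample work a SIMDRegister would collapse into single instructions:

```cpp
#include <cassert>

// Sketch only: four one-pole lowpass filters run in lockstep, one per lane.
// The recursion y[n] = a*x[n] + (1-a)*y[n-1] can't be vectorized across
// time, but it vectorizes trivially across four independent channels.
struct OnePole4
{
    float a;          // smoothing coefficient, shared by all four filters
    float y[4] = {};  // one state value per lane / channel

    void processSample (const float x[4])
    {
        for (int lane = 0; lane < 4; ++lane)  // one SIMD op in practice
            y[lane] = a * x[lane] + (1.0f - a) * y[lane];
    }
};
```

Called once per sample frame with four channel values, this does the work of four scalar filters for (ideally) the cost of one.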

A better way is to access some already-aligned audio data directly through a pointer, and then process it in place. AudioBuffer data is already aligned where possible, but unfortunately for multi-channel vectorization the data is ordered sequentially with regard to samples: you have your N samples for channel 0, then your N samples for channel 1, etc. That’s where the interleaving algorithm is important, since it reorganizes the data prior to processing, at as low a CPU cost as possible thanks to its associated optimizations: the four sample-0 values for channels 0-3, then the four sample-1 values, and so on.

Processing them this way lets you vectorize the multi-channel processing and use the SIMDRegister operations, which are handy since they let you write your DSP algorithm exactly the same way you would for float variables, with the + and * operators, without changing anything. And of course you need to do the inverse interleaving at the end of the function to return the result with the right organization.

It’s quite usual in SIMD development to change the organization of a sample array before processing it with SIMD operations, so you can reduce the number of instructions needed to perform a given task. For example, in the Convolution class I used a similar trick at some point, so that the convolution operation itself can be done with only 4 FloatVectorOperations calls on a whole array of samples that the FFTs organize in an erratic way.


#5

Thanks, Ivan.

So the tutorial’s example:

SIMDRegister<float> calculateDSPEffect (SIMDRegister<float> x,
                                        SIMDRegister<float> y)
{
    auto z = x + (y * 2.0f);
    return z;
}

…is actually misleading. It’s not as simple as wrapping a single float: each register is actually an array of 4 floats. And your explanation makes it sound useful for multi-channel parallel operations, but not really for serial operations. I too don’t like the idea of working with interleaved blocks of audio, especially when the block size is usually the audio buffer size. Having to interleave a relatively small buffer to run an operation like a filter, and then de-interleave it, doesn’t sound efficient. I’ll have to run some tests.