Hello @fefanto !
For the interleaving, it depends on what kind of optimization you want to achieve. Don’t forget that there is no magic at all in the SIMDRegister class, or in SIMD vectorization in general. Sometimes a process is just impossible to “parallelize”, mostly when the result for sample n+1 depends on the result for sample n. IIR filter processing is the most obvious example of this issue.
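To make that dependency concrete, here is a minimal sketch in plain C++ (no JUCE). The function name `onePoleLowpass` and the coefficient `a` are only illustrative assumptions; the point is that each output needs the previous one, so a single channel cannot be spread across SIMD lanes.

```cpp
#include <cstddef>
#include <vector>

// One-pole low-pass: y[n] = a * x[n] + (1 - a) * y[n - 1].
// Every output depends on the previous output, so the iterations of this
// loop have to run one after the other for a given channel.
std::vector<float> onePoleLowpass(const std::vector<float>& x, float a)
{
    std::vector<float> y(x.size());
    float state = 0.0f; // holds y[n - 1], carried from one iteration to the next

    for (std::size_t n = 0; n < x.size(); ++n)
    {
        state = a * x[n] + (1.0f - a) * state;
        y[n] = state;
    }

    return y;
}
```

Calling `onePoleLowpass({1.0f, 1.0f, 1.0f}, 0.5f)` shows the chain: each value is built from the one before it.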
So what is still possible there is to parallelize multi-channel processing using SIMD, but not the process itself for a single given channel. To do that, you need to find the most efficient way to get a SIMDRegister variable filled, so you can apply your operations. For an IIRFilter, OK, you can’t parallelize one process, but you can if you need to run it 4 times for every sample (think multi-channel processing of course, but also parallel filtering, for example with N*4 bandpass filters in parallel for every sample). The obvious way is to create one SIMDRegister at every for loop iteration and then fill its content, but you’re going to do a lot of memory operations this way.
A better way would be to directly access some already aligned audio data through a pointer, and process it in place. AudioBuffer data is already aligned when possible, but unfortunately for multi-channel vectorization, the data is stored channel by channel: you have your N samples for channel 0, then your N samples for channel 1, etc. That’s where the interleaving algorithm is important: at a CPU cost kept as low as possible thanks to the associated optimizations, it reorganizes the data prior to processing, with the 4 samples at index 0 for channels 0-3, then the 4 samples at index 1, etc.
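The layout change can be sketched with plain scalar loops. This is only an illustration of the reordering, not the optimized routine; `interleave4`, `deinterleave4` and `kLanes` are hypothetical names, and the sketch assumes exactly 4 channels.

```cpp
#include <cstddef>

constexpr std::size_t kLanes = 4; // assumed: 4 channels, one per SIMD lane

// Planar layout:      [ch0 s0, ch0 s1, ...] [ch1 s0, ch1 s1, ...] ...
// Interleaved layout: [ch0 s0, ch1 s0, ch2 s0, ch3 s0, ch0 s1, ch1 s1, ...]
void interleave4(const float* const* planar, std::size_t numSamples, float* interleaved)
{
    for (std::size_t n = 0; n < numSamples; ++n)
        for (std::size_t ch = 0; ch < kLanes; ++ch)
            interleaved[n * kLanes + ch] = planar[ch][n];
}

// Inverse operation: back to one contiguous array per channel.
void deinterleave4(const float* interleaved, std::size_t numSamples, float* const* planar)
{
    for (std::size_t n = 0; n < numSamples; ++n)
        for (std::size_t ch = 0; ch < kLanes; ++ch)
            planar[ch][n] = interleaved[n * kLanes + ch];
}
```

After interleaving, every group of 4 consecutive floats holds the same sample index for channels 0-3, so it can be loaded into one 4-lane register with a single contiguous read.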
Processing them this way lets you vectorize the multi-channel processing and use the SIMDRegister operations, which are handy because you can write your DSP algorithm exactly the same way you would for float variables, with the + and * operators, without changing anything. And of course you need to do the inverse operation (deinterleaving) at the end of the function to return the result in the original layout.
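Here is a rough sketch of the “same code for float and for a register type” idea. `Vec4` is a hypothetical 4-lane stand-in written in plain C++ so the example compiles without JUCE; it is not the SIMDRegister API, and the one-pole filter is just an assumed example algorithm.

```cpp
#include <cstddef>

// Hypothetical 4-lane value type standing in for a SIMD register.
struct Vec4
{
    float lane[4];

    static Vec4 load(const float* p)
    {
        Vec4 v;
        for (int i = 0; i < 4; ++i) v.lane[i] = p[i];
        return v;
    }

    void store(float* p) const
    {
        for (int i = 0; i < 4; ++i) p[i] = lane[i];
    }
};

inline Vec4 operator+(Vec4 a, Vec4 b)
{
    for (int i = 0; i < 4; ++i) a.lane[i] += b.lane[i];
    return a;
}

inline Vec4 operator*(Vec4 a, Vec4 b)
{
    for (int i = 0; i < 4; ++i) a.lane[i] *= b.lane[i];
    return a;
}

// The DSP step is written once with + and *; T can be float or Vec4.
template <typename T>
T onePoleStep(T x, T a, T oneMinusA, T& state)
{
    state = a * x + oneMinusA * state;
    return state;
}

// Filters 4 channels at once on interleaved data, one coefficient per channel.
void processInterleaved4(float* interleaved, std::size_t numSamples, const float* coeffPerChannel)
{
    float oneMinus[4];
    for (int i = 0; i < 4; ++i) oneMinus[i] = 1.0f - coeffPerChannel[i];

    const Vec4 a = Vec4::load(coeffPerChannel);
    const Vec4 oneMinusA = Vec4::load(oneMinus);
    const float zeros[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    Vec4 state = Vec4::load(zeros);

    for (std::size_t n = 0; n < numSamples; ++n)
    {
        const Vec4 x = Vec4::load(interleaved + n * 4);   // sample n, channels 0-3
        onePoleStep(x, a, oneMinusA, state).store(interleaved + n * 4);
    }
}
```

The recursion over n is still sequential, but each iteration now advances 4 independent filters. With a real SIMD type, each `+` and `*` would map to one vector instruction instead of the small loops used here.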
It’s quite usual in SIMD development to change the organization of a sample array before processing it with SIMD operations, so you can reduce the number of instructions needed to perform a given task. For example, in the Convolution class, I used a similar trick at some point so that the convolution operation itself can be done on a whole array of samples, which the FFTs leave organized in an erratic way, using only 4 FloatVectorOperations calls.
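As an illustration of that kind of reorganization (a sketch under assumptions, not the actual Convolution code): if the real and imaginary parts of two spectra are stored in separate contiguous arrays, their complex product needs only four bulk array operations. The helpers `vecMultiply`, `vecAddProduct` and `vecSubtractProduct` are hypothetical plain C++ stand-ins for vectorized bulk calls.

```cpp
#include <cstddef>

// Hypothetical bulk helpers; a SIMD library would provide optimized versions.
void vecMultiply(float* dest, const float* a, const float* b, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i) dest[i] = a[i] * b[i];
}

void vecAddProduct(float* dest, const float* a, const float* b, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i) dest[i] += a[i] * b[i];
}

void vecSubtractProduct(float* dest, const float* a, const float* b, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i) dest[i] -= a[i] * b[i];
}

// Complex multiply of n bins in split layout (assumed: separate re/im arrays):
//   outRe = aRe * bRe - aIm * bIm
//   outIm = aRe * bIm + aIm * bRe
// The output arrays must not alias the inputs.
void complexMultiplySplit(const float* aRe, const float* aIm,
                          const float* bRe, const float* bIm,
                          float* outRe, float* outIm, std::size_t n)
{
    vecMultiply(outRe, aRe, bRe, n);         // call 1
    vecSubtractProduct(outRe, aIm, bIm, n);  // call 2
    vecMultiply(outIm, aRe, bIm, n);         // call 3
    vecAddProduct(outIm, aIm, bRe, n);       // call 4
}
```

With interleaved re/im pairs, the same product would need shuffles inside every bin; splitting the data up front is what lets each call run straight through a contiguous array.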