SIMD Register Tutorial Multichannel

Hi all, I tried to follow the SIMD Tutorial (JUCE: Tutorial: Optimisation using the SIMDRegister class)

I can’t figure out how to use it for more channels than the register size.
I’m on an M1 Mac, which supports 128-bit registers, so I can fit 4 floats in one.
For simplicity I’d process only multiples of 4, so let’s say 12 channels.

To test the multichannel stuff, I created an AudioBuffer with 12 channels and copied sample data into it in getNextAudioBlock (const juce::AudioSourceChannelInfo& bufferToFill).

Should I then create multiple IIR filters in the SIMDTutorialFilter?
std::vector<std::unique_ptr<dsp::IIR::Filter<dsp::SIMDRegister<float>>>> iir;

Then in process(), align and interleave each group of channels and call iir->process on all of them?

    for (int i = 0; i < numChannels / registerSize; ++i)
    {
        auto subInputBlock = inputBlock.getSubsetChannelBlock (i * registerSize, registerSize);
        auto inChannels = prepareChannelPointers (subInputBlock);

        using Format = AudioData::Format<AudioData::Float32, AudioData::NativeEndian>;

        AudioData::interleaveSamples (AudioData::NonInterleavedSource<Format> {, registerSize },
                                      AudioData::InterleavedDest<Format>      { toBasePointer (interleaved.getChannelPointer (0)), registerSize },

        auto outputBlock = context.getOutputBlock();
        auto subOutputBlock = outputBlock.getSubsetChannelBlock (i * registerSize, registerSize);
        auto outChannels = prepareChannelPointers (subOutputBlock);

        AudioData::deinterleaveSamples (AudioData::InterleavedSource<Format>  { toBasePointer (interleaved.getChannelPointer (0)), registerSize },
                                        AudioData::NonInterleavedDest<Format> {, registerSize },
                                        numSamples);
    }
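
To make the per-group idea concrete without JUCE, here is a minimal standalone C++ sketch. It is not the tutorial's code. `GroupFilter` is a hypothetical stand-in for one SIMD filter (a simple one-pole low-pass with one state value per lane), and `processBlock` and `Channels` are made-up names. It assumes the channel count is a multiple of 4. The point it illustrates is that every group of 4 channels gets its own filter object, so its state carries over from block to block; whether shared or reset state is what causes the crackling above is only a guess.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t registerSize = 4; // floats per 128-bit register

// Hypothetical stand-in for one SIMD filter: a one-pole low-pass
// that keeps one state value per lane (i.e. per channel in the group).
struct GroupFilter
{
    float a = 0.5f;
    std::array<float, registerSize> state {};

    // Processes interleaved frames in place; frame n holds lanes n*4 .. n*4+3.
    void process (std::vector<float>& interleaved)
    {
        for (std::size_t n = 0; n < interleaved.size() / registerSize; ++n)
            for (std::size_t lane = 0; lane < registerSize; ++lane)
            {
                float& s = interleaved[n * registerSize + lane];
                state[lane] += a * (s - state[lane]);
                s = state[lane];
            }
    }
};

using Channels = std::vector<std::vector<float>>;

// Filters all channels in place, one group of registerSize channels at a time.
// filters[g] is used only for group g, so its state is never mixed with another group.
void processBlock (Channels& channels, std::vector<GroupFilter>& filters)
{
    const std::size_t numGroups  = channels.size() / registerSize;
    const std::size_t numSamples = channels.empty() ? 0 : channels[0].size();
    assert (filters.size() >= numGroups);

    // Scratch buffer, reused for every group.
    std::vector<float> interleaved (numSamples * registerSize);

    for (std::size_t g = 0; g < numGroups; ++g)
    {
        const std::size_t first = g * registerSize;

        // Interleave the 4 channels of this group.
        for (std::size_t n = 0; n < numSamples; ++n)
            for (std::size_t lane = 0; lane < registerSize; ++lane)
                interleaved[n * registerSize + lane] = channels[first + lane][n];

        filters[g].process (interleaved);

        // Deinterleave back into the same 4 channels.
        for (std::size_t n = 0; n < numSamples; ++n)
            for (std::size_t lane = 0; lane < registerSize; ++lane)
                channels[first + lane][n] = interleaved[n * registerSize + lane];
    }
}
```

For 12 channels this would be called with a `std::vector<GroupFilter>` of size 3 that lives as long as the processor does, so the three filters keep their state between successive blocks.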

It feels like the wrong thing to do, and it also introduces crackling on all but the first 4 channels :confused:
Any better solutions?
thanks! :heart: