Oversampling efficiency

Oversampling2TimesPolyphaseIIR::processSamplesUp() seems to be the most CPU-intensive part of my current project.

I see that the processing there is done like this:

for (size_t i = 0; i < numSamples; ++i)
    // Direct path cascaded allpass filters
    auto input = samples[i];
    for (auto n = 0; n < directStages; ++n)


I have the feeling that it would be much more efficient to put the outer loop on the stages and the inner loop on the samples, even if that implies an extra buffer.
Did you consider or measure it?
Or is there a good reason for doing it this way?

Hello!

No specific reason for this. You might be right: having the loops the other way around might improve performance, at the cost of an extra audio buffer. I’ll do some testing and see if I can optimize that a bit.

I’m sorry to nag about this again, but there is a really big chance for optimization that is missing currently, and it made me go back to the hiir library. The polyphase IIR filters could be computed in parallel with SSE to perform almost twice as well. Just have a look at hiir and its SSE implementation. The loop change mentioned above would also help, of course, and it could probably be done without an extra buffer by making multiple passes over the large output buffer, using offsets to avoid overwriting data.

but there is a really big chance for optimization that is missing currently

The current approach to vectorization in the DSP module targets multi-mono use cases (interleave your data, then evaluate each channel in parallel using SIMDRegister). Whether that fits your application, or whether it is better than vectorizing individual channels, depends a lot on what you’re doing.

If you want single channel speed you can use Intel’s IPP library.

If you want to write your own DSP there are a lot of references for vectorizing SISO IIR filters. Here are some of them.

I am aware of these things, but what I mean is taking advantage of the polyphase IIR structure. By design it uses two filters in parallel, and these can be evaluated at the same time using SIMD. The JUCE SIMD abstractions could be used for that. In its current state it is a missed opportunity, and hiir does exactly this with better performance.

I had another look, and one good reason for the current loop order is that samples only have to be read from memory once, instead of going through the buffer for each stage. However, it would be easy to use a single loop for both the direct and the delayed stages.
numStages is always even, so both loops iterate the same number of times. That might even give the compiler a chance to do the SIMD optimization for us, since the loop body would contain the same independent code twice.

Even better would be a template solution where numStages/directStages is a template parameter. That way the compiler could completely eliminate the per-sample loop iteration checks.