Oversampling efficiency

I had another look, and one good reason for the current structure is that each sample only has to be read from memory once, instead of running through the whole buffer once per stage. However, it would be easy to use a single loop for both the direct and the delayed stages.
numStages is always even, so both loops iterate the same number of times. That might even give the compiler a chance to do the SIMD optimization for us, since the loop body would then contain the same independent code twice.

Even better would be a template solution where numStages/directStages is a template parameter. With the stage count known at compile time, the compiler could unroll the stage loop completely and drop the loop-condition checks that otherwise run for every sample.
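
A minimal sketch of the template variant, under the same assumptions as before (first-order allpass stages, direct coefficients stored first, delayed coefficients second). The class name `PolyphaseUpsampler` and the parameter `StagesPerPath` are made up for this example and are not part of any existing API.

```cpp
#include <array>
#include <cstddef>

// Hypothetical upsampler with the per-path stage count fixed at compile time.
template <std::size_t StagesPerPath>
struct PolyphaseUpsampler
{
    // Direct-path coefficients first, delayed-path coefficients second.
    std::array<float, 2 * StagesPerPath> coefs {};
    std::array<float, 2 * StagesPerPath> state {};

    void process (const float* in, float* out, std::size_t numSamples) noexcept
    {
        for (std::size_t i = 0; i < numSamples; ++i)
        {
            float direct = in[i];
            float delayed = in[i];

            // The trip count is a compile-time constant, so an optimizing
            // compiler is free to unroll this loop fully.
            for (std::size_t n = 0; n < StagesPerPath; ++n)
            {
                const float aD = coefs[n];
                const float yD = aD * direct + state[n];
                state[n] = direct - aD * yD;
                direct = yD;

                const float aL = coefs[n + StagesPerPath];
                const float yL = aL * delayed + state[n + StagesPerPath];
                state[n + StagesPerPath] = delayed - aL * yL;
                delayed = yL;
            }

            out[2 * i] = direct;
            out[2 * i + 1] = delayed;
        }
    }
};
```

Usage would look like `PolyphaseUpsampler<4> up;` followed by filling `up.coefs` and calling `up.process (in, out, numSamples)`. The catch is that the stage count usually comes out of the filter design at runtime, so this would need a small dispatch over the supported stage counts, with the runtime loop kept as a fallback.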