Modulo in processBlock?

Hi there,

is this micro-optimising, or is there a real performance impact when using modulo calculations in the process block?

I have several wrapping commands around my ringbuffer read/write position.
Things like

rbWritePos += buffer.getNumSamples();
rbWritePos %= ringBuffer.getNumSamples();

or
auto startPos = (rbWritePos + rbNumSamples - blockSize) % rbNumSamples;

I’m wondering if I should get rid of it or not.
And if so, what would be the best way?
I cannot guarantee that my buffer size is a power of two.

Thanks
Stefan

You’d have to time it to see if there’s any difference, but the first thing that comes to mind is just putting an if statement in to wrap the number back into the range. Maybe write some unit tests and see what’s what.

It probably is a micro-optimisation, but modulo is indeed a slow operation and, in most cases of this kind, it’s easy to avoid. Just make your buffer size a power of 2, and then you can replace every % size with & (size - 1).


I agree with it being a micro-optimisation, but there is no need for a power of two.
Just use an if:

rbWritePos += buffer.getNumSamples();
if (rbWritePos >= ringBuffer.getNumSamples())
    rbWritePos -= ringBuffer.getNumSamples();

I know there is an argument against branching, but I have yet to see evidence that it is really that bad…

I’ve never measured a difference either, but in all my uses this is tiny for the context. It may be relevant for something heavily buffer-based. In abstract, I’d expect branching to be worse only if the buffer is not much bigger than the average advance, so that wrapping (hence misprediction) happens often. If that’s not the case, branching may actually be faster, as a correct guess would make it a no-op.

Branching is one of the fastest operations when the branch predictor guesses right, and one of the slowest when it guesses wrong. Modulus, on the other hand, is always pretty slow. So if a condition is true in most situations, like a ring buffer 1000 samples long where 999 out of 1000 times the write position is smaller than that length, the branch predictor will guess well and the branch will be fast. That’s why you put the condition that is true most of the time first in an if/else, especially in heavy loops/buffer processing. So in this situation, replacing the modulus with a branch is appropriate.

In an extreme situation like a random number generator between 0.0 and 1.0, trying to guess whether the number is bigger than 0.5 (a 50% chance) will make the branch predictor miss about half the time, so branching will indeed be very slow.

Edit: on modern processors


If you look at the performance impact of branching, it’s not just about the branch operation itself. Regardless of the outcome, every branching operation very much complicates optimization during compilation. Modern CPUs with long pipelines and many registers like long streaks of branchless code. For those, the compiler can shuffle some operations around to improve pipelining, use as many registers as possible and avoid waiting for operations with longer latencies.
This can make a dramatic performance difference, especially if multiple branching happens on every sample of a DSP loop.


Most modern processors execute branching operations in parallel with the FPU, and when guessed right the branch executes in only one cycle on most CPUs (the same as an addition or subtraction). This means branches can be virtually free when done right. Modulus is just slow, as it implies division, and if you are already saturating your FPU with other arithmetic it will just be slower no matter what.

As for the compiler not doing a good job: if you reach a level where a modulus micro-optimisation is needed, then you are not leaving the compiler any freedom to guess. That means you vectorise with SIMD intrinsics (even with SIMD assembly, as some compilers don’t do a good job with some intrinsics, like the NEON ones), pipeline them, interleave them with normal operations if the architecture has parallel FPU and SIMD units, optimise the cache access or even the registers in assembly, etc.

But to simply end the discussion: OP, just profile using modulus in your processBlock versus an if statement like Daniel’s code, then pick what’s best.

I’m not that knowledgeable about the costs, but why is it a ‘real performance impact’?

From the code snippet it looks like this happens per buffer rather than per sample, or have I missed something?

Also, this is non-FPU. It’s plain integers, and 32-bit ones at that, which fit easily into a single register on 64-bit devices.

And I assume a decent compiler tries to optimise such things.

But then again, I might be missing something :slight_smile:

One suggestion though,
Try compiling a minimal version of whatever you want to evaluate on Compiler Explorer, so you can judge this better, I guess.


This is a classic case of micro-optimisation.

Using a modulo for buffer position wrapping won’t be your bottleneck.


Using a power of two so you can wrap with & (length - 1) can also be quite wasteful in memory for longer delay lines.


I think this is indeed valid and measurable optimisation if we’re talking about operations that run at audio rate, in particular if you need a bunch of those.
Good example might be a reverb that uses several relatively short delay lines.

The pretty straightforward reason for modulus being slow is that it compiles to an integer division, way more expensive than, let’s say, a 32-bit float multiplication. Thus, the same best practices as for divisions in general apply. Check out Compiler Explorer to try it out, and maybe consult the comprehensive instruction cost lists for various architectures at https://www.agner.org/optimize/instruction_tables.pdf - see the high latency for IDIV, for example on Skylake, and compare it with MUL.

I’ve conclusively measured modulus making a difference in practice.
It probably won’t be a “bottleneck”, neither will it be very visible if the other stuff in that inner loop consists of 20 expensive trigonometry functions :wink:

For integer wrapping, like high efficiency ringbuffers, I tend to go with power of 2 sizes, ideally align the memory too. Then doing the increment and bitmasking.

For floating point wrapping (like phase of an oscillator from 0 to 2 pi), the subtract-if works, even though it looks somewhat ugly in the source code. If you look at the disassembly from an optimizing compiler, chances are good this will result in a conditional move, so really not much wasted here.

While at that, and to venture into actual micro-optimisation terrain: I also observed that, depending on the particular compiler, the precise formulation can sometimes make a difference, in particular using a ternary operator vs. a plain old if statement like in the example given by daniel. It becomes much harder to measure a difference reliably because of other transient phenomena involving thermal state, exact memory layout vs. cache, and such things. In such cases, it’s probably better to look elsewhere to improve performance and not get totally lost in it, fascinating as this stuff can be :slight_smile:

…which is trivial to hide behind a free function, if you are allergic against the word if :wink: :

float advance (juce::AudioBuffer<float>& buffer, float& position, float increment = 1.0f)
{
    auto newPosition = position + increment;
    return newPosition < buffer.getNumSamples() ? newPosition : newPosition - buffer.getNumSamples();
}

advance (buffer, rbWritePos);

It’s easy to get distracted by the representation. I like to read code like prose, “if this condition then that happens”. But some people prefer to see formulas, which can be accommodated.

I would be surprised if the compiler makes a difference between this implementation and spelling out the “if”.

Hey Daniel,
you’re right, which is why I got confused once when this was not the case. It probably had to do with global optimization and the surrounding code… I really can’t say. It was not worth investigating any further.
When checking in Compiler Explorer with an isolated example, msvc 64 latest and /O2, results are indeed equivalent.

With the example you gave, there may be ugly warnings regarding implicit type conversions though :wink:

Also, it looks like you made me fall into the Compiler Explorer rabbit hole again… I’m currently trying to find out under what conditions the compiler thinks that a branch is better compiled into short jumps (probably because it’s trivial to predict), and when it will generate conditional moves instead… There goes my productivity!

Thanks for the feedback. It was meant for illustration only. But where do you see a conversion warning?

But the actual bug is, it should not return the position but assign it :man_facepalming:
Here it is again, corrected:



void advance (juce::AudioBuffer<float>& buffer, float& position, float increment = 1.0f)
{
    auto newPosition = position + increment;
    position = newPosition < buffer.getNumSamples() ? newPosition : newPosition - buffer.getNumSamples();
}

advance (buffer, rbWritePos);

Yeah, turned the idea into something Compiler Explorer friendly anyway. Doesn’t affect the generated code much.
I’d expect a warning related to mixing floats and buffer.getNumSamples(). But doesn’t really matter.

Just observed: clang output is TOTALLY different from what gcc and msvc produce.

Right now playing with this, trying out slight variations and compiler settings and observing what changes:

const int numSamples = 512;

inline float advance (float& position, float increment)
{
    float newPosition = position + increment;
    if (newPosition >= numSamples)
        newPosition -= numSamples;
    return newPosition;
}

int main(){
    float pos = 0;
    for (int k = 0; k < 1024; ++k)
    {
        pos = advance(pos, 0.5f);
    }
    return pos;
}
 

Think I figured it out: MSVC (x64, O2) may generate cmov when using the ternary operator, but it may not if you use an if statement. Here is an example where this is the case in the generated code for advance and advance_if. Note that this doesn’t affect main, it’s all inlined, compiler figures out that the result is equivalent. So, in the case of MSVC, ternary vs if can actually make a difference.
Academic and kinda irrelevant, but somewhat interesting I think :slight_smile:

const int numSamples = 512;

void advance (int& position)
{
    ++position;
    position = (position >= numSamples ? (position - numSamples) : position);
}

void advance_if (int& position)
{
    ++position;
    if (position >= numSamples)
         position -= numSamples;
}


int main(){
    int pos = 0;
    for (int k = 0; k < 1024; ++k)
    {
        advance(pos);
    }
    return pos;
}

Assembly:

position$ = 8
void advance(int &) PROC                                ; advance, COMDAT
        mov     edx, DWORD PTR [rcx]
        inc     edx
        cmp     edx, 512                      ; 00000200H
        lea     eax, DWORD PTR [rdx-512]
        cmovl   eax, edx
        mov     DWORD PTR [rcx], eax
        ret     0
void advance(int &) ENDP                                ; advance

position$ = 8
void advance_if(int &) PROC                   ; advance_if, COMDAT
        mov     eax, DWORD PTR [rcx]
        inc     eax
        mov     DWORD PTR [rcx], eax
        cmp     eax, 512                      ; 00000200H
        jl      SHORT $LN2@advance_if
        add     eax, -512               ; fffffffffffffe00H
        mov     DWORD PTR [rcx], eax
$LN2@advance_if:
        ret     0
void advance_if(int &) ENDP                   ; advance_if

main    PROC                                            ; COMDAT
        xor     eax, eax
        mov     r8d, 512                      ; 00000200H
        mov     r9d, 2
        npad    2
$LL4@main:
        lea     ecx, DWORD PTR [rax+1]
        mov     edx, -510               ; fffffffffffffe02H
        cmp     ecx, 512                      ; 00000200H
        cmovl   edx, r9d
        lea     ecx, DWORD PTR [rdx+rax]
        cmp     ecx, 512                      ; 00000200H
        lea     eax, DWORD PTR [rcx-512]
        cmovl   eax, ecx
        sub     r8, 1
        jne     SHORT $LL4@main
        ret     0
main    ENDP