How to keep memory blocks as close to each other as possible in C++?

This is more of a C++ question than a JUCE or audio one.
I would ask it on Stack Overflow, but they banned me for asking too many stupid questions :slight_smile:

I am currently optimising my plugin, in which I have implemented my own FFT algorithm, so the calculations are quite heavy.

I’ve heard that C++ code runs faster if the variables are stored in memory blocks that are as near to each other as possible.

So when designing classes I always wonder whether it is faster to create a variable as a class member, for example:

   float someFloat;

and define it in the constructor, for example like this:
someFloat = 20.0f * log10((double)someVector.size()) / someOtherClassVariable.getSomeValue();

and then use that someFloat in some method that contains a fairly big loop, like this:

void calcMethod()
{
    for (int i = 0; i < bigMaaan; ++i)
        for (int j = 0; j < bigWoman; ++j)
            outputValue[i][someIndexVector[j]] = someFloat * something;
}

Or would it be better to do it like this:

void calcMethod()
{
    float someFloat = 20.0f * log10((double)someVector.size()) / someOtherClassVariable.getSomeValue();

    for (int i = 0; i < bigMaaan; ++i)
        for (int j = 0; j < bigWoman; ++j)
            outputValue[i][someIndexVector[j]] = someFloat * something;
}


And another question, about someIndexVector from the example above. Where is the best place in the class to define it and all its elements, to avoid the program jumping like crazy through memory blocks?
Or would it be better to calculate the index in place, like this:
outputValue[i][(i*(int)someDouble)%(someComponen.getWidth()*(j+1))] = someFloat * something;

Please don’t ask about the calculations themselves; they make no sense. I just wanted to show that there are some complicated (from a human point of view) operations.

Could anyone give me some hints?
Great thanks in advance.

Your second code example is definitely the better option because it reduces the scope of the someFloat variable. Making it a member of the class means every method of the class has access to it and so could change its value. Even if you are the only one working on this code and you promise yourself you won’t change the value of someFloat anywhere other than in calcMethod(), it’s still not good practice.

In the second example, someFloat is only available within the only method where it’s being used - which is much better!
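To make the scoping point concrete, here is a minimal sketch. The Processor class, its constructor and the simplified single loop are all made up for illustration (they are not from the original code), and the division by someOtherClassVariable.getSomeValue() is left out to keep it short:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the class in the question; every name is illustrative.
class Processor
{
public:
    explicit Processor(std::size_t size) : someVector(size, 0.0f)
    {
        assert(size > 0); // log10(0) would give -infinity
    }

    // someFloat lives only inside this method, so no other method can
    // modify it, and it always matches the current size of someVector.
    std::vector<float> calcMethod(float something) const
    {
        const float someFloat =
            20.0f * static_cast<float>(std::log10(static_cast<double>(someVector.size())));

        std::vector<float> output(someVector.size());
        for (std::size_t i = 0; i < output.size(); ++i)
            output[i] = someFloat * something;

        return output;
    }

private:
    std::vector<float> someVector;
};
```

Marking the local as const also documents that it is not supposed to change inside the loop.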

Even if calcMethod is run often? And please note that I use (double)someVector.size() in someFloat, so it calls the same thing every time I call calcMethod. Isn’t that inefficient?

Your doubly nested for loop is much greater in computational complexity than the single calculation happening before it, at least if the loops do a reasonable amount of work. So it likely doesn’t matter in the big picture if you recalculate the value each time the function is called.

This is all the more reason to reduce the scope of someFloat and recalculate it every time calcMethod() is called: if the size of someVector changes, then someFloat will need to be recalculated. If you’re really concerned, run some tests with your calculations to see whether you can find any significant differences in execution time between the different techniques!
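A rough way to run such a test is to time both variants with std::chrono. This is only a sketch: averageMicroseconds, fillRecomputed and fillCached are hypothetical names invented for the example, and the loop body is a simplified stand-in for the real calculation:

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: calls work() `repeats` times and returns the
// average wall-clock time per call in microseconds.
template <typename Work>
double averageMicroseconds(Work&& work, int repeats)
{
    assert(repeats > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r)
        work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(stop - start).count() / repeats;
}

// Variant A: the gain is recomputed on every call.
void fillRecomputed(std::vector<float>& out, float something)
{
    const float gain =
        20.0f * static_cast<float>(std::log10(static_cast<double>(out.size())));
    for (float& v : out)
        v = gain * something;
}

// Variant B: the gain was computed once elsewhere and is passed in.
void fillCached(std::vector<float>& out, float cachedGain, float something)
{
    for (float& v : out)
        v = cachedGain * something;
}
```

Wrap each variant in a lambda, pass it to averageMicroseconds with a large repeat count, and compare the two numbers. Measure an optimised (release) build with realistic buffer sizes, and keep in mind that an optimiser may remove work whose result is never used, so the timings are only meaningful if the output is actually consumed.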

OK, great thanks, but what about someIndexVector[j], which is nested in the loop? Is it better to calculate it in place as I showed, or to prepare it beforehand in a different method? It seems to be even more important than someFloat.

Better to benchmark it yourself.

May I ask why you are bothering to write your own FFT? There are plenty of FFT libraries around for all licensing needs with close to optimal performance, and if you struggle to profile a hot spot in a loop, chances are good that the people who wrote those FFTs will outsmart you in this particular field…

Hmm, these seem pretty small; I doubt you’ll see much of a change simply by switching from an inline calculation to a variable. That said, there is truth to the memory locality problem.

I’m no expert, but I’ve seen large performance increases made by creating manual memory management systems.

Say you have a bunch of audio buffers. It is better to preallocate them all at once in one large object, so that they sit next to each other in memory for fast processing, as opposed to having each class hold its own separately allocated memory.
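As a minimal sketch of that idea (BufferPool and its members are hypothetical names, not a JUCE class), a single std::vector can hold all buffers back to back and hand out pointers into it:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical pool: one allocation holds every buffer back to back,
// so buffer k starts right after buffer k - 1 ends.
class BufferPool
{
public:
    BufferPool(std::size_t bufferCount, std::size_t bufferLength)
        : numBuffers(bufferCount),
          samplesPerBuffer(bufferLength),
          storage(bufferCount * bufferLength, 0.0f)
    {
    }

    // Returns a pointer to the first sample of the requested buffer.
    float* buffer(std::size_t index)
    {
        assert(index < numBuffers);
        return storage.data() + index * samplesPerBuffer;
    }

    std::size_t size() const { return samplesPerBuffer; }
    std::size_t count() const { return numBuffers; }

private:
    std::size_t numBuffers;
    std::size_t samplesPerBuffer;
    std::vector<float> storage; // the single contiguous block
};
```

All memory is allocated once in the constructor, so nothing is allocated while processing. Whether this is measurably faster than separately allocated buffers depends on the access pattern, so it is still worth benchmarking.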

Just for fun :slight_smile:

But mainly also to practise C++. I am new to programming, and I am fascinated by audio, DSP and everything about it. I like to understand the background of things and to see how they work. Nothing special.

Oh I see, in that case it makes sense :slight_smile:

If you want to learn how to write fast code, IMHO the best way is to load up, throw a bunch of simple algorithms at it and study what the compilers do (you need to be able to at least read assembly for this though). 99% of the things that you think make your code faster (unrolling loops, replacing divisions with multiplications) have no effect because the compilers already do them for you.

But as Xenakios said, one calculation at the beginning of a double nested loop won’t have any measurable impact on the performance, and if you use the variable in the loop it will be loaded into a CPU register anyway.

Hey thanks for
I am just testing it, but I am not sure how it works. What exactly does it do?
It looks like it builds a *.obj file in real time, or what? And how do I actually read those .obj files? I don’t understand that code.

It generates assembly in “real time”, but yeah, you need to at least be able to read a bit of x86 assembly in order to leverage this tool, which is highly recommended.

I regularly use it to see whether compilers auto-vectorize the code I write. For example this code:

void test(float* d, int numSamples)
{
    for (int i = 0; i < numSamples; i++)
        d[i] *= 5.0f;
}

generates this assembly with Clang at -O0:

        mov     eax, dword ptr [rbp - 16]
        cmp     eax, dword ptr [rbp - 12]
        jge     .LBB0_4
        movss   xmm0, dword ptr [rip + .LCPI0_0] # xmm0 = mem[0],zero,zero,zero
        mov     rax, qword ptr [rbp - 8]
        movsxd  rcx, dword ptr [rbp - 16]
        mulss   xmm0, dword ptr [rax + 4*rcx]
        movss   dword ptr [rax + 4*rcx], xmm0
        mov     eax, dword ptr [rbp - 16]
        add     eax, 1
        mov     dword ptr [rbp - 16], eax
        jmp     .LBB0_1

and this assembly with Clang at -O3:

        movups  xmm1, xmmword ptr [rdi + 4*rsi]
        movups  xmm2, xmmword ptr [rdi + 4*rsi + 16]
        movups  xmm3, xmmword ptr [rdi + 4*rsi + 32]
        movups  xmm4, xmmword ptr [rdi + 4*rsi + 48]
        mulps   xmm1, xmm0
        mulps   xmm2, xmm0
        movups  xmmword ptr [rdi + 4*rsi], xmm1
        movups  xmmword ptr [rdi + 4*rsi + 16], xmm2
        mulps   xmm3, xmm0
        mulps   xmm4, xmm0
        movups  xmmword ptr [rdi + 4*rsi + 32], xmm3
        movups  xmmword ptr [rdi + 4*rsi + 48], xmm4
        add     rsi, 16
        add     rdx, 2
        jne     .LBB0_8
        test    r8, r8
        je      .LBB0_11

(I’ve left out the boilerplate code that Clang generates for the beginning and end of the loop.) As you can see, there are a lot of movups and mulps instructions vs. mov and mulss. You can hover over each instruction to see a description, but a rule of thumb is that if the instruction name has ps in it (which means “packed single”), chances are good it’s a SIMD operation.

I’ve found MSVC to be particularly picky when it comes to autovectorizing, so make sure you test all compilers that you target with your projects.

It’s definitely going to destroy performance, as the compilers can’t vectorize this code.

edit: I was replying to the indirection question. MSVC is even worse here and can more or less only vectorize simple loops.
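To show what I mean by indirection, here is a small sketch with two hypothetical functions (scaleIndirect and scaleDirect are names I made up, and the bodies are simplified from the loop in the question). The first writes through an index table, the second writes to consecutive elements:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Indirect (scatter) write: the destination of each store depends on
// indices[j], so the stores are not known to be consecutive in memory.
void scaleIndirect(std::vector<float>& out,
                   const std::vector<std::size_t>& indices,
                   float gain, float something)
{
    for (std::size_t j = 0; j < indices.size(); ++j)
    {
        assert(indices[j] < out.size()); // debug-only check, compiled out with NDEBUG
        out[indices[j]] = gain * something;
    }
}

// Direct write: element j is stored at position j, a simple contiguous
// loop of the kind compilers can usually turn into SIMD code.
void scaleDirect(std::vector<float>& out, float gain, float something)
{
    for (std::size_t j = 0; j < out.size(); ++j)
        out[j] = gain * something;
}
```

Both functions give the same result when the index table covers every position exactly once. Whether a particular compiler vectorizes either loop is something to check in the generated assembly rather than assume.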