FloatVectorOperations performance on Windows

Hi,

I've been trying to do some DSP calculations on Windows using FloatVectorOperations::add. On macOS it uses the Accelerate framework internally and works like a charm. But on Windows the performance is so much worse that the code is almost unusable: it's something like 5-10 times slower than on the Mac.

I dug into the FloatVectorOperations code and found that it uses _mm* functions, so I wrote my own code using intrinsics, and its performance is quite comparable to the Mac's. What's strange is that, with all macros and functions inlined, the FloatVectorOperations code is almost as simple as mine, yet it's still several times slower than the straight _mm* solution.

All tests are for aligned memory, which is all I need. I've got JUCE_USE_SSE_INTRINSICS set properly and the SSE2 option set in the Visual Studio compiler. I'm testing a 'Release' build with the fastest optimization setting.

When I checked a 'Debug' build in the profiler, it seems to spend about 1/3 of its time in the "function body" rather than in the intrinsic functions. My code spends nearly all of its time in the _mm* functions, leaving < 0.1% for the "function body", whatever that means in my case.

Tests are 1 million calculations, back and forth, on vectors of 2048 floats.

That's the whole background. My question is: am I missing some JUCE or VS compiler setting needed to use FloatVectorOperations properly? Have you ever run into this problem and got the Windows version working as well as the Mac one? Maybe there are some settings on Windows I just don't know about.

Any help would be greatly appreciated

Regards
mqqla

 

- USE_INTRINSICS turned On

- 1000 * 2048 random buffer test

- Almost 2x slower than a plain C++ loop (compiler optimization disabled)

- Checked in the debugger: the SSE functions are used, but are you sure these functions are actually inlined?

That's odd!

Do you have some benchmark code we can try, so we can reproduce the same thing you're seeing?

Hello Jules,

My UnitTest follows. Please be aware that the "Float" (FloatVectorOperations) code should be commented out to check the _mm* code. If you're able to look at this and figure out the problem... regards

mqqla



#include "JuceHeader.h"
#include "emmintrin.h"
//#include <Accelerate/Accelerate.h>
#define MK_VEC_SIZE 2048
class DSPTest : public UnitTest
{
    int64 _sTime;
    
    float *va, *vb, *vc;
    
public:
    
    void startTime() {
        _sTime = Time::getHighResolutionTicks();
    }
    
    void endTime(String msg) {
        int64 dif = Time::getHighResolutionTicks()-_sTime;
        
        double difv = ((double)dif) / Time::getHighResolutionTicksPerSecond();
        
        logMessage(msg + " " + String((float)difv));
    }
    DSPTest() : UnitTest("DSPTest testing") {}
    void initialise() {
        va = (float*)_mm_malloc(MK_VEC_SIZE * sizeof(float), 32);
        vb = (float*)_mm_malloc(MK_VEC_SIZE * sizeof(float), 32);
        vc = (float*)_mm_malloc(MK_VEC_SIZE * sizeof(float), 32);
    }
    void runTest() {
        
        beginTest("Performance test");
        int reps = 1000000;
        startTime();
        for(int i = 0; i < reps; i++) {
            vaddlocal(va, 1, vb, 1, vc, 1, MK_VEC_SIZE);
        }
        endTime("vaddlocal 1 - performance test");
        
    }
    
    void shutdown() {
        _mm_free(va);
        _mm_free(vb);
        _mm_free(vc);
    }
    
    void vaddlocal(const float *A,
                   long  IA,
                   const float *B,
                   long  IB,
                   float       *C,
                   long  IC,
                   unsigned long  N) {
        
        //vDSP_vadd(A, IA, B, IB, C, IC, N);
        //return;

        // JUCE path. Comment out this block (including the return) to time
        // the intrinsics loop below instead. The strides IA/IB/IC are unused.
        if (A == C) {
            FloatVectorOperations::add(C, B, (int) N);
        } else if (B == C) {
            FloatVectorOperations::add(C, A, (int) N);
        } else {
            FloatVectorOperations::add(C, A, B, (int) N);
        }
        return;
        
        // Plain SSE path: assumes 16-byte aligned pointers and N divisible by 4.
        for (unsigned long i = 0; i < N; i += 4) {
            __m128 incA = _mm_load_ps(A + i);
            __m128 incB = _mm_load_ps(B + i);
            __m128 acc  = _mm_add_ps(incA, incB);
            
            _mm_store_ps(C + i, acc);
        }
    }
};

    startTime();
    for (int i = 0; i < 10000; i++) {
        juce::FloatVectorOperations::add(vc, va, vb, 2048);
    }
    endTime("juce:: vadd ");

    startTime();
    for (int i = 0; i < 10000; i++) {
        for (int d = 0; d < 2048; d++) {
            vc[d] = va[d] + vb[d];
        }
    }
    endTime("c:: vadd ");

My tests:

- Intel i7, Windows 7 32-bit, VC 2013, compiler optimization disabled, aligned memory; confirmed that the juce:: vector code uses SIMD intrinsics (the debugger did stop there :)

 

Results (in seconds):

juce:: vadd  0.0968898

c:: vadd     0.0518656

 

(I even stored computed values to prevent further compiler optimizations - but no difference)

Thanks - seems to be something that we broke recently, will sort it out very soon!

Ok.. try again now. Seems like the compiler was refusing to optimise some references to vector types.

Thanks a lot - FVO performance is just like plain _mm* now - great! It'll save me a lot of time.

Regards

mqqla