FloatVectorOperations performance on Windows


#1

Hi,

I've been trying to do some DSP calculations on Windows using FloatVectorOperations::add. On macOS it uses the Accelerate framework internally and works like a charm. But on Windows, performance degrades so much that the code is almost unusable - it's something like 5-10 times slower than on the Mac.

I dug into the FloatVectorOperations code and found it uses _mm* functions, so I wrote my own code using intrinsics - and its performance is quite comparable to the Mac's. What's strange is that, with all macros and functions inlined, the FloatVectorOperations code is almost as simple as mine - yet it's still several times slower than the straight _mm* solution.

All tests use aligned memory, which is all I need. I've got JUCE_USE_SSE_INTRINSICS set properly and the SSE2 option enabled in the Visual Studio compiler. I'm testing a 'release' build with the fastest optimization.

When checked in a 'Debug' build with the Profiler, it seems a third of the time is spent in the "function body" rather than in the intrinsics functions. My code spends all of its time in the _mm* functions, leaving < 0.1% of the time for the "function body", whatever that means in my case.

The tests run 1 million calculations back and forth on 2048-element float vectors.

That's the whole background. My question is: am I missing some JUCE or VS compiler setting needed to use FloatVectorOperations properly? Have you ever run into this problem and got the Windows version working as well as the Mac one? Maybe there are some settings on Windows I just don't know about.

Any help would be greatly appreciated

Regards
mqqla

 


#2

- USE_INTRINSICS turned On

- 1000 * 2048 random buffer test

- Almost 2x slower than regular C++ (no compiler optimization)

- Checked in the debugger: the SSE functions are used, but are you sure these functions are actually inlined?


#3

That's odd!

Do you have some benchmark code we can try, so we can reproduce the same thing you're seeing?


#4

Hello Jules,

my UnitTest is below - please be aware that the "Float" block should be commented out to check the _mm* code instead. If you're able to look at this and figure out the problem... regards

mqqla



#include "JuceHeader.h"
#include <emmintrin.h>
//#include <Accelerate/Accelerate.h>

#define MK_VEC_SIZE 2048

class DSPTest : public UnitTest
{
    int64 _sTime;

    float *va, *vb, *vc;

public:
    DSPTest() : UnitTest("DSPTest testing") {}

    void startTime() {
        _sTime = Time::getHighResolutionTicks();
    }

    void endTime(String msg) {
        int64 dif = Time::getHighResolutionTicks() - _sTime;
        double difv = ((double) dif) / Time::getHighResolutionTicksPerSecond();
        logMessage(msg + " " + String((float) difv));
    }

    void initialise() override {
        // 16-byte alignment is enough for SSE; 32 satisfies it too.
        va = (float*) _mm_malloc(MK_VEC_SIZE * sizeof(float), 32);
        vb = (float*) _mm_malloc(MK_VEC_SIZE * sizeof(float), 32);
        vc = (float*) _mm_malloc(MK_VEC_SIZE * sizeof(float), 32);
    }

    void runTest() override {
        beginTest("Performance test");

        int reps = 1000000;
        startTime();
        for (int i = 0; i < reps; i++)
            vaddlocal(va, 1, vb, 1, vc, 1, MK_VEC_SIZE);
        endTime("vaddlocal 1 - performance test");
    }

    void shutdown() override {
        _mm_free(va);
        _mm_free(vb);
        _mm_free(vc);
    }

    // Signature mirrors vDSP_vadd; the strides IA/IB/IC are assumed to be 1.
    void vaddlocal(const float* A, long IA,
                   const float* B, long IB,
                   float* C, long IC,
                   unsigned long N)
    {
        //vDSP_vadd(A, IA, B, IB, C, IC, N);
        //return;

        // "Float" block: comment this out (down to the 'return') to test the
        // raw _mm* loop below instead of FloatVectorOperations.
        if (A == C)
            FloatVectorOperations::add(C, B, (int) N);
        else if (B == C)
            FloatVectorOperations::add(C, A, (int) N);
        else
            FloatVectorOperations::add(C, A, B, (int) N);
        return;

        // Plain SSE version: 4 floats per iteration, aligned loads/stores.
        for (unsigned long i = 0; i < N; i += 4) {
            __m128 incA = _mm_load_ps(A + i);
            __m128 incB = _mm_load_ps(B + i);
            _mm_store_ps(C + i, _mm_add_ps(incA, incB));
        }
    }
};

// Registers the test so UnitTestRunner picks it up.
static DSPTest dspTest;

#5

    startTime();

    for (int i = 0; i < 10000; i++) {
        juce::FloatVectorOperations::add(vc, va, vb, 2048);
    }

    endTime("juce:: vadd ");

    startTime();

    for (int i = 0; i < 10000; i++) {
        for (int d = 0; d < 2048; d++) {
            vc[d] = va[d] + vb[d];
        }
    }

    endTime("c:: vadd ");

My tests:

- Intel i7, Windows 7 32-bit, VC 2013, compiler optimizations disabled, aligned memory; confirmed: juce::FloatVectorOperations uses SIMD intrinsics (the debugger did stop there :)

 

Results ( in seconds ):

juce:: vadd  0.0968898

c::    vadd  0.0518656

 

(I even stored the computed values to prevent the compiler from optimizing the work away - but it made no difference)

#6

Thanks - this seems to be something we broke recently; we'll sort it out very soon!


#7

OK... try again now. It seems the compiler was refusing to optimise some references to vector types.


#8

Thanks a lot - FVO performance is now just like plain _mm* - great! It'll save me a lot of time.

Regards

mqqla