FloatVectorOperations performance on Windows

mqqla · April 17, 2015, 1:25pm

Hi,

I've been trying to do some DSP calculations on Windows using FloatVectorOperations::add. On Mac OS it uses the Accelerate framework internally and works like a charm. On Windows, though, performance degrades so much that the code is almost unusable: it's something like 5-10 times slower than on the Mac.

I dug into the FloatVectorOperations code and found that it uses _mm* functions, so I wrote my own code using intrinsics, and its performance is quite comparable to the Mac's. What's strange is that, with all macros and functions inlined, the FloatVectorOperations code is almost as simple as mine, yet it's still several times slower than the straight _mm* solution.

All tests use aligned memory, which is all I need. I've got JUCE_USE_SSE_INTRINSICS set properly and the SSE2 option enabled in the Visual Studio compiler. I'm testing a 'Release' build with the fastest optimization setting.

When I checked a 'Debug' build in the profiler, it seems to spend about 1/3 of its time in the "function body" and not in the intrinsic functions. My code spends all its time in _mm* functions, leaving < 0.1% of the time for the "function body", whatever that means in my case.

The tests run 1 million calculations, back and forth, on 2048-element float vectors.

That's the whole background. My question is: am I missing some JUCE or VS compiler settings needed to use FloatVectorOperations properly? Have you ever run into this problem and got the Windows version working as well as the Mac one? Maybe there are some settings on Windows I just don't know about.

Any help would be greatly appreciated

Regards
mqqla
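
For readers following along, a hand-written aligned add of the kind mqqla describes might look roughly like the sketch below. This is not the poster's code or JUCE's implementation; `isAligned16` and `addAligned` are hypothetical names, and the sketch assumes an x86 target with SSE available and buffers allocated on 16-byte boundaries (for example via `_mm_malloc`).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>  // SSE: __m128, _mm_load_ps, _mm_add_ps, _mm_store_ps

// Hypothetical helper: true if p sits on a 16-byte boundary,
// which _mm_load_ps and _mm_store_ps require.
inline bool isAligned16(const void* p)
{
    return (reinterpret_cast<std::uintptr_t>(p) & 15u) == 0;
}

// Hypothetical helper: dest[i] = a[i] + b[i] for i in [0, n).
// Assumption: all three pointers are 16-byte aligned.
inline void addAligned(float* dest, const float* a, const float* b, std::size_t n)
{
    assert(isAligned16(dest) && isAligned16(a) && isAligned16(b));

    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)  // four floats per iteration
    {
        const __m128 va = _mm_load_ps(a + i);
        const __m128 vb = _mm_load_ps(b + i);
        _mm_store_ps(dest + i, _mm_add_ps(va, vb));
    }

    for (; i < n; ++i)          // scalar tail when n is not a multiple of 4
        dest[i] = a[i] + b[i];
}
```

Called as `addAligned(vc, va, vb, 2048)`, this does the same work as the three-argument `FloatVectorOperations::add(vc, va, vb, 2048)`, which makes it a convenient baseline for timing.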

giku · April 17, 2015, 4:05pm

- USE_INTRINSICS turned On

- 1000 * 2048 random buffer test

- Almost 2x slower than regular cpp (no compiler opt)

- Checked in the debugger: SSE functions are used, but are you sure these functions are actually inlined?

jules · April 17, 2015, 4:10pm

That's odd!

Do you have some benchmark code we can try, so we can reproduce the same thing you're seeing?

mqqla · April 18, 2015, 8:29am

Hello Jules,

My UnitTest follows. Please be aware that the "Float" code (the FloatVectorOperations::add calls and the return after them) has to be commented out to test the _mm* code. If you're able to look at this and figure out the problem... Regards

mqqla



#include "JuceHeader.h"
#include "emmintrin.h"
//#include <Accelerate/Accelerate.h>
#define MK_VEC_SIZE 2048
class DSPTest : public UnitTest
{
    int64 _sTime;
    
    float *va, *vb, *vc;
    
public:
    
    void startTime() {
        _sTime = Time::getHighResolutionTicks();
    }
    
    void endTime(String msg) {
        int64 dif = Time::getHighResolutionTicks()-_sTime;
        
        double difv = ((double)dif) / Time::getHighResolutionTicksPerSecond();
        
        logMessage(msg + " " + String((float)difv));
    }
    DSPTest() : UnitTest("DSPTest testing") {}
    void initialise() {
        va = (float*)_mm_malloc(MK_VEC_SIZE * sizeof(float), 32);
        vb = (float*)_mm_malloc(MK_VEC_SIZE * sizeof(float), 32);
        vc = (float*)_mm_malloc(MK_VEC_SIZE * sizeof(float), 32);
    }
    void runTest() {
        
        beginTest("Performance test");
        int reps = 1000000;
        startTime();
        for(int i = 0; i < reps; i++) {
            vaddlocal(va, 1, vb, 1, vc, 1, MK_VEC_SIZE);
        }
        endTime("vaddlocal 1 - performance test");
        
    }
    
    void shutdown() {
        _mm_free(va);
        _mm_free(vb);
        _mm_free(vc);
    }
    
    void vaddlocal(const float *A,
                   long  IA,
                   const float *B,
                   long  IB,
                   float       *C,
                   long  IC,
                   unsigned long  N) {
        
        //vDSP_vadd(A, IA, B, IB, C, IC, N);
        //return;

        // JUCE path: comment out this block (including the return)
        // to time the _mm* code below instead.
        if (A == C) {
            FloatVectorOperations::add(C, B, N);
        } else if (B == C) {
            FloatVectorOperations::add(C, A, N);
        } else {
            FloatVectorOperations::add(C, A, B, N);
        }
        return;
        
        // Hand-written SSE path: needs 16-byte-aligned pointers
        // and N to be a multiple of 4.
        __m128 acc = _mm_set1_ps(0.0);
        
        __m128 incA;
        __m128 incB;
        for(unsigned long i = 0; i < N; i+= 4) {
            incA = _mm_load_ps(A + i);
            incB = _mm_load_ps(B + i);
            acc = _mm_add_ps(incA, incB);
            
            _mm_store_ps(C + i, acc);
        }
        return;
    }
};

giku · April 20, 2015, 7:36am

        startTime();

        for (int i = 0; i < 10000; i++) {
            juce::FloatVectorOperations::add(vc, va, vb, 2048);
        }

        endTime("juce:: vadd ");

        startTime();

        for (int i = 0; i < 10000; i++) {
            for (int d = 0; d < 2048; d++) {
                vc[d] = va[d] + vb[d];
            }
        }

        endTime("c:: vadd ");

My tests:

- Intel i7, Windows 7 32-bit, VC 2013, compiler opt disabled, aligned memory, confirmed: juce::vector uses SIMD intrinsics (debugger did stop there :)

Results (in seconds):

juce:: vadd 0.0968898

c:: vadd 0.0518656

(I even stored computed values to prevent further compiler optimizations - but no difference)
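
A comparison like this can also be reproduced outside JUCE. The sketch below is one possible harness, not the code used in this thread: `timeSeconds`, `addScalar` and `checksum` are hypothetical names, timing uses `std::chrono::steady_clock` rather than JUCE's `Time` class, and the buffers are assumed to be filled with known values so the checksum is predictable. Summing the output afterwards is one way of "storing computed values" so that an optimising compiler cannot discard the loop.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: runs op() `reps` times and returns the elapsed seconds.
template <typename Op>
double timeSeconds(int reps, Op op)
{
    const auto start = std::chrono::steady_clock::now();

    for (int i = 0; i < reps; ++i)
        op();

    const std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
    return elapsed.count();
}

// Plain scalar add, the "regular cpp" baseline: dest[i] = a[i] + b[i].
inline void addScalar(float* dest, const float* a, const float* b, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        dest[i] = a[i] + b[i];
}

// Sums the output so the result is actually used after timing.
inline double checksum(const std::vector<float>& v)
{
    double sum = 0.0;

    for (float x : v)
        sum += x;

    return sum;
}
```

A call such as `timeSeconds(10000, [&] { addScalar(vc.data(), va.data(), vb.data(), vc.size()); })`, followed by logging `checksum(vc)`, would mirror the scalar loop above; swapping the lambda body for the `juce::FloatVectorOperations::add` call gives the other half of the comparison. Timings from builds with optimisation disabled may be dominated by call overhead, so it is probably worth repeating the measurement in an optimised build too.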

jules · April 20, 2015, 10:39am

Thanks - seems to be something that we broke recently, will sort it out very soon!

jules · April 20, 2015, 10:51am

Ok.. try again now. Seems like the compiler was refusing to optimise some references to vector types.

mqqla · April 20, 2015, 12:23pm

Thanks a lot - FVO performance is just like plain _mm* now - great! It'll save me a lot of time.

Regards

mqqla
