I've been trying to do some DSP calculations on Windows using FloatVectorOperations::add. On macOS it uses the Accelerate framework internally and works like a charm. On Windows, however, performance drops so much that the code is almost unusable: it is roughly 5-10 times slower than on the Mac.
I dug into the FloatVectorOperations code and found that it uses _mm* functions, so I wrote my own code using intrinsics, and its performance is quite comparable to the Mac's. What's strange is that, with all macros and functions inlined, the FloatVectorOperations code is almost as simple as mine, yet it is still several times slower than the straight _mm* solution.
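For reference, a straight _mm* add of the kind described above might look like the sketch below. It is not the code from my project or from JUCE: the name addAlignedSse is made up for illustration, and it assumes 16-byte-aligned pointers and a length that is a multiple of 4.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>  // SSE: _mm_load_ps, _mm_add_ps, _mm_store_ps

// Hypothetical helper (not part of JUCE): dest[i] = a[i] + b[i].
// Assumes all three pointers are 16-byte aligned and n is a multiple of 4.
inline void addAlignedSse(float* dest, const float* a, const float* b, std::size_t n)
{
    assert(n % 4 == 0);
    assert(reinterpret_cast<std::uintptr_t>(dest) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(a) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(b) % 16 == 0);

    // Process four floats per iteration using aligned loads and stores.
    for (std::size_t i = 0; i < n; i += 4)
    {
        const __m128 x = _mm_load_ps(a + i);
        const __m128 y = _mm_load_ps(b + i);
        _mm_store_ps(dest + i, _mm_add_ps(x, y));
    }
}
```

With buffers declared as alignas(16) float va[2048], vb[2048], vc[2048], the call would be addAlignedSse(vc, va, vb, 2048).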
All tests use aligned memory, which is all I need. JUCE_USE_SSE_INTRINSICS is set properly and the SSE2 option is enabled in the Visual Studio compiler. I'm testing a 'Release' build with the fastest optimization.
When I checked a 'Debug' build in the profiler, it seems that about a third of the time is spent in the "function body" rather than in the intrinsic functions. My code spends almost all of its time in the _mm* functions, leaving < 0.1% for the "function body", whatever that means in my case.
Tests are for 1 million calculations back and forth on vectors of 2048 floats.
That's the whole background. My question is: am I missing some JUCE or VS compiler settings needed to use FloatVectorOperations properly? Have you ever run into this problem and got the Windows version working as well as the Mac one? Maybe there are some settings on Windows I just don't know about.
My unit test follows. Please be aware that the "Float" code should be commented out to check the _mm* code. If you're able to look at this and figure out the problem... Regards
startTime();
for (int i = 0; i < 10000; i++) {
    juce::FloatVectorOperations::add(vc, va, vb, 2048);
}
endTime("juce:: vadd ");

startTime();
for (int i = 0; i < 10000; i++) {
    for (int d = 0; d < 2048; d++) {
        vc[d] = va[d] + vb[d];
    }
}
endTime("c:: vadd ");
My tests:
- Intel i7, Windows 7 32-bit, VC 2013, compiler optimization disabled, aligned memory; confirmed that juce::vector uses SIMD intrinsics (the debugger did stop there :)
Results (in seconds):
juce:: vadd 0.0968898
c:: vadd 0.0518656
(I even stored computed values to prevent further compiler optimizations - but no difference)
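The startTime()/endTime() helpers used in the test are not shown, so here is a rough self-contained sketch of the same kind of measurement. The names timeSeconds, addPlain and checksum are made up for illustration, and it assumes std::chrono::steady_clock is precise enough for runs of this length.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical helper: runs fn `iterations` times and returns the elapsed seconds.
template <typename Fn>
double timeSeconds(int iterations, Fn&& fn)
{
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        fn();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Plain C++ reference loop, equivalent to the "c:: vadd" case.
inline void addPlain(float* dest, const float* a, const float* b, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        dest[i] = a[i] + b[i];
}

// Sums the output so the result is actually used; printing or checking
// this value keeps the compiler from discarding the additions.
inline float checksum(const float* v, std::size_t n)
{
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        sum += v[i];
    return sum;
}
```

To time the JUCE path, the lambda passed to timeSeconds would call juce::FloatVectorOperations::add(vc, va, vb, 2048) instead of addPlain. Note that with compiler optimization disabled, wrapper functions around intrinsics are typically not inlined, so numbers from such a build probably say little about Release performance.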