How to organize SSE code better + Loop Unrolling?

holy-city · June 18, 2019, 4:55pm

What processor model are you running these benchmarks on?

WilliamkWusik · June 18, 2019, 7:55pm

Tested again. Here are the CPU specs; this time I also tested on my old i5.

Ryzen 7 2700 CPU (8 Core) DDR4 16 Gig
Instructions: SSE/AVX / Size of Buffer: 209715200 bytes (200.00 MB)

C++ Time: 18.3866 seconds
SSE (no loop) Time: 4.5876 seconds
SSE (with loop) Time: 4.5695 seconds
SSE JUCE SIMD (no loop) Time: 5.3207 seconds
SSE JUCE SIMD (with loop) Time: 5.1334 seconds
SSE OWN SIMD A (with loop) Time: 4.5891 seconds
SSE OWN SIMD A (no loop) Time: 4.5593 seconds
SSE OWN SIMD B (with loop) Time: 4.4583 seconds
SSE OWN SIMD B (with loop and using 'set') Time: 4.5699 seconds
SSE OWN SIMD B (no loop) Time: 4.5677 seconds
SSE OWN SIMD B (with loop, 'set' and direct math) Time: 4.5510 seconds
SSE OWN SIMD B (with loop and direct math) Time: 4.5700 seconds
AVX OWN (with loop and using 'set') Time: 2.7068 seconds
AVX OWN (with loop, 'set' and direct math) Time: 2.7069 seconds
AVX OWN (with loop and direct math) Time: 2.7174 seconds
SSE OWN (function call) Time: 4.5589 seconds
AVX OWN (function call) Time: 2.7430 seconds
SSE JUCE SIMD (with loop) (*) Again, just in case Time: 5.2521 seconds

i5 2310 CPU (4 Core) DDR3 8 Gig
Instructions: SSE/AVX / Size of Buffer: 209715200 bytes (200.00 MB)

C++ Time: 26.6445 seconds
SSE (no loop) Time: 6.9353 seconds
SSE (with loop) Time: 6.8833 seconds
SSE JUCE SIMD (no loop) Time: 8.4285 seconds
SSE JUCE SIMD (with loop) Time: 8.0831 seconds
SSE OWN SIMD A (with loop) Time: 6.9168 seconds
SSE OWN SIMD A (no loop) Time: 6.9125 seconds
SSE OWN SIMD B (with loop) Time: 6.9403 seconds
SSE OWN SIMD B (with loop and using 'set') Time: 5.8304 seconds
SSE OWN SIMD B (no loop) Time: 6.0347 seconds
SSE OWN SIMD B (with loop, 'set' and direct math) Time: 6.8188 seconds
SSE OWN SIMD B (with loop and direct math) Time: 6.8451 seconds
AVX OWN (with loop and using 'set') Time: 3.0074 seconds
AVX OWN (with loop, 'set' and direct math) Time: 3.0313 seconds
AVX OWN (with loop and direct math) Time: 3.0172 seconds
SSE OWN (function call) Time: 6.8854 seconds
AVX OWN (function call) Time: 3.0601 seconds
SSE JUCE SIMD (with loop) (*) Again, just in case Time: 8.1358 seconds

WilliamkWusik · June 18, 2019, 9:38pm

I just tested with a different code and the results are just crazy, so I don’t know what I’m doing wrong… will test more and post some code later on…

holy-city · June 18, 2019, 11:45pm

WilliamkWusik: “I just tested with a different code and the results are just crazy, so I don’t know what I’m doing wrong…”

Strongly recommend not rolling your own benchmarking code. Google Benchmark is far more reliable. The same API can be used at quick-bench.com.

DaveH · June 19, 2019, 11:35am

This will probably be a silly question, but… are you using the ‘Release’ version of the code for these timings? I’m a little suspicious of the JUCE SIMD results.

WilliamkWusik · June 19, 2019, 3:13pm

Yes, very silly. But just in case I rebuilt the whole thing, same results. This afternoon I will make some new code to check and see how it goes…

WilliamkWusik · June 19, 2019, 3:14pm

Wow, nice stuff there, will try quick-bench.com for sure, thanks.

anima · June 19, 2019, 3:39pm

If that’s true then you need to enable it explicitly: https://blogs.msdn.microsoft.com/vcblog/2016/03/30/optimizing-the-layout-of-empty-base-classes-in-vs2015-update-2-3/

pflugshaupt · June 19, 2019, 3:50pm

Yes, it’s true, but maybe I was confusing it with the Eigen lib, where the SIMD stuff suffers from this empty base class problem a lot. In any case, if you look at the assembly generated by the Microsoft C++ compiler for JUCE SIMD (and libsimdpp), you’ll quickly realize it cannot fully optimize these abstractions. Templates might also be part of the reason why.
This is of course the compiler’s fault, not JUCE’s.

Here is a fine example from simdpp, and I’ve seen the same type of wasteful instructions when I used the JUCE SIMD class. But as stated, this is limited to MSVC; everything is awesome with Clang/GCC.

reFX · June 19, 2019, 5:28pm

Well, in the meantime MSVC 2019 has come out. Have you tested the assembly output from that?

pflugshaupt · June 19, 2019, 5:46pm

No, for now I’m stuck with the 2017 version. Once I finish my current project I’ll make the switch and give it another go.

I did see that VS 2019 is in the process of getting Clang support, which would of course solve the issue and make things more consistent between platforms. Clang is already included but can’t yet be used with MSBuild; that is promised for the next larger update of VS 2019.

WilliamkWusik · June 20, 2019, 12:59am

Here we go. Finally did my crappy little code, hope this helps someone. The project now tests my ADSR Envelope code against N samples and 32 voices. SSE, AVX, and AVX with FMA3 code paths are tested.

https://www.wusik.com/download/Wusik_Tests_SSE_AVX.exe
https://www.wusik.com/download/Wusik_SSE_AVX_Tests.zip

The results on my Ryzen machine.
Instructions: SSE/AVX/FMA3 / Number Of Times: 49999999

SSE Time: 1.5358 seconds
AVX Time: 1.3715 seconds
AVX FMA3 Time: 1.2924 seconds

Still wondering what could be improved in the code and also names of functions. Some may hate the names I used, but I didn’t have any better ideas…

Cheers

WilliamkWusik · June 20, 2019, 1:00am

I’m no longer testing the JUCE SIMD stuff as it does not support AVX and FMA3. And besides, with my own lib I can add stuff as I see fit…

pflugshaupt · June 20, 2019, 5:26am

How do you plan to handle deployment for users without avx/fma3? Will you build different versions of your plugins or do some kind of runtime detection? Or will you just say your software requires fma3?

clarke · June 20, 2019, 8:34am

Do you know about the Intel IPP library?

It would be interesting to see how that compares, if you’re able to use it.

WilliamkWusik · June 20, 2019, 12:44pm

I’m using the following to check for AVX, and will also supply SSE code. Check my project file: thanks to the template I created, I don’t need to duplicate the big code, just the call.

if (SystemStats::hasAVX() && SystemStats::hasFMA3())
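
For readers who have not opened the project file, here is a rough sketch of the pattern described above: the big code is written once as a template, and only the call is chosen at runtime. It is not the code from the zip. The names `SseOps`, `AvxOps`, `applyGain` and `selectApplyGain` are hypothetical, and the inner loop is a scalar stand-in for what would be real intrinsics. In a JUCE project the two flags would presumably come from `SystemStats::hasAVX()` and `SystemStats::hasFMA3()`.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for real SIMD wrapper types. Real versions would
// wrap __m128 / __m256 and expose load, store, multiply and so on.
struct SseOps { static constexpr std::size_t lanes = 4; };
struct AvxOps { static constexpr std::size_t lanes = 8; };

// The "big code" is written once, parameterised on the ops type.
template <typename Ops>
void applyGain(float* data, std::size_t n, float gain)
{
    assert(n % Ops::lanes == 0); // sketch only: no tail handling
    for (std::size_t i = 0; i < n; i += Ops::lanes)
        for (std::size_t lane = 0; lane < Ops::lanes; ++lane)
            data[i + lane] *= gain; // scalar stand-in for one SIMD multiply
}

using GainFn = void (*)(float*, std::size_t, float);

// Pick the instantiation once (for example at plugin start-up) and keep the
// pointer, so the feature check does not run per block.
GainFn selectApplyGain(bool hasAVX, bool hasFMA3)
{
    if (hasAVX && hasFMA3)
        return &applyGain<AvxOps>;
    return &applyGain<SseOps>;
}
```

Usage would look like `auto fn = selectApplyGain(hasAVX, hasFMA3); fn(buffer, numSamples, 0.5f);`. With real intrinsics, each instantiation normally has to be compiled with the matching instruction set enabled, which is the part that compiler-level multiversioning tries to automate.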

DaveH · June 20, 2019, 1:29pm

I’m getting module path errors and missing file “WADSREnvelope.h” with this download

WilliamkWusik · June 20, 2019, 2:34pm

Oops, I will make the envelope a module and attach it tomorrow. Visiting my parents right now.

holy-city · June 20, 2019, 5:16pm

You can use function multiversioning in Clang and GCC, no idea if it’s supported in MSVC. IPP makes pretty heavy use of it iirc.

also @WilliamkWusik can you post your code as a gist or on github instead of zip files and an exe?

pflugshaupt · June 20, 2019, 5:49pm

Thanks for showing this to me. I was completely unaware and this FMV mechanism looks like it can get rid of a lot of painful dispatch code. Unfortunately, no such thing exists for MSVC. ICC has it in a fully automated way where it basically builds the whole code multiple times and dispatches on launch.

So that’s one more reason to look forward to full Clang support in Visual Studio 2019.
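
For anyone else who had not met function multiversioning before, a minimal sketch of what it can look like with GCC's `target_clones` attribute follows. The function `scaleAndOffset` and the macro `MULTIVERSIONED` are hypothetical names, and the preprocessor guard is a deliberately conservative assumption about where the attribute is safe to enable (GCC on x86-64 Linux with glibc). Everywhere else the sketch falls back to a single plain function.

```cpp
#include <cassert>
#include <cstddef>

// target_clones is a compiler extension, not standard C++. The guard below
// is an assumption, not a complete list of supported platforms.
#if defined(__GNUC__) && !defined(__clang__) && defined(__x86_64__) \
    && defined(__linux__) && defined(__GLIBC__)
  #define MULTIVERSIONED __attribute__((target_clones("avx2", "avx", "default")))
#else
  #define MULTIVERSIONED // plain single-version function
#endif

// The compiler emits one clone per listed target plus a resolver that picks
// a clone when the program is loaded; the source is written only once.
MULTIVERSIONED
void scaleAndOffset(float* data, std::size_t n, float scale, float offset)
{
    assert(data != nullptr || n == 0);
    for (std::size_t i = 0; i < n; ++i)
        data[i] = data[i] * scale + offset;
}
```

Callers just call `scaleAndOffset(buffer, numSamples, 0.5f, 1.0f)` as usual; there is no explicit `if (hasAVX)` at the call site. Whether each clone is actually vectorised well still depends on the optimisation level and is worth checking in the generated assembly.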
