How to organize SSE code better + Loop Unrolling?

What processor model are you running these benchmarks on?

Tested again. Here are the CPU specs; this time I also tested on my old i5.

Ryzen 7 2700 CPU  (8 Core) DDR4 16 Gig 
Instructions: SSE/AVX / Size of Buffer: 209715200 bytes (200.00 MB)

C++ Time: 18.3866 seconds
SSE (no loop) Time: 4.5876 seconds
SSE (with loop) Time: 4.5695 seconds
SSE JUCE SIMD (no loop) Time: 5.3207 seconds
SSE JUCE SIMD (with loop) Time: 5.1334 seconds
SSE OWN SIMD A (with loop) Time: 4.5891 seconds
SSE OWN SIMD A (no loop) Time: 4.5593 seconds
SSE OWN SIMD B (with loop) Time: 4.4583 seconds
SSE OWN SIMD B (with loop and using 'set') Time: 4.5699 seconds
SSE OWN SIMD B (no loop) Time: 4.5677 seconds
SSE OWN SIMD B (with loop, 'set' and direct math) Time: 4.5510 seconds
SSE OWN SIMD B (with loop and direct math) Time: 4.5700 seconds
AVX OWN (with loop and using 'set') Time: 2.7068 seconds
AVX OWN (with loop, 'set' and direct math) Time: 2.7069 seconds
AVX OWN (with loop and direct math) Time: 2.7174 seconds
SSE OWN (function call) Time: 4.5589 seconds
AVX OWN (function call) Time: 2.7430 seconds
SSE JUCE SIMD (with loop) (*) Again, just in case Time: 5.2521 seconds

i5 2310 CPU (4 Core) DDR3 8 Gig
Instructions: SSE/AVX / Size of Buffer: 209715200 bytes (200.00 MB)

C++ Time: 26.6445 seconds
SSE (no loop) Time: 6.9353 seconds
SSE (with loop) Time: 6.8833 seconds
SSE JUCE SIMD (no loop) Time: 8.4285 seconds
SSE JUCE SIMD (with loop) Time: 8.0831 seconds
SSE OWN SIMD A (with loop) Time: 6.9168 seconds
SSE OWN SIMD A (no loop) Time: 6.9125 seconds
SSE OWN SIMD B (with loop) Time: 6.9403 seconds
SSE OWN SIMD B (with loop and using 'set') Time: 5.8304 seconds
SSE OWN SIMD B (no loop) Time: 6.0347 seconds
SSE OWN SIMD B (with loop, 'set' and direct math) Time: 6.8188 seconds
SSE OWN SIMD B (with loop and direct math) Time: 6.8451 seconds
AVX OWN (with loop and using 'set') Time: 3.0074 seconds
AVX OWN (with loop, 'set' and direct math) Time: 3.0313 seconds
AVX OWN (with loop and direct math) Time: 3.0172 seconds
SSE OWN (function call) Time: 6.8854 seconds
AVX OWN (function call) Time: 3.0601 seconds
SSE JUCE SIMD (with loop) (*) Again, just in case Time: 8.1358 seconds

I just tested with a different code and the results are just crazy, so I don't know what I'm doing wrong... will test more and post some code later on...

Strongly recommend not rolling your own benchmarking code. Google Bench is far more reliable. The same API can be used at quick-bench.com.

This will probably be a silly question, but... are you using the 'Release' version of the code for these timings? I'm a little suspicious of the JUCE SIMD results.

Yes, very silly. :wink: But just in case I rebuilt the whole thing, same results. This afternoon I will make some new code to check and see how it goes...

Wow, nice stuff there, will try quick-bench.com for sure, thanks. :slight_smile:

If that's true then you need to enable it explicitly: https://blogs.msdn.microsoft.com/vcblog/2016/03/30/optimizing-the-layout-of-empty-base-classes-in-vs2015-update-2-3/

Yes, it's true, but maybe I was confusing it with the Eigen lib, where the SIMD stuff suffers from this empty base class problem a lot. In any case, if you look at the assembly generated by the Microsoft C++ compiler for JUCE SIMD (& libsimdpp), you'll quickly realize it cannot fully optimize these abstractions. Templates might also be part of the reason why.
This is of course the compiler's fault, not JUCE's.

Here is a fine example from simdpp, and I've seen the same type of wasteful instructions when I used the JUCE SIMD class. But as stated, this is limited to MSVC; everything is awesome with clang/gcc.

Well, in the meantime MSVC 2019 has come out. Have you tested the assembly output from that?

No, for now I'm stuck with the 2017 version. Once I finish my current project I'll make the switch and give it another go.

I did see that VS 2019 is in the process of getting support for Clang, and this would of course solve the issue and make things more consistent between platforms. Clang is already included but can't yet be used with MSBuild; that is promised for the next larger update of VS 2019.

Here we go. Finally did my crappy little code, hope this helps someone. Now the project tests my ADSR Envelope code against N samples, and 32 voices. SSE, AVX and AVX with FMA3 codes tested.

https://www.wusik.com/download/Wusik_Tests_SSE_AVX.exe
https://www.wusik.com/download/Wusik_SSE_AVX_Tests.zip

The results on my Ryzen machine.
Instructions: SSE/AVX/FMA3 / Number Of Times: 49999999

SSE Time: 1.5358 seconds
AVX Time: 1.3715 seconds
AVX FMA3 Time: 1.2924 seconds

Still wondering what could be improved in the code and also names of functions. Some may hate the names I used, but I didn't have any better ideas...

Cheers

I'm no longer testing the JUCE SIMD stuff as it does not support AVX and FMA3. :frowning: And still, doing my own lib I can add stuff as I see...

How do you plan to handle deployment for users without avx/fma3? Will you build different versions of your plugins or do some kind of runtime detection? Or will you just say your software requires fma3?

Do you know about the Intel IPP library?

It would be interesting to see how that compares, if you're able to use it.

I'm using the following to check for AVX, and will also supply SSE code. Check my project file: thanks to the template I created, I don't need to duplicate the big code, just the call.

if (SystemStats::hasAVX() && SystemStats::hasFMA3())
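For anyone who has not opened the project file, here is a minimal sketch of how a "write the big code once, duplicate only the call" template can look. This is not the code from the download: the names `ScalarOps`, `SseOps`, `envelopeStep` and `chooseStep` are made up for illustration, the kernel is a simple multiply-add per voice rather than a full ADSR, and the sketch uses an SSE backend with a scalar fallback so it builds without extra compiler flags. An AVX/FMA3 backend would follow the same shape, and the `bool` would come from a check like the one above.

```cpp
#include <cstddef>

// SSE is only used when the compiler targets it; otherwise the sketch
// falls back to the scalar backend alone.
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
  #include <xmmintrin.h>
  #define SKETCH_HAS_SSE 1
#endif

// Hypothetical backends: each one exposes the same small interface.
struct ScalarOps
{
    using Vec = float;
    static constexpr std::size_t width = 1;
    static Vec  load (const float* p)        { return *p; }
    static void store (float* p, Vec v)      { *p = v; }
    static Vec  broadcast (float x)          { return x; }
    static Vec  mulAdd (Vec a, Vec b, Vec c) { return a * b + c; }
};

#ifdef SKETCH_HAS_SSE
struct SseOps
{
    using Vec = __m128;
    static constexpr std::size_t width = 4;
    static Vec  load (const float* p)        { return _mm_loadu_ps (p); }
    static void store (float* p, Vec v)      { _mm_storeu_ps (p, v); }
    static Vec  broadcast (float x)          { return _mm_set1_ps (x); }
    static Vec  mulAdd (Vec a, Vec b, Vec c) { return _mm_add_ps (_mm_mul_ps (a, b), c); }
};
#endif

// The "big code", written once: level = level * coeff + offset for every voice.
template <typename Ops>
void envelopeStep (float* levels, std::size_t numVoices, float coeff, float offset)
{
    const auto c = Ops::broadcast (coeff);
    const auto o = Ops::broadcast (offset);

    std::size_t i = 0;
    for (; i + Ops::width <= numVoices; i += Ops::width)
        Ops::store (levels + i, Ops::mulAdd (Ops::load (levels + i), c, o));

    // Scalar tail for voice counts that are not a multiple of the vector width.
    for (; i < numVoices; ++i)
        levels[i] = levels[i] * coeff + offset;
}

using StepFn = void (*) (float*, std::size_t, float, float);

// Only the call is duplicated: pick one instantiation at startup.
StepFn chooseStep (bool cpuHasSse)
{
#ifdef SKETCH_HAS_SSE
    if (cpuHasSse)
        return &envelopeStep<SseOps>;
#else
    (void) cpuHasSse;
#endif
    return &envelopeStep<ScalarOps>;
}
```

Choosing the function pointer once (for example when the plugin is created) keeps the feature check out of the per-sample path; after that every call goes through the pointer.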

I'm getting module path errors and a missing file "WADSREnvelope.h" with this download

Oops, I will make the envelope a module and attach it tomorrow. Visiting my parents right now. :slight_smile:

You can use function multiversioning in Clang and GCC, no idea if it's supported in MSVC. IPP makes pretty heavy use of it iirc.
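A rough sketch of what that looks like, assuming GCC or a recent Clang on x86-64 Linux with glibc, where the `target_clones` attribute is available. The function `applyGain` and the macro `SKETCH_FMV` are made-up names for illustration; on any other toolchain the macro expands to nothing and you get a single ordinary function.

```cpp
#include <cstddef>

// Enable multiversioning only where the attribute and the loader support it
// (assumption: x86-64 Linux with glibc). Everywhere else SKETCH_FMV is empty.
#if defined(__has_attribute)
  #if __has_attribute(target_clones) && defined(__x86_64__) && defined(__linux__) && defined(__GLIBC__)
    #define SKETCH_FMV __attribute__((target_clones("avx2", "default")))
  #endif
#endif
#ifndef SKETCH_FMV
  #define SKETCH_FMV
#endif

// The compiler emits one clone per listed target plus a resolver that
// picks the best clone for the CPU the program is running on.
SKETCH_FMV
void applyGain (float* data, std::size_t n, float gain)
{
    for (std::size_t i = 0; i < n; ++i)
        data[i] *= gain;
}
```

Callers just call `applyGain` as usual; the dispatch happens when the symbol is resolved at load time, so there is no hand-written `if (hasAVX())` at the call site.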

also @WilliamkWusik can you post your code as a gist or on github instead of zip files and an exe?

Thanks for showing this to me. I was completely unaware, and this FMV mechanism looks like it can get rid of a lot of painful dispatch code. Unfortunately, no such thing exists for MSVC. ICC has it in a fully automated way, where it basically builds the whole code multiple times and dispatches on launch.

So that's one more reason to look forward to full Clang support in Visual Studio 2019.
