I have created this Envelope with only SSE code, but I wonder if there is a way to organize it better, as it does get confusing to read as things pile up. Another thing is loop unrolling. Is there a cross-platform way to do that, or some other way to handle it? Or should I not bother with it at all?
Imo it’s not worth it. Even the 256-bit vectors are emulated on most consumer processors, effectively halving the clock rate for AVX2 instructions. I recently benchmarked a bunch of DSP code on a machine that “supported” AVX2 and AVX-512 (iirc it was a 7th gen i7?), and both performed significantly worse than AVX1/SSE4.
Just for reference, no (consumer) AMD processor currently on the market supports true 256-bit vector instructions. The new Ryzen 3000 series releasing in July will support them, but I think it’s unclear whether that applies to the entire line or just the higher-end models. Unsure about Threadripper, but how many of your users own one of those?
Run 1
C++ Time: 7.5106 seconds
SSE (no loop) Time: 1.9138 seconds
SSE (with loop) Time: 1.9194 seconds
SSE JUCE SIMD (no loop) Time: 2.2355 seconds
SSE JUCE SIMD (with loop) Time: 1.9991 seconds
SSE OWN SIMD A (with loop) Time: 1.9048 seconds
SSE OWN SIMD A (no loop) Time: 1.8998 seconds
SSE OWN SIMD B (with loop) Time: 1.9165 seconds
SSE OWN SIMD B (with loop and using 'set') Time: 1.9032 seconds
SSE OWN SIMD B (no loop) Time: 1.8782 seconds
Run 2
C++ Time: 7.2912 seconds
SSE (no loop) Time: 1.8747 seconds
SSE (with loop) Time: 1.8718 seconds
SSE JUCE SIMD (no loop) Time: 2.1755 seconds
SSE JUCE SIMD (with loop) Time: 1.9490 seconds
SSE OWN SIMD A (with loop) Time: 1.8627 seconds
SSE OWN SIMD A (no loop) Time: 1.8628 seconds
SSE OWN SIMD B (with loop) Time: 1.8621 seconds
SSE OWN SIMD B (with loop and using 'set') Time: 1.8729 seconds
SSE OWN SIMD B (no loop) Time: 1.8582 seconds
Run 3
C++ Time: 7.2355 seconds
SSE (no loop) Time: 1.8678 seconds
SSE (with loop) Time: 1.8644 seconds
SSE JUCE SIMD (no loop) Time: 2.1815 seconds
SSE JUCE SIMD (with loop) Time: 1.9499 seconds
SSE OWN SIMD A (with loop) Time: 1.8675 seconds
SSE OWN SIMD A (no loop) Time: 1.8675 seconds
SSE OWN SIMD B (with loop) Time: 1.8656 seconds
SSE OWN SIMD B (with loop and using 'set') Time: 1.8695 seconds
SSE OWN SIMD B (no loop) Time: 1.8662 seconds
Interesting that you’re pretty consistently beating JUCE’s SIMD by 20%. You might want to try something like Google Benchmark, which warms up your cache for you.
I just improved the code; it runs better now. Will upload it next. Here are the current stats, now using SSE and also AVX (1).
SSE/AVX
C++ Time: 7.2970 seconds
SSE (no loop) Time: 1.8506 seconds
SSE (with loop) Time: 1.8599 seconds
SSE JUCE SIMD (no loop) Time: 2.1543 seconds
SSE JUCE SIMD (with loop) Time: 1.9404 seconds
SSE OWN SIMD A (with loop) Time: 1.8634 seconds
SSE OWN SIMD A (no loop) Time: 1.8667 seconds
SSE OWN SIMD B (with loop) Time: 1.8616 seconds
SSE OWN SIMD B (with loop and using 'set') Time: 1.8656 seconds
SSE OWN SIMD B (no loop) Time: 1.8617 seconds
SSE OWN SIMD B (with loop, 'set' and direct math) Time: 1.8643 seconds
SSE OWN SIMD B (with loop and direct math) Time: 1.8662 seconds
SSE OWN AVX (with loop and using 'set') Time: 1.1071 seconds
SSE OWN AVX (with loop, 'set' and direct math) Time: 1.1092 seconds
SSE OWN AVX (with loop and direct math) Time: 1.1071 seconds
Basic SSE is very simple, and it’s easy to add the rest. All thanks to the guy who started this. I tried to contact him, but so far no response, as I want to credit him for starting this up…
So far it seems to work great, so I will just make this a JUCE module and add some other stuff that I will use. I won’t add everything, only the things I use most.
Something to keep in mind when doing SIMD with Visual Studio C++ is that it has trouble fully optimizing various SIMD wrappers. I believe it’s called the “empty base class problem” or similar; it means that as soon as a class is used to wrap SIMD types, that class can never be fully optimized away. I think that is why you are able to beat JUCE’s SIMD by 20%. On macOS or Linux with clang and gcc, I think the results would be a lot closer.
The only way to see what’s going on is looking at the compiled assembly.
For this reason I wrote helpers that don’t use a class, but just add operators directly to the Windows intrinsic types.
On clang and gcc, built-in simd types already have operators, so most of the intrinsics are not necessary there.
Instructions: SSE/AVX / Size of Buffer: 209715200 bytes (200.00 MB)
C++ Time: 18.2475 seconds
SSE (no loop) Time: 4.6734 seconds
SSE (with loop) Time: 4.6704 seconds
SSE JUCE SIMD (no loop) Time: 5.4686 seconds
SSE JUCE SIMD (with loop) Time: 5.3563 seconds
SSE OWN SIMD A (with loop) Time: 4.6749 seconds
SSE OWN SIMD A (no loop) Time: 4.6605 seconds
SSE OWN SIMD B (with loop) Time: 4.6732 seconds
SSE OWN SIMD B (with loop and using 'set') Time: 4.6696 seconds
SSE OWN SIMD B (no loop) Time: 4.6525 seconds
SSE OWN SIMD B (with loop, 'set' and direct math) Time: 4.6778 seconds
SSE OWN SIMD B (with loop and direct math) Time: 4.6727 seconds
AVX OWN (with loop and using 'set') Time: 2.7665 seconds
AVX OWN (with loop, 'set' and direct math) Time: 2.7718 seconds
AVX OWN (with loop and direct math) Time: 2.7692 seconds
SSE OWN (function call) Time: 4.6859 seconds
AVX OWN (function call) Time: 2.7908 seconds
SSE JUCE SIMD (with loop) (*) Again, just in case Time: 5.3515 seconds