How to organize SSE code better + Loop Unrolling?

I have created this Envelope with only SSE code, but I wonder if there is a way to organize it better, as it does get confusing to read once things pile up. Another thing is loop unrolling: is there a cross-platform way to do that, or some other way to handle it? Or shouldn’t I bother with it at all?

Here’s the current project’s code.
https://www.wusik.com/download/Wusik_ZR_002.zip

You can check my videos where I explain this live on Twitch.

Here’s a snippet of the ADSR envelope SSE processing code.

const static __m128 xNum0 = _mm_setzero_ps();
const static __m128 xNum1 = _mm_set1_ps(1.0f);
const static __m128 xNum2 = _mm_set1_ps(2.0f);
const static __m128 xNum10 = _mm_set1_ps(10.0f);

for (int xvoice = 0; xvoice < MAX_INTERNAL_VOICES; xvoice += 4)
{
	__m128 xValue = _mm_load_ps(ADSREnvelope.value + xvoice);
	__m128 xPosition = _mm_load_ps(ADSREnvelope.position + xvoice);
	__m128 xCurve = _mm_set1_ps(valuesList[kADSR_Curve]);
	//
	// Pick each voice's rate by stage: cmpeq produces an all-ones mask where the
	// stage matches, so the and+add sequence selects the attack or decay/release rate.
	__m128 xAllRates = _mm_and_ps(_mm_cmpeq_ps(xPosition, ADSR_ATTACK_SSE), _mm_load_ps(ADSREnvelope.rate[WusikADSREnvelope::kRate_Attack] + xvoice));
	xAllRates = _mm_add_ps(xAllRates, _mm_and_ps(_mm_cmpeq_ps(xPosition, ADSR_DECAY_RELEASE_SSE), _mm_load_ps(ADSREnvelope.rate[WusikADSREnvelope::kRate_DecayRelease] + xvoice)));
	xValue = _mm_add_ps(xValue, xAllRates);
	// Advance the stage by 1 for voices whose value crossed 1.0.
	xPosition = _mm_add_ps(xPosition, _mm_and_ps(_mm_cmpgt_ps(xValue, xNum1), xNum1));
	// Clamp to 1.0 above, and floor at the sustain level for voices in stage 1.
	xValue = _mm_min_ps(xNum1, _mm_max_ps(xValue, _mm_and_ps(_mm_cmpeq_ps(xPosition, xNum1), _mm_load_ps(ADSREnvelope.sustain + xvoice))));
	//
	_mm_store_ps(ADSREnvelope.value + xvoice, xValue);
	_mm_store_ps(ADSREnvelope.position + xvoice, xPosition);
	//
	// Apply velocity, with the value scaled by Clip * 10 + 1 and clipped at 1.0.
	xValue = _mm_mul_ps(_mm_load_ps(ADSREnvelope.velocity + xvoice),
		_mm_min_ps(xNum1, _mm_mul_ps(
			xValue,
			_mm_add_ps(_mm_mul_ps(_mm_set1_ps(valuesList[kADSR_Clip]), xNum10), xNum1))));
	//
	// Blend linear and x^4 responses by the Curve amount: out = v^4 * c + v * (1 - c).
	_mm_store_ps(ADSREnvelope.output + xvoice,
		_mm_add_ps(_mm_mul_ps(_mm_mul_ps(_mm_mul_ps(xValue, xValue), _mm_mul_ps(xValue, xValue)), xCurve), _mm_mul_ps(xValue, _mm_sub_ps(xNum1, xCurve))));
}

Cheers, WilliamK


The SIMDRegister class can help here: it has a constexpr size() method, so your loop unroll could look like this:


template <class T>
inline void adsr_loop()
{
    using Vec = juce::dsp::SIMDRegister<T>;

    for (int xvoice = 0; xvoice < MAX_INTERNAL_VOICES; xvoice += (int) Vec::size())
    {
        auto xValue = Vec::fromRawArray(ADSREnvelope.value + xvoice);
        auto xPosition = Vec::fromRawArray(ADSREnvelope.position + xvoice);
        // ... etc
    }
}
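
For what it’s worth, the arithmetic part of the envelope maps onto SIMDRegister directly too. A minimal sketch of the output stage, assuming juce::dsp::SIMDRegister<float>’s expand/fromRawArray/min helpers (double-check the exact names in your JUCE version):

    using Vec = juce::dsp::SIMDRegister<float>;

    const auto one   = Vec::expand(1.0f);
    const auto curve = Vec::expand(valuesList[kADSR_Curve]);

    // value * velocity, clipped at 1.0
    auto xValue = Vec::min(one, Vec::fromRawArray(ADSREnvelope.value + xvoice)
                              * Vec::fromRawArray(ADSREnvelope.velocity + xvoice));

    // out = value^4 * curve + value * (1 - curve), same math as the intrinsics version
    auto v2 = xValue * xValue;
    ((v2 * v2) * curve + xValue * (one - curve)).copyToRawArray(ADSREnvelope.output + xvoice);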

Thanks, will check that one out. 🙂

I’m really loving SIMDRegister, looks good so far. But I wonder about AVX 8*float support…

IMO it’s not worth it. Even the 256-bit vectors are emulated on most consumer processors, effectively halving the throughput of AVX2 instructions. I recently benchmarked a bunch of DSP on a machine that “supported” AVX2 and AVX-512 (IIRC it’s a 7th-gen i7?), and both performed significantly worse than AVX1/SSE4.

Just for reference, no (consumer) AMD processor currently on the market supports true 256-bit vector instructions. The new Ryzen 3000 series that releases in July will support them, but I think it’s unclear whether that applies to the entire line or just the higher-end models. I’m unsure about Threadripper, but how many of your users own one of those?


Got it, thanks. 🙂

I just created a small tool to benchmark some of this stuff; I’ll post it in this thread soon.

Can you post the benchmark code too? I’d be happy to run it on my machines to get more data points; I’m sure others around here would help out too.

Here we go.

https://www.wusik.com/download/SSE_Benchmarking.zip

I’ve got a Ryzen CPU; I’ll try it on my older i5 too. 🙂


Using #define X_SIZE (1024*1024*80)

Run 1
C++ Time: 7.5106 seconds
SSE (no loop) Time: 1.9138 seconds
SSE (with loop) Time: 1.9194 seconds
SSE JUCE SIMD (no loop) Time: 2.2355 seconds
SSE JUCE SIMD (with loop) Time: 1.9991 seconds
SSE OWN SIMD A (with loop) Time: 1.9048 seconds
SSE OWN SIMD A (no loop) Time: 1.8998 seconds
SSE OWN SIMD B (with loop) Time: 1.9165 seconds
SSE OWN SIMD B (with loop and using 'set') Time: 1.9032 seconds
SSE OWN SIMD B (no loop) Time: 1.8782 seconds

Run 2
C++ Time: 7.2912 seconds
SSE (no loop) Time: 1.8747 seconds
SSE (with loop) Time: 1.8718 seconds
SSE JUCE SIMD (no loop) Time: 2.1755 seconds
SSE JUCE SIMD (with loop) Time: 1.9490 seconds
SSE OWN SIMD A (with loop) Time: 1.8627 seconds
SSE OWN SIMD A (no loop) Time: 1.8628 seconds
SSE OWN SIMD B (with loop) Time: 1.8621 seconds
SSE OWN SIMD B (with loop and using 'set') Time: 1.8729 seconds
SSE OWN SIMD B (no loop) Time: 1.8582 seconds

Run 3
C++ Time: 7.2355 seconds
SSE (no loop) Time: 1.8678 seconds
SSE (with loop) Time: 1.8644 seconds
SSE JUCE SIMD (no loop) Time: 2.1815 seconds
SSE JUCE SIMD (with loop) Time: 1.9499 seconds
SSE OWN SIMD A (with loop) Time: 1.8675 seconds
SSE OWN SIMD A (no loop) Time: 1.8675 seconds
SSE OWN SIMD B (with loop) Time: 1.8656 seconds
SSE OWN SIMD B (with loop and using 'set') Time: 1.8695 seconds
SSE OWN SIMD B (no loop) Time: 1.8662 seconds

I just added AVX support for 8 floats instead of 4, and it sped things up by nearly 40% :-o

SSE OWN SIMD B (with loop and using 'set') Time: 1.8574 seconds
SSE OWN AVX (with loop and using 'set') Time: 1.1289 seconds

Interesting that you’re pretty consistently beating JUCE’s SIMD by 20%. You might want to try something like Google Benchmark, which warms up the cache for you.
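
For example, a minimal Google Benchmark harness looks roughly like this (a sketch; the kernel below is a stand-in for whatever your tool actually measures):

#include <benchmark/benchmark.h>
#include <xmmintrin.h>
#include <vector>

static void BM_SSE(benchmark::State& state)
{
	std::vector<float> data(1024 * 1024, 0.5f);

	// Google Benchmark decides the iteration count and warms the cache for you.
	for (auto _ : state)
	{
		for (size_t i = 0; i < data.size(); i += 4)
		{
			__m128 v = _mm_loadu_ps(data.data() + i);
			v = _mm_mul_ps(_mm_add_ps(v, _mm_set1_ps(1.0f)), v);
			_mm_storeu_ps(data.data() + i, v);
		}
		benchmark::DoNotOptimize(data.data());
	}
}

BENCHMARK(BM_SSE);
BENCHMARK_MAIN();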

I just improved the code; it runs better now, and I’ll upload it next. Here are the current stats, now using SSE and also AVX (1).

SSE/AVX

C++ Time: 7.2970 seconds
SSE (no loop) Time: 1.8506 seconds
SSE (with loop) Time: 1.8599 seconds
SSE JUCE SIMD (no loop) Time: 2.1543 seconds
SSE JUCE SIMD (with loop) Time: 1.9404 seconds
SSE OWN SIMD A (with loop) Time: 1.8634 seconds
SSE OWN SIMD A (no loop) Time: 1.8667 seconds
SSE OWN SIMD B (with loop) Time: 1.8616 seconds
SSE OWN SIMD B (with loop and using 'set') Time: 1.8656 seconds
SSE OWN SIMD B (no loop) Time: 1.8617 seconds
SSE OWN SIMD B (with loop, 'set' and direct math) Time: 1.8643 seconds
SSE OWN SIMD B (with loop and direct math) Time: 1.8662 seconds
SSE OWN AVX (with loop and using 'set') Time: 1.1071 seconds
SSE OWN AVX (with loop, 'set' and direct math) Time: 1.1092 seconds
SSE OWN AVX (with loop and direct math) Time: 1.1071 seconds

Basic SSE is very simple, and it’s easy to add the rest. 🙂 All thanks to the guy who started this. I tried to contact him, but so far no response, as I want to credit him for starting this up…

// Thin wrapper around __m128: broadcast/load constructors, store, and the arithmetic operators below.
JUCE_ALIGN(16) class sse4
{
public:
	__m128 v;
	//
	forcedinline sse4(float x) : v(_mm_set1_ps(x)) { };
	forcedinline sse4(float *px) : v(_mm_load_ps(px)) { };
	forcedinline sse4(__m128 v) : v(v) { };
	forcedinline void write(float* target) { _mm_store_ps(target, v); };
	forcedinline void set(sse4 value) { v = value.v; };
	forcedinline void operator = (sse4& _v2) { v = _v2.v; };
	forcedinline void operator = (float* _v2) { v = _mm_load_ps(_v2); }
};

forcedinline sse4 operator + (const sse4 &l, const sse4 &r) { return sse4(_mm_add_ps(l.v, r.v)); }
forcedinline sse4 operator - (const sse4 &l, const sse4 &r) { return sse4(_mm_sub_ps(l.v, r.v)); }
forcedinline sse4 operator * (const sse4 &l, const sse4 &r) { return sse4(_mm_mul_ps(l.v, r.v)); }
forcedinline sse4 operator / (const sse4 &l, const sse4 &r) { return sse4(_mm_div_ps(l.v, r.v)); }
forcedinline sse4 operator + (const sse4 &l, const float &r) { return sse4(_mm_add_ps(l.v, _mm_set1_ps(r))); }
forcedinline sse4 operator - (const sse4 &l, const float &r) { return sse4(_mm_sub_ps(l.v, _mm_set1_ps(r))); }
forcedinline sse4 operator * (const sse4 &l, const float &r) { return sse4(_mm_mul_ps(l.v, _mm_set1_ps(r))); }
forcedinline sse4 operator / (const sse4 &l, const float &r) { return sse4(_mm_div_ps(l.v, _mm_set1_ps(r))); }
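
With those operators, the envelope math from the first post reads a lot more naturally. A sketch of the output stage using the wrapper above (same names as the earlier snippet):

	sse4 xValue(ADSREnvelope.value + xvoice);
	sse4 curve(valuesList[kADSR_Curve]);

	// out = value^4 * curve + value * (1 - curve), as in the intrinsics version
	sse4 v4 = (xValue * xValue) * (xValue * xValue);
	sse4 out = v4 * curve + xValue * (sse4(1.0f) - curve);
	out.write(ADSREnvelope.output + xvoice);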

Here’s the latest code.

https://www.wusik.com/download/SSE_Benchmarking_sse_avx.zip

So far it seems to work great, so I will just make this a JUCE module and add some other stuff that I use. I won’t add everything, only the stuff I use most.

Something to keep in mind when doing SIMD with Visual Studio C++ is that it has trouble fully optimizing various SIMD wrappers. I believe it’s called the “empty base class problem” or something similar; it means that as soon as a class is used to wrap SIMD types, that class can never be fully optimized away. I think that is why you are able to beat JUCE’s SIMD by 20%. On OSX or Linux with Clang and GCC, the results would be a lot closer.
The only way to see what’s going on is to look at the compiled assembly.
For this reason I wrote helpers that don’t use a class, but just add operators to the raw intrinsic types on Windows.
On Clang and GCC, the built-in SIMD types already have operators, so most of the intrinsics are not necessary there.
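
For example, with the GCC/Clang vector extensions the arithmetic can be written directly on the vector type; a sketch (the typedef and function names are made up):

typedef float v4sf __attribute__((vector_size(16)));

v4sf compute(v4sf x)
{
	// No intrinsics needed: +, - and * work element-wise on vector types,
	// and scalar operands are broadcast automatically.
	return ((x + 1.0f) * x - 0.001f) * 2.82f;
}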

I just created a function so I can use the same code for SSE and AVX. It seems to work OK without adding too much overhead.

template <class T>
forcedinline void Compute(T& sTestData)
{
	sTestData = (((sTestData + 1.0f) * sTestData) - 0.001f) * 2.82f;
}

and then I use

		sse4 sTestData(testData + x);
		//
		for (int xx = 0; xx < 32; xx++)
		{
			Compute<sse4>(sTestData);
		}
		//
		sTestData.write(testData + x);

or

		avx4 sTestData(testData + x);
		//
		for (int xx = 0; xx < 32; xx++)
		{
			Compute<avx4>(sTestData);
		}
		//
		sTestData.write(testData + x);
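
To pick between the two paths at runtime you would normally gate on CPU support first. A sketch, assuming JUCE’s SystemStats CPU flags and a hypothetical processBlock<T> wrapping the loops above:

	// Dispatch to the widest vector width the CPU actually supports.
	if (juce::SystemStats::hasAVX())
		processBlock<avx4>(testData, X_SIZE);  // 8 floats per iteration
	else
		processBlock<sse4>(testData, X_SIZE);  // 4 floats per iteration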

I forgot AVX requires 32-byte alignment…

JUCE_ALIGN(32) class avx4

So, for my table of floats, should I align it to 32 bytes so it is both SSE and AVX compatible, or will 16 bytes do for both?
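
(For reference: any 32-byte-aligned address is also 16-byte aligned, so one 32-byte-aligned table covers both paths. A sketch, with a made-up array name:)

	// 32-byte alignment satisfies AVX (32) and, by extension, SSE (16) aligned loads.
	JUCE_ALIGN(32) static float testData[X_SIZE];

	// Standard C++11 equivalent, without the JUCE macro:
	alignas(32) static float testData2[X_SIZE];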

Edit: fixed the problems.

template <class T>
forcedinline T Compute(const T& sTestData)
{
	return (((sTestData + 1.0f) * sTestData) - 0.001f) * 2.82f;
}
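
Since Compute now returns the result instead of assigning through a reference, the call site changes accordingly; a sketch based on the earlier usage:

	sse4 sTestData(testData + x);
	//
	for (int xx = 0; xx < 32; xx++)
	{
		sTestData = Compute<sse4>(sTestData);  // assign the returned value back
	}
	//
	sTestData.write(testData + x);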

// -------------------------------------------------------------------------------------------------------------------
// -------------------------------------------------------------------------------------------------------------------
// -------------------------------------------------------------------------------------------------------------------
// -------------------------------------------------------------------------------------------------------------------

// Same wrapper as before, now const-correct, so temporaries (like Compute's return value) bind to operator=.
JUCE_ALIGN(16) class sse4
{
public:
	__m128 v;
	//
	forcedinline sse4(const float x) : v(_mm_set1_ps(x)) { };
	forcedinline sse4(const float *px) : v(_mm_load_ps(px)) { };
	forcedinline sse4(const __m128 v) : v(v) { };
	forcedinline void write(float* target) { _mm_store_ps(target, v); };
	forcedinline void set(const sse4 value) { v = value.v; };
	forcedinline void operator = (const sse4& _v2) { v = _v2.v; };
	forcedinline void operator = (const float* _v2) { v = _mm_load_ps(_v2); }
};

forcedinline sse4 operator + (const sse4 &l, const sse4 &r) { return sse4(_mm_add_ps(l.v, r.v)); }
forcedinline sse4 operator - (const sse4 &l, const sse4 &r) { return sse4(_mm_sub_ps(l.v, r.v)); }
forcedinline sse4 operator * (const sse4 &l, const sse4 &r) { return sse4(_mm_mul_ps(l.v, r.v)); }
forcedinline sse4 operator / (const sse4 &l, const sse4 &r) { return sse4(_mm_div_ps(l.v, r.v)); }
forcedinline sse4 operator + (const sse4 &l, const float &r) { return sse4(_mm_add_ps(l.v, _mm_set1_ps(r))); }
forcedinline sse4 operator - (const sse4 &l, const float &r) { return sse4(_mm_sub_ps(l.v, _mm_set1_ps(r))); }
forcedinline sse4 operator * (const sse4 &l, const float &r) { return sse4(_mm_mul_ps(l.v, _mm_set1_ps(r))); }
forcedinline sse4 operator / (const sse4 &l, const float &r) { return sse4(_mm_div_ps(l.v, _mm_set1_ps(r))); }

Instructions: SSE/AVX / Size of Buffer: 209715200 bytes (200.00 MB)

C++ Time: 18.2475 seconds
SSE (no loop) Time: 4.6734 seconds
SSE (with loop) Time: 4.6704 seconds
SSE JUCE SIMD (no loop) Time: 5.4686 seconds
SSE JUCE SIMD (with loop) Time: 5.3563 seconds
SSE OWN SIMD A (with loop) Time: 4.6749 seconds
SSE OWN SIMD A (no loop) Time: 4.6605 seconds
SSE OWN SIMD B (with loop) Time: 4.6732 seconds
SSE OWN SIMD B (with loop and using 'set') Time: 4.6696 seconds
SSE OWN SIMD B (no loop) Time: 4.6525 seconds
SSE OWN SIMD B (with loop, 'set' and direct math) Time: 4.6778 seconds
SSE OWN SIMD B (with loop and direct math) Time: 4.6727 seconds
AVX OWN (with loop and using 'set') Time: 2.7665 seconds
AVX OWN (with loop, 'set' and direct math) Time: 2.7718 seconds
AVX OWN (with loop and direct math) Time: 2.7692 seconds
SSE OWN (function call) Time: 4.6859 seconds
AVX OWN (function call) Time: 2.7908 seconds
SSE JUCE SIMD (with loop) (*) Again, just in case Time: 5.3515 seconds