Maybe this helps someone get started with cross-platform SSE optimization without cluttering the code too much, while keeping it maintainable and avoiding big additional libraries.

This class helps to process two double values in parallel (for example, a stereo signal). The same is also possible with four floats, using `__m128` and the `_ps` intrinsics in place of `__m128d` and `_pd`. It seems to work on Windows and OSX.

```cpp
#ifndef SSE_HELPER_H
#define SSE_HELPER_H

#include <immintrin.h>

// SSE vector of two doubles.
#if JUCE_WINDOWS
class __declspec(align(16)) vec2
#else
class __attribute__((aligned(16))) vec2
#endif
{
public:
    typedef double T;
    enum { N = 2 };
    __m128d v;

    vec2() { }
    vec2(double x) : v(_mm_set1_pd(x)) { }                   // same value in both lanes
    vec2(double p1, double p2) : v(_mm_set_pd(p1, p2)) { }   // p1 -> upper lane, p2 -> lower lane
    vec2(double *px) : v(_mm_load_pd(px)) { }                // px must be 16-byte aligned
    vec2(__m128d v) : v(v) { }
};

// Element-wise arithmetic on both lanes at once.
inline vec2 operator + (const vec2 &l, const vec2 &r) { return vec2(_mm_add_pd(l.v, r.v)); }
inline vec2 operator - (const vec2 &l, const vec2 &r) { return vec2(_mm_sub_pd(l.v, r.v)); }
inline vec2 operator * (const vec2 &l, const vec2 &r) { return vec2(_mm_mul_pd(l.v, r.v)); }
inline vec2 operator / (const vec2 &l, const vec2 &r) { return vec2(_mm_div_pd(l.v, r.v)); }

#endif
```

You can use the code this way:

```cpp
// store stereo input values in SSE register
vec2 a(*sampleL, *sampleR);

// define a constant value to multiply
vec2 b(0.5);

// calculate things like it would be mono
// (you need to make a lot more operations; otherwise this makes no sense)
vec2 result = a * b / a;

// store values from SSE registers back to local doubles;
// vec2(p1, p2) keeps p1 in the upper lane and p2 in the lower lane
double rL;
double rR;
_mm_storeh_pd(&rL, result.v);   // upper lane -> left
_mm_storel_pd(&rR, result.v);   // lower lane -> right

// write the stereo values to the output
*sampleL = rL;
*sampleR = rR;
```
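For the four-float variant mentioned at the start, a minimal sketch could look like the following. This is an assumption about how such a class might be written, not the original code: the `store` helper, the four-argument constructor and the use of `_mm_setr_ps` (so that the first argument lands in the lowest lane) are illustrative choices, and the pointer constructor uses an unaligned load to stay safe for any address.

```cpp
#include <immintrin.h>

// Hypothetical four-float counterpart of vec2 (illustrative sketch only).
class vec4
{
public:
    typedef float T;
    enum { N = 4 };
    __m128 v;

    vec4() { }
    vec4(float x) : v(_mm_set1_ps(x)) { }                  // same value in all four lanes
    vec4(float p0, float p1, float p2, float p3)
        : v(_mm_setr_ps(p0, p1, p2, p3)) { }               // p0 ends up in the lowest lane
    vec4(const float *px) : v(_mm_loadu_ps(px)) { }        // unaligned load, any address is fine
    vec4(__m128 v) : v(v) { }

    // Hypothetical helper: write the four lanes to out[0..3], no alignment required.
    void store(float *out) const { _mm_storeu_ps(out, v); }
};

// Element-wise arithmetic on all four lanes at once.
inline vec4 operator + (const vec4 &l, const vec4 &r) { return vec4(_mm_add_ps(l.v, r.v)); }
inline vec4 operator - (const vec4 &l, const vec4 &r) { return vec4(_mm_sub_ps(l.v, r.v)); }
inline vec4 operator * (const vec4 &l, const vec4 &r) { return vec4(_mm_mul_ps(l.v, r.v)); }
inline vec4 operator / (const vec4 &l, const vec4 &r) { return vec4(_mm_div_ps(l.v, r.v)); }
```

With this, something like `(a * b + a).store(out)` processes four samples per operation, which may be useful for four mono channels or four consecutive samples of one channel.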

It's possible to load and store values directly from an array, but you have to make sure that the array is aligned to 16 bytes (as `_mm_load_pd` and `_mm_store_pd` require), or you can use the unaligned SSE intrinsics `_mm_loadu_pd` and `_mm_storeu_pd` instead.
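As a rough sketch of the array case, the function below multiplies a buffer of doubles by a gain, two values per iteration, using the unaligned load and store intrinsics so the buffer can start at any address. The name `applyGain` and its parameters are made up for illustration, and it uses the raw intrinsics directly so that it compiles on its own.

```cpp
#include <immintrin.h>
#include <cstddef>

// Hypothetical helper: multiply every sample in `data` by `gain`, two doubles at a time.
inline void applyGain(double *data, std::size_t count, double gain)
{
    const __m128d g = _mm_set1_pd(gain);   // gain in both lanes
    std::size_t i = 0;

    // Main loop: unaligned load/store, so `data` does not need 16-byte alignment.
    for (; i + 2 <= count; i += 2)
    {
        __m128d x = _mm_loadu_pd(data + i);
        _mm_storeu_pd(data + i, _mm_mul_pd(x, g));
    }

    // Scalar tail for an odd number of samples.
    for (; i < count; ++i)
        data[i] *= gain;
}
```

If the buffer is known to be 16-byte aligned (for example declared with `alignas(16)` in C++11), `_mm_load_pd` and `_mm_store_pd` can be used in the loop instead.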

Any input is welcome.