Simple SSE wrapper

Maybe this helps someone start writing cross-platform SSE optimizations without messing up the code too much, keeping it maintainable and avoiding big additional libraries.

This class can help to calculate two double values in parallel (for example, the two channels of a stereo signal). The same is also possible with four floats. It seems to work on Windows and OSX.

#ifndef SSE_HELPER_H
#define SSE_HELPER_H

#include <immintrin.h> // x86 SIMD intrinsics, including SSE2 (__m128d and the *_pd functions)

// SSE vector holding two doubles.
// __m128d is already 16-byte aligned on MSVC, GCC and Clang, so the class
// needs no extra alignment attribute (and no per-platform #if).
class vec2
{
public:
	typedef double T;
	enum { N = 2 };

	__m128d v;

	vec2() { }
	vec2(double x) : v(_mm_set1_pd(x)) { }                 // both lanes = x
	vec2(double p1, double p2) : v(_mm_set_pd(p1, p2)) { } // p1 -> upper lane, p2 -> lower lane
	vec2(const double *px) : v(_mm_load_pd(px)) { }        // px must be 16-byte aligned
	vec2(__m128d v) : v(v) { }
};

// Element-wise arithmetic on both lanes at once (double-precision "_pd" intrinsics).
inline vec2 operator + (const vec2 &l, const vec2 &r)
{
    return vec2(_mm_add_pd(l.v, r.v));
}
inline vec2 operator - (const vec2 &l, const vec2 &r)
{
    return vec2(_mm_sub_pd(l.v, r.v));
}
inline vec2 operator * (const vec2 &l, const vec2 &r)
{
    return vec2(_mm_mul_pd(l.v, r.v));
}
inline vec2 operator / (const vec2 &l, const vec2 &r)
{
    return vec2(_mm_div_pd(l.v, r.v));
}

#endif
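The intro says the same is possible with four floats. A minimal float counterpart is sketched below, assuming SSE is available (always the case on x86-64). The name `vec4` is borrowed from the float operators in the original post, but the class body and the `store` helper are illustrative additions, not part of the original code.

```cpp
#include <xmmintrin.h> // SSE intrinsics (__m128 and the *_ps functions)

// Four floats processed in parallel (illustrative counterpart to vec2).
class vec4
{
public:
    typedef float T;
    enum { N = 4 };

    __m128 v;

    vec4() { }
    vec4(float x) : v(_mm_set1_ps(x)) { }          // all four lanes = x
    vec4(const float *px) : v(_mm_load_ps(px)) { } // px must be 16-byte aligned
    vec4(__m128 v) : v(v) { }

    // Hypothetical helper: store all four lanes to a 16-byte aligned array.
    void store(float *px) const { _mm_store_ps(px, v); }
};

inline vec4 operator + (const vec4 &l, const vec4 &r) { return vec4(_mm_add_ps(l.v, r.v)); }
inline vec4 operator - (const vec4 &l, const vec4 &r) { return vec4(_mm_sub_ps(l.v, r.v)); }
inline vec4 operator * (const vec4 &l, const vec4 &r) { return vec4(_mm_mul_ps(l.v, r.v)); }
inline vec4 operator / (const vec4 &l, const vec4 &r) { return vec4(_mm_div_ps(l.v, r.v)); }
```

Usage mirrors vec2, e.g. `vec4 y = x * vec4(2.0f) + vec4(1.0f);` processes four samples per operation.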

You can use the code this way:

// store stereo input values in an SSE register (left -> upper lane, right -> lower lane)
vec2 a(*sampleL, *sampleR);

// define a constant value to multiply by
vec2 b(0.5);

// calculate as if it were mono (you need many more operations than this for the SIMD version to pay off)
vec2 result = a * b / a;

// store the values from the SSE register back to local doubles
double rL; double rR;
_mm_storeh_pd(&rL, result.v); // upper lane -> left
_mm_store_sd(&rR, result.v);  // lower lane -> right

// write the stereo values to the output
*sampleL = rL;
*sampleR = rR;

It's possible to load and store values directly from an array, but `_mm_load_pd`/`_mm_store_pd` require the array to be aligned to 16 bytes (not bits). For unaligned data, use the unaligned intrinsics `_mm_loadu_pd`/`_mm_storeu_pd` instead.
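To illustrate the point about arrays, here is a minimal sketch (the function name `applyGain` is hypothetical) that scales a buffer of doubles two samples at a time. It uses the unaligned intrinsics so it works for any buffer address, and handles an odd-length tail with scalar code.

```cpp
#include <emmintrin.h> // SSE2: __m128d, _mm_loadu_pd, _mm_storeu_pd, _mm_mul_pd
#include <cstddef>

// Multiply numSamples doubles in place by gain, two samples per SSE2 operation.
// _mm_loadu_pd/_mm_storeu_pd accept any address; if the buffer is known to be
// 16-byte aligned, _mm_load_pd/_mm_store_pd could be used instead.
inline void applyGain(double *buffer, std::size_t numSamples, double gain)
{
    const __m128d g = _mm_set1_pd(gain);
    std::size_t i = 0;

    for (; i + 2 <= numSamples; i += 2)
    {
        __m128d x = _mm_loadu_pd(buffer + i);
        _mm_storeu_pd(buffer + i, _mm_mul_pd(x, g));
    }

    // scalar tail for an odd number of samples
    for (; i < numSamples; ++i)
        buffer[i] *= gain;
}
```

For stack arrays you control, declaring them as `alignas(16) double buf[N];` guarantees the alignment needed by the aligned variants.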

Any input is welcome.


looks great!

Some time ago I started a SIMD wrapper library, and I will continue developing it in the coming months (right now I'm a bit busy with another project).

It's basically a convenient "math" wrapper abstraction that works on buffers of data (int/float/double). It should autodetect the features of the running CPU and pick the fastest available implementation (it's also possible to force the use of a particular SIMD instruction set). The project will implement all common buffer operations (copy, add, multiply, swap), some operations specific to audio (peak, RMS, min/max, feedback check, basic filtering, pan laws, dry/wet mixing, power spectrum, and so on) and some for image manipulation.
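CPU autodetection of this kind usually comes down to checking the CPU once and picking a function pointer. Below is a minimal sketch of that idea for one audio operation (the peak level of a float buffer). It assumes GCC or Clang on x86-64, where `__builtin_cpu_supports` exists and SSE is always enabled at compile time. The function names are hypothetical, and this is not waterspout's actual API. For wider instruction sets such as AVX, the SIMD function would also need to be compiled for that target, e.g. with `__attribute__((target("avx")))`.

```cpp
#include <xmmintrin.h> // SSE: __m128 and the *_ps intrinsics
#include <cmath>
#include <cstddef>

// Scalar reference: largest absolute sample value.
static float peakScalar(const float *data, std::size_t n)
{
    float peak = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        peak = std::fmax(peak, std::fabs(data[i]));
    return peak;
}

// SSE version: four samples per iteration, abs() done by clearing the sign bit.
static float peakSse(const float *data, std::size_t n)
{
    const __m128 signMask = _mm_set1_ps(-0.0f); // only the sign bit set
    __m128 peak4 = _mm_setzero_ps();
    std::size_t i = 0;

    for (; i + 4 <= n; i += 4)
    {
        __m128 x = _mm_andnot_ps(signMask, _mm_loadu_ps(data + i)); // |x|
        peak4 = _mm_max_ps(peak4, x);
    }

    // reduce the four lanes to one value, then handle the remaining samples
    alignas(16) float lanes[4];
    _mm_store_ps(lanes, peak4);
    float peak = std::fmax(std::fmax(lanes[0], lanes[1]), std::fmax(lanes[2], lanes[3]));
    for (; i < n; ++i)
        peak = std::fmax(peak, std::fabs(data[i]));
    return peak;
}

using PeakFn = float (*)(const float *, std::size_t);

// Pick an implementation once, based on what the running CPU supports.
static PeakFn choosePeak()
{
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse") ? peakSse : peakScalar;
}

float peak(const float *data, std::size_t n)
{
    static const PeakFn impl = choosePeak(); // chosen on first call only
    return impl(data, n);
}
```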

    https://github.com/kunitoki/waterspout

Maybe other people are also interested in joining...

Btw, any ideas on how we could make this better are appreciated :)

Thanks a lot Kraken, your library seems very interesting.

FYI, here are a few other libraries similar to yours that you might want to have a look at too.

  • Vc: Vector Classes (LGPL) http://code.compeng.uni-frankfurt.de/projects/vc
    Seems quite mature and in active development
  • Nova SIMD (GPL) https://github.com/timblechmann/nova-simd
    Looks very good but GPL unfortunately
  • Metascale NT2 / Boost.simd (boost license) https://github.com/MetaScale/nt2
    Looks very promising, covers many domains, is actively developed and will probably be included in a future revision of Boost
  • Vecmathlib (MIT) https://bitbucket.org/eschnett/vecmathlib/wiki/Home
  • SLEEF http://shibatch.sourceforge.net

Cheers,

Lorcan

 

JUCE also uses SIMD instructions; I guess that's not enough for your purposes?

Have you tried Intel Integrated Performance Primitives (IPP)? I use it for a very fast convolution engine, and it offers all the operations you are mentioning.

Cheers

Can you provide some examples of using SSE builtins (intrinsics)?

Do they work fine with GCC and Clang? Do I have to make sure the buffer is aligned before using it?
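A sketch answering these questions: when targeting x86/x86-64, SSE intrinsics come from the same headers (`<xmmintrin.h>`, `<emmintrin.h>`, or `<immintrin.h>` for all of them) on GCC, Clang and MSVC, and the calls are identical. Alignment only matters for the aligned variants: `_mm_load_ps`/`_mm_store_ps` need a 16-byte aligned address, while `_mm_loadu_ps`/`_mm_storeu_ps` accept any address. The helper names `isAligned16` and `addInPlace` are hypothetical.

```cpp
#include <xmmintrin.h> // SSE intrinsics, same header on GCC, Clang and MSVC (x86/x86-64)
#include <cstddef>
#include <cstdint>

// True if p can be passed to the aligned intrinsics (_mm_load_ps / _mm_store_ps).
inline bool isAligned16(const void *p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Adds b to a (n floats each), four at a time.
// Uses the aligned intrinsics when both pointers allow it, the unaligned ones otherwise.
inline void addInPlace(float *a, const float *b, std::size_t n)
{
    const bool aligned = isAligned16(a) && isAligned16(b);
    std::size_t i = 0;

    for (; i + 4 <= n; i += 4)
    {
        __m128 x = aligned ? _mm_load_ps(a + i) : _mm_loadu_ps(a + i);
        __m128 y = aligned ? _mm_load_ps(b + i) : _mm_loadu_ps(b + i);
        __m128 sum = _mm_add_ps(x, y);
        if (aligned) _mm_store_ps(a + i, sum);
        else         _mm_storeu_ps(a + i, sum);
    }

    for (; i < n; ++i) // scalar tail
        a[i] += b[i];
}
```

Stack arrays can be declared `alignas(16) float buf[8];`; for heap memory an aligned allocator such as `_mm_malloc`/`_mm_free` can be used.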

Talk about resurrecting an old thread. :wink: But I loved this idea, and I'm adding my own spice to it. I may release it as a free JUCE module. How should I credit the original author?

Here’s what I have done so far to test a few things out.

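class sse4 // __m128 is already 16-byte aligned, so no JUCE_ALIGN is needed
{
public:
	typedef float T;
	__m128 v;
	// constructors: broadcast one float, load four floats from a 16-byte aligned pointer, or wrap a register
	forcedinline sse4(float x) : v(_mm_set1_ps(x)) { }
	forcedinline sse4(float *px) : v(_mm_load_ps(px)) { }
	forcedinline sse4(__m128 v) : v(v) { }
	// store all four lanes; target must be 16-byte aligned
	forcedinline void write(float* target) const { _mm_store_ps(target, v); }
	forcedinline void set(sse4 value) { v = value.v; }
};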

forcedinline sse4 operator + (const sse4 &l, const sse4 &r)
{
	return sse4(_mm_add_ps(l.v, r.v));
}

forcedinline sse4 operator - (const sse4 &l, const sse4 &r)
{
	return sse4(_mm_sub_ps(l.v, r.v));
}
forcedinline sse4 operator * (const sse4 &l, const sse4 &r)
{
	return sse4(_mm_mul_ps(l.v, r.v));
}

forcedinline sse4 operator / (const sse4 &l, const sse4 &r)
{
	return sse4(_mm_div_ps(l.v, r.v));
}

How do I add an assignment (=) operator? Let's say I have two sse4 variables and I want to write var1 = var2, or even var1 = var2 + var3 * var4?
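For this question, plain `=` needs no extra code: the compiler generates a copy assignment operator that copies the `__m128` member, so `var1 = var2 + var3 * var4` already works with the free operators above. What the class lacks is a default constructor, so `sse4 var1;` on its own does not compile. Compound operators such as `+=` can be added as members. The sketch below is self-contained, so it uses plain `inline` instead of JUCE's `forcedinline`; the default constructor and the `+=`/`*=` members are additions, not part of the original class (`-` and `/` follow the same pattern).

```cpp
#include <xmmintrin.h> // SSE intrinsics (__m128 and the *_ps functions)

class sse4
{
public:
    typedef float T;
    __m128 v;

    sse4() : v(_mm_setzero_ps()) { }                // added: allows "sse4 var1;"
    sse4(float x) : v(_mm_set1_ps(x)) { }
    sse4(const float *px) : v(_mm_load_ps(px)) { }  // px must be 16-byte aligned
    sse4(__m128 v) : v(v) { }
    void write(float *target) const { _mm_store_ps(target, v); } // target must be 16-byte aligned

    // Compound assignment; plain "=" is generated by the compiler.
    sse4 &operator += (const sse4 &r) { v = _mm_add_ps(v, r.v); return *this; }
    sse4 &operator *= (const sse4 &r) { v = _mm_mul_ps(v, r.v); return *this; }
};

inline sse4 operator + (const sse4 &l, const sse4 &r) { return sse4(_mm_add_ps(l.v, r.v)); }
inline sse4 operator * (const sse4 &l, const sse4 &r) { return sse4(_mm_mul_ps(l.v, r.v)); }
```

With this, `sse4 var1; var1 = var2 + var3 * var4; var1 += var2;` compiles and operates on all four lanes.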

erm… or just use juce::dsp::SIMDRegister?

https://docs.juce.com/master/structdsp_1_1SIMDRegister.html