Simple SSE wrapper

patrickkunz · February 7, 2014, 8:14am

Maybe this helps someone get started with cross-platform SSE optimization without cluttering the code too much, while keeping it maintainable and avoiding big additional libraries.

This class helps to process two double values in parallel (for example, a stereo signal). The same is possible with four floats. It seems to work on Windows and OS X.

#ifndef __SseHelper__h
#define __SseHelper__h

#include <immintrin.h>

// SSE vector of two doubles.
// GCC and clang expect the alignment attribute after the class keyword.
#if JUCE_WINDOWS
class __declspec(align(16)) vec2
#else
class __attribute__((aligned (16))) vec2
#endif
{
public:
	typedef double T;
	enum { N = 2 };
 
	__m128d v;
 
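	vec2() { }                                                  // v is left uninitialized
	vec2(double x) : v(_mm_set1_pd(x)) { }                      // both lanes = x
	vec2(double p1, double p2) : v(_mm_set_pd(p1, p2)) { }      // p1 -> high lane, p2 -> low lane
	vec2(const double *px) : v(_mm_load_pd(px)) { }             // px must be 16-byte aligned
	vec2(__m128d v) : v(v) { }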
};

// Arithmetic on vec2 uses the double-precision (_pd) intrinsics.
inline vec2 operator + (const vec2 &l, const vec2 &r)
{
    return vec2(_mm_add_pd(l.v, r.v));
}
inline vec2 operator - (const vec2 &l, const vec2 &r)
{
    return vec2(_mm_sub_pd(l.v, r.v));
}
inline vec2 operator * (const vec2 &l, const vec2 &r)
{
    return vec2(_mm_mul_pd(l.v, r.v));
}
inline vec2 operator / (const vec2 &l, const vec2 &r)
{
    return vec2(_mm_div_pd(l.v, r.v));
}

#endif

You can use the code this way:

// store stereo input values in SSE register 
vec2 a(*sampleL, *sampleR); 

// define a constant value to multiply 
vec2 b(0.5); 

// calculate as if the signal were mono (in practice you need many more operations for this to pay off)
vec2 result = a * b / a;

// store values from the SSE register back to local doubles
// (_mm_set_pd(p1, p2) puts p1 in the high lane and p2 in the low lane;
// _mm_store1_pd would write two doubles and overrun a single double)
double rL; double rR;
_mm_storeh_pd(&rL, result.v);
_mm_storel_pd(&rR, result.v);

// write the stereo values to the output 
*sampleL = rL; 
*sampleR = rR;

It's possible to load and store values directly from an array, but you have to make sure the array is aligned to 16 bytes; otherwise, use the SSE instructions that load and store unaligned values (for example _mm_loadu_pd and _mm_storeu_pd).
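As a rough sketch of the aligned and unaligned variants: the helper names `isAligned16` and `scaleStereoFrames` are made up for illustration, and the interleaved L/R buffer layout is an assumption, not something from the class above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h> // SSE2: __m128d, _mm_load_pd, _mm_loadu_pd, _mm_store_pd, _mm_storeu_pd

// Hypothetical helper: true if p sits on a 16-byte boundary.
inline bool isAligned16(const void* p)
{
    return (reinterpret_cast<std::uintptr_t>(p) & 15u) == 0;
}

// Hypothetical helper: scale an interleaved stereo buffer (L, R, L, R, ...) in place.
// Each __m128d holds one frame: L in the low lane, R in the high lane.
inline void scaleStereoFrames(double* interleaved, std::size_t numFrames, double gain)
{
    assert(interleaved != nullptr || numFrames == 0);
    const __m128d g = _mm_set1_pd(gain);

    if (isAligned16(interleaved))
    {
        // A frame is 16 bytes, so an aligned base keeps every frame aligned.
        for (std::size_t i = 0; i < numFrames; ++i)
        {
            double* p = interleaved + 2 * i;
            _mm_store_pd(p, _mm_mul_pd(_mm_load_pd(p), g));
        }
    }
    else
    {
        // Unaligned forms accept any address.
        for (std::size_t i = 0; i < numFrames; ++i)
        {
            double* p = interleaved + 2 * i;
            _mm_storeu_pd(p, _mm_mul_pd(_mm_loadu_pd(p), g));
        }
    }
}
```

Declaring the buffer as `alignas(16) double buf[...]` is one way to end up on the aligned path.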

Any input is welcome.

chkn · February 7, 2014, 9:25am

looks great!

kunitoki · February 11, 2014, 9:16am

I started a SIMD wrapper library some time ago and will continue development in the coming months (right now I'm a bit busy with another project).

It's basically a convenient "math" wrapper abstraction that works on buffers of data (int/float/double). It should autodetect the running CPU's features and pick the fastest implementation available (it's also possible to force a particular SIMD instruction set). The project will implement all common buffer operations (copy, add, mult, swap), some operations specific to audio (peak rms, min max, feedback check, basic filtering, pan laws, dry wet mixing, power spectrum, and so on), and some for image manipulation.

https://github.com/kunitoki/waterspout

Maybe other people are also interested in joining...

Btw, any ideas on how we could make this better are appreciated :)

lorcan · February 12, 2014, 6:42pm

Thanks a lot Kraken, your library seems very interesting.

FYI, here are a few other libraries similar to yours that you might want to have a look at too.

Vc: Vector Classes (LGPL) http://code.compeng.uni-frankfurt.de/projects/vc
Seems quite mature and in active development
Nova SIMD (GPL) https://github.com/timblechmann/nova-simd
Looks very good but GPL unfortunately
Metascale NT2 / Boost.simd (boost license) https://github.com/MetaScale/nt2
Looks very promising, covers many domains, is actively developed, and will probably be included in a future revision of Boost
Vecmathlib (MIT) https://bitbucket.org/eschnett/vecmathlib/wiki/Home
SLEEF http://shibatch.sourceforge.net

Cheers,

Lorcan

Peter_Emanuel_Roos · February 13, 2014, 6:25am

JUCE also uses SIMD instructions; I guess this is not enough for your purposes?

Have you tried the Intel Performance Primitives? I use them for a very fast convolution engine and they offer all the stuff that you are mentioning.

Cheers

jrrossi · September 15, 2015, 4:44pm

Can you provide some examples of using the SSE builtins?

Do they work fine with GCC and clang? Do I have to make sure the buffer is aligned before using it?
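On the compiler and alignment questions, a minimal sketch that is not from the original posts: the `_mm_*` intrinsics come from headers such as `<xmmintrin.h>`, which GCC, clang and MSVC all ship. `_mm_load_ps` needs a 16-byte aligned address, while `_mm_loadu_ps` does not. The helper name `applyGain` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h> // SSE: __m128, _mm_set1_ps, _mm_loadu_ps, _mm_mul_ps, _mm_storeu_ps

// Hypothetical helper: multiply n floats by gain in place.
// It uses the unaligned load/store forms, so any float pointer is accepted.
inline void applyGain(float* data, std::size_t n, float gain)
{
    assert(data != nullptr || n == 0);
    const __m128 g = _mm_set1_ps(gain);

    std::size_t i = 0;

    // Main loop: four floats per iteration.
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(data + i, _mm_mul_ps(_mm_loadu_ps(data + i), g));

    // Scalar tail for the remaining 0 to 3 values.
    for (; i < n; ++i)
        data[i] *= gain;
}
```

If the buffer is known to be 16-byte aligned, the aligned forms `_mm_load_ps` and `_mm_store_ps` can be used in the main loop instead.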

WilliamkWusik · June 17, 2019, 9:40pm

Talk about resurrecting an old thread. But I loved this idea, and I’m adding my own spice to it. I may release it as a free JUCE module. How should I credit the original author?

Here’s what I have done so far to test a few things out.

JUCE_ALIGN(16) class sse4
{
public:
	typedef float T;
	__m128 v;
	//
	forcedinline sse4(float x) : v(_mm_set1_ps(x)) { }            // all four lanes = x
	forcedinline sse4(const float *px) : v(_mm_load_ps(px)) { }   // px must be 16-byte aligned
	forcedinline sse4(__m128 v) : v(v) { }
	forcedinline void write(float* target) const { _mm_store_ps(target, v); }   // target must be 16-byte aligned
	forcedinline void set(sse4 value) { v = value.v; }
};

forcedinline sse4 operator + (const sse4 &l, const sse4 &r)
{
	return sse4(_mm_add_ps(l.v, r.v));
}

forcedinline sse4 operator - (const sse4 &l, const sse4 &r)
{
	return sse4(_mm_sub_ps(l.v, r.v));
}
forcedinline sse4 operator * (const sse4 &l, const sse4 &r)
{
	return sse4(_mm_mul_ps(l.v, r.v));
}

forcedinline sse4 operator / (const sse4 &l, const sse4 &r)
{
	return sse4(_mm_div_ps(l.v, r.v));
}

WilliamkWusik · June 17, 2019, 9:41pm

How do I add an assignment (=) operator? Say I have two sse4 variables and I want to write var1 = var2, or even var1 = var2 + var3 * var4?
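One possible answer, as a sketch: a class that only holds an `__m128` and declares no `operator=` of its own gets a compiler-generated copy assignment, so both forms should already compile. The cut-down `sse4` below uses plain `inline` instead of JUCE's `forcedinline`, an unaligned store in `write`, and a made-up demo function `evaluateExample`; none of that is from the post above.

```cpp
#include <cassert>
#include <xmmintrin.h> // SSE: __m128, _mm_set1_ps, _mm_add_ps, _mm_mul_ps, _mm_storeu_ps

// Cut-down stand-in for the sse4 class (assumption: no JUCE macros).
struct sse4
{
    __m128 v;

    sse4(float x) : v(_mm_set1_ps(x)) { }
    sse4(__m128 x) : v(x) { }

    // No operator= is declared here: the implicit copy assignment copies v.
    void write(float* target) const { _mm_storeu_ps(target, v); }
};

inline sse4 operator + (const sse4& l, const sse4& r) { return sse4(_mm_add_ps(l.v, r.v)); }
inline sse4 operator * (const sse4& l, const sse4& r) { return sse4(_mm_mul_ps(l.v, r.v)); }

// Hypothetical demo: returns lane 0 of var2 + var3 * var4.
inline float evaluateExample()
{
    sse4 var1(0.0f), var2(1.0f), var3(2.0f), var4(3.0f);

    var1 = var2;                 // plain copy assignment
    var1 = var2 + var3 * var4;   // the temporary result is assigned the same way

    float out[4];
    var1.write(out);
    assert(out[0] == out[3]);    // every lane holds the same value here
    return out[0];
}
```

Compound forms such as `+=` are not generated automatically and would need their own operator overloads.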

jules · June 18, 2019, 7:10am

erm… or just use juce::dsp::SIMDRegister?

https://docs.juce.com/master/structdsp_1_1SIMDRegister.html
