Simple SSE wrapper


#1

Maybe this helps someone get started with cross-platform SSE optimization without cluttering the code too much, while keeping it maintainable and avoiding big additional libraries.

This class helps to process two double values in parallel (for example, a stereo signal). The same is possible with four floats. It seems to work on Windows and OS X.

#ifndef SSE_HELPER_H
#define SSE_HELPER_H

#include <immintrin.h>

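// SSE vector of two doubles.
// The alignment attribute has to follow the class keyword; GCC and clang
// ignore it (with a warning) when it is placed in front.
#if JUCE_WINDOWS
class __declspec(align(16)) vec2
#else
class __attribute__((aligned (16))) vec2
#endif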
{
public:
	typedef double T;
	enum { N = 2 };
 
	__m128d v;
 
	vec2() { }
	vec2(double x) : v(_mm_set1_pd(x)) { }                  // both lanes = x
	vec2(double p1, double p2) : v(_mm_set_pd(p1, p2)) { }  // p1 = high lane, p2 = low lane
	vec2(const double *px) : v(_mm_load_pd(px)) { }         // px must be 16-byte aligned
	vec2(__m128d v) : v(v) { }
};

// Element-wise arithmetic on both doubles (the _pd intrinsics).
inline vec2 operator + (const vec2 &l, const vec2 &r)
{
    return vec2(_mm_add_pd(l.v, r.v));
}
inline vec2 operator - (const vec2 &l, const vec2 &r)
{
    return vec2(_mm_sub_pd(l.v, r.v));
}
inline vec2 operator * (const vec2 &l, const vec2 &r)
{
    return vec2(_mm_mul_pd(l.v, r.v));
}
inline vec2 operator / (const vec2 &l, const vec2 &r)
{
    return vec2(_mm_div_pd(l.v, r.v));
}

#endif

You can use the code this way:

// store stereo input values in SSE register 
vec2 a(*sampleL, *sampleR); 

// define a constant value to multiply 
vec2 b(0.5); 

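// process both channels as if the signal were mono (this only pays off with many more operations than shown here)
vec2 result = a * b / a;

// store values from SSE registers back to local doubles
// (the two-argument constructor puts its first argument in the high lane;
// _mm_store1_pd would write two doubles and overrun a single double)
double rL; double rR;
_mm_storeh_pd(&rL, result.v);
_mm_storel_pd(&rR, result.v);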

// write the stereo values to the output 
*sampleL = rL; 
*sampleR = rR;
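
For the four-float case mentioned above, a minimal sketch could look like the following. The class layout mirrors vec2; the store() helper and the lane order chosen for the four-argument constructor are assumptions made for this example, not part of the original header.

```cpp
#include <xmmintrin.h>  // SSE: __m128 and the _mm_*_ps intrinsics

// Hypothetical float counterpart of vec2: four floats in one SSE register.
class vec4
{
public:
    typedef float T;
    enum { N = 4 };

    __m128 v;

    vec4() { }
    vec4(float x) : v(_mm_set1_ps(x)) { }            // all four lanes = x
    vec4(float a, float b, float c, float d)
        : v(_mm_setr_ps(a, b, c, d)) { }              // a = lowest lane, d = highest
    vec4(__m128 v) : v(v) { }

    // Copy the four lanes to out[0..3]; out does not need to be aligned.
    void store(float *out) const { _mm_storeu_ps(out, v); }
};

inline vec4 operator + (const vec4 &l, const vec4 &r) { return vec4(_mm_add_ps(l.v, r.v)); }
inline vec4 operator - (const vec4 &l, const vec4 &r) { return vec4(_mm_sub_ps(l.v, r.v)); }
inline vec4 operator * (const vec4 &l, const vec4 &r) { return vec4(_mm_mul_ps(l.v, r.v)); }
inline vec4 operator / (const vec4 &l, const vec4 &r) { return vec4(_mm_div_ps(l.v, r.v)); }
```

With this, something like (a + vec4(1.0f)) * vec4(0.5f) processes four samples at once, and store() writes them back to a plain float array.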

It's also possible to load and store values directly from and to an array, but you have to make sure the array is aligned to 16 bytes, or use the SSE intrinsics that can load and store unaligned values, such as _mm_loadu_pd and _mm_storeu_pd.
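
As a rough sketch of the array case: the helper below scales an interleaved stereo buffer in place and picks the aligned or unaligned intrinsics depending on the buffer address. The function names isAligned16 and applyGain and the interleaved layout are assumptions for this example only.

```cpp
#include <emmintrin.h>  // SSE2: __m128d and the _mm_*_pd intrinsics
#include <cstddef>
#include <cstdint>

// True if p sits on a 16-byte boundary, as _mm_load_pd / _mm_store_pd require.
inline bool isAligned16(const void *p)
{
    return (reinterpret_cast<std::uintptr_t>(p) & 15u) == 0;
}

// Hypothetical helper: scale an interleaved stereo buffer (L R L R ...) in place.
inline void applyGain(double *interleaved, std::size_t numFrames, double gain)
{
    const __m128d g = _mm_set1_pd(gain);
    const bool aligned = isAligned16(interleaved);

    for (std::size_t i = 0; i < numFrames; ++i)
    {
        // One frame is 2 doubles = 16 bytes, so if the first frame is
        // aligned, every following frame is aligned as well.
        double *frame = interleaved + 2 * i;

        if (aligned)
            _mm_store_pd(frame, _mm_mul_pd(_mm_load_pd(frame), g));
        else
            _mm_storeu_pd(frame, _mm_mul_pd(_mm_loadu_pd(frame), g));
    }
}
```

The unaligned variants work on any address, so using them everywhere is the simpler choice if you cannot control how the buffer was allocated; the aligned path is only an optimization.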

Any input is welcome.


#2

looks great!


#3

Some time ago I started a SIMD wrapper library, and I will continue development in the coming months (right now I'm a bit busy with another project).

It's basically a convenient "math" abstraction that works on buffers of data (int/float/double). It should autodetect the features of the running CPU and pick the fastest implementation available (it's also possible to force a particular SIMD instruction set). The project will implement all common buffer operations (copy, add, mult, swap), some operations specific to audio (peak rms, min max, feedback check, basic filtering, pan laws, dry wet mixing, power spectrum, and so on), and some for image manipulation.

    https://github.com/kunitoki/waterspout

Maybe other people are also interested in joining...

Btw, any ideas on how we could make this better are appreciated :)
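
To illustrate the general idea of runtime detection and dispatch (this is not the library's actual API, just a minimal sketch with made-up names such as cpuHasSse and pickAdd), one could check for SSE once and then call through a function pointer. It assumes an x86 target where the compiler is allowed to emit SSE code (the default on x86-64; 32-bit GCC builds may need -msse).

```cpp
#include <cstddef>
#include <xmmintrin.h>   // SSE: __m128 and the _mm_*_ps intrinsics
#if defined(_MSC_VER)
 #include <intrin.h>     // __cpuid
#endif

// Returns true if the CPU we are running on reports SSE support.
inline bool cpuHasSse()
{
#if defined(_MSC_VER)
    int info[4];
    __cpuid(info, 1);
    return (info[3] & (1 << 25)) != 0;          // CPUID leaf 1, EDX bit 25 = SSE
#elif defined(__GNUC__) || defined(__clang__)
    return __builtin_cpu_supports("sse") != 0;  // GCC/clang builtin, x86 only
#else
    return false;                               // unknown compiler: stay scalar
#endif
}

typedef void (*AddFn)(const float *, const float *, float *, std::size_t);

// Plain fallback: out[i] = a[i] + b[i].
inline void addScalar(const float *a, const float *b, float *out, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}

// SSE version: four floats per iteration, scalar loop for the remainder.
inline void addSse(const float *a, const float *b, float *out, std::size_t n)
{
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(out + i, _mm_add_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i)));
    for (; i < n; ++i)
        out[i] = a[i] + b[i];
}

// Pick the implementation once; callers keep the returned pointer around.
inline AddFn pickAdd()
{
    return cpuHasSse() ? addSse : addScalar;
}
```

A caller would do something like AddFn add = pickAdd(); once at startup and then call add(a, b, out, n) for every buffer, so the feature check stays out of the audio loop.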


#4

Thanks a lot Kraken, your library seems very interesting.

FYI, here are a few other libraries similar to yours that you might want to have a look at too.

  • Vc: Vector Classes (LGPL) http://code.compeng.uni-frankfurt.de/projects/vc
    Seems quite mature and in active development
  • Nova SIMD (GPL) https://github.com/timblechmann/nova-simd
    Looks very good but GPL unfortunately
  • Metascale NT2 / Boost.simd (boost license) https://github.com/MetaScale/nt2
    Looks very promising, covers many domains, is actively developed and will probably be included in a future revision of Boost
  • Vecmathlib (MIT) https://bitbucket.org/eschnett/vecmathlib/wiki/Home
  • SLEEF http://shibatch.sourceforge.net

Cheers,

Lorcan

 


#5

JUCE also uses SIMD instructions; I guess this is not enough for your purposes?

Have you tried the Intel Performance Primitives? I use them for a very fast convolution engine and they offer all the stuff that you are mentioning.

Cheers


#6

Can you provide some examples of using SSE builtins?

Do they work fine with GCC and clang? Do I have to make sure the buffer is aligned before using it?