SIMD Register size


#1

Is there any way of checking programmatically whether the SIMD register size is 128-bit, or whether 256-bit AVX is supported?


#2

Use SystemStats::hasAVX() (or any similar methods) to check for CPU features at runtime.

https://docs.juce.com/master/classSystemStats.html#ac8b5ff1c9505f12bca684fce44f514b1

For compile-time checking, there’s no standard cross-compiler, cross-platform way of checking CPU features that I’m aware of, and I don’t think JUCE has one either.
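For what it's worth, here is a rough sketch of the non-standard route: most compilers predefine macros that reflect the instruction sets enabled by the build flags. The function names `compiledSimdBytes` and `compiledFloatLanes` are made up for illustration, and the macro list is an assumption about common GCC, Clang and MSVC behaviour, not something JUCE provides. It only reports what the code was compiled for, not what the CPU it runs on supports.

```cpp
#include <cstddef>

// Widest SIMD register (in bytes) that the compiler was allowed to target.
// Relies on compiler-predefined macros, which are not part of the C++ standard.
// MSVC signals SSE2 through _M_X64 / _M_IX86_FP rather than __SSE2__.
constexpr std::size_t compiledSimdBytes()
{
#if defined(__AVX512F__)
    return 64;
#elif defined(__AVX__)
    return 32;
#elif defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
    return 16;
#elif defined(__ARM_NEON) || defined(__ARM_NEON__)
    return 16;
#else
    return sizeof(float); // no known SIMD target: treat as scalar
#endif
}

// Number of 32-bit floats that fit in one such register.
constexpr std::size_t compiledFloatLanes()
{
    return compiledSimdBytes() / sizeof(float);
}
```

With SSE2 code generation this gives 4 float lanes, with AVX it gives 8.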


#3

Thanks, I was hoping for a method like that. Not bothered about compile time. Just wanted a way of dividing up work evenly for y number of filters across x number of cores…


#4

SIMD is not about number of cores. It’s a Single Instruction that runs on Multiple Data. Since it’s a single instruction it runs on a single core.


#5

Very aware of that… But I essentially want to run 64 x 512-tap FIR filters across 8 cores and parallelise the main multiply-and-sum of each FIR. If I tried this on the audio thread alone it would keel over, hence I'm trying to find a solution…

See my response below for my current benchmarks
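A minimal sketch of the kind of split I have in mind, assuming the filters are handed out in groups of one SIMD register's worth so that no register is shared between two cores. `FilterRange` and `splitFilters` are hypothetical names, and `lanes` would be 4 for SSE or 8 for AVX.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Half-open range [begin, end) of filter indices given to one worker.
struct FilterRange { std::size_t begin; std::size_t end; };

// Split numFilters across at most numWorkers workers, in whole groups of
// `lanes` filters, so that each worker gets a near-equal number of groups.
std::vector<FilterRange> splitFilters(std::size_t numFilters,
                                      std::size_t numWorkers,
                                      std::size_t lanes)
{
    std::vector<FilterRange> ranges;
    if (numFilters == 0 || numWorkers == 0)
        return ranges;

    lanes = std::max<std::size_t>(lanes, 1);
    const std::size_t numGroups = (numFilters + lanes - 1) / lanes; // round up
    const std::size_t workers   = std::min(numWorkers, numGroups);
    const std::size_t base      = numGroups / workers;  // groups every worker gets
    const std::size_t extra     = numGroups % workers;  // first `extra` workers get one more

    std::size_t group = 0;
    for (std::size_t w = 0; w < workers; ++w)
    {
        const std::size_t count = base + (w < extra ? 1 : 0);
        const std::size_t begin = group * lanes;
        group += count;
        const std::size_t end = std::min(group * lanes, numFilters); // last group may be partial
        ranges.push_back({ begin, end });
    }
    return ranges;
}
```

For 64 filters, 8 workers and 4 lanes this yields eight ranges of 8 filters each. The worker count could come from something like `std::thread::hardware_concurrency()`.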


#6

I have created this FIR class in which I interleave the tap coefficients… This works well on my MacBook Pro and my i7 desktop. Will this optimisation always work if the computer has AVX? The example will optimise well if the number of FIR filters (size) is a multiple of 4…

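class SIMDFir {
public:

    int numTaps;     // taps per filter
    int numDomains;  // unused in this class
    int size;        // number of interleaved filters
    int width;       // total interleaved tap count: numTaps * size

    float * taps;    // interleaved taps (not owned): tap t of filter i is taps[ t * size + i ]

    SIMDFir(const int _size, float * _interleavedTaps, const int _numTaps) {

        numTaps = _numTaps;
        size = _size;
        taps = _interleavedTaps;
        width = numTaps * size;
    }
    ~SIMDFir() {}

    // interleavedIn must hold ( numSamples + numTaps - 1 ) frames of `size` floats,
    // because the tap loop reads numTaps frames forward from frame s.
    // interleavedOut is accumulated into (+=), so the caller must zero it first.
    inline void process(const float * interleavedIn, float * interleavedOut, const int numSamples) {

        for( int s = 0; s < numSamples; s++ ) {

            const int sampleOffset = size * s;

            const float * inSamples = &interleavedIn[ sampleOffset ];
            float * outSamples = &interleavedOut[ sampleOffset ];

            // t is the offset of the first filter's coefficient for each tap
            for( int t = 0; t < width; t = t + size ) {

                // #pragma simd is an Intel compiler hint; other compilers
                // ignore it and rely on their own auto-vectoriser
                #pragma simd
                for( int i = 0; i < size; i++ ) {

                    // hopefully get speed boost here as mem access is unit-stride;
                    // rows only stay aligned if the buffers themselves are aligned
                    // and size is a multiple of the vector width (4 floats for SSE)

                    outSamples[ i ] += inSamples[ t + i ] * taps[ t + i ];
                }
            }
        }
    }

};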
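To make the memory layout explicit, here is a small standalone sketch of how the interleaved buffers could be built and run. It uses the same loop nest as `process` above but as free functions over `std::vector`; `interleaveTaps` and `runInterleavedFir` are hypothetical helper names, and it assumes every filter has the same number of taps.

```cpp
#include <cstddef>
#include <vector>

// Interleave per-filter taps so that tap t of filter f lands at index
// t * numFilters + f. Assumes all filters have the same tap count.
std::vector<float> interleaveTaps(const std::vector<std::vector<float>>& tapsPerFilter)
{
    const std::size_t numFilters = tapsPerFilter.size();
    const std::size_t numTaps = numFilters ? tapsPerFilter[0].size() : 0;

    std::vector<float> out(numFilters * numTaps);
    for (std::size_t f = 0; f < numFilters; ++f)
        for (std::size_t t = 0; t < numTaps; ++t)
            out[t * numFilters + f] = tapsPerFilter[f][t];
    return out;
}

// `in` holds (numSamples + numTaps - 1) frames of numFilters floats each.
// As in the class above, tap t is applied to the frame t steps ahead of s.
std::vector<float> runInterleavedFir(const std::vector<float>& in,
                                     const std::vector<float>& taps,
                                     std::size_t numFilters,
                                     std::size_t numTaps,
                                     std::size_t numSamples)
{
    std::vector<float> out(numSamples * numFilters, 0.0f); // zeroed accumulator
    for (std::size_t s = 0; s < numSamples; ++s)
        for (std::size_t t = 0; t < numTaps; ++t)
            for (std::size_t f = 0; f < numFilters; ++f) // unit-stride inner loop
                out[s * numFilters + f] += in[(s + t) * numFilters + f] * taps[t * numFilters + f];
    return out;
}
```

The inner loop walks adjacent floats in all three buffers, which is what lets the compiler process several filters per instruction.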

For a file with 4.5 seconds of audio:
1 FIR filter takes 206107 ns
2 FIR filters take 219708 ns
4 FIR filters take 218369 ns
8 FIR filters take 297672 ns

So it is clearly getting vectorised: 8 filters cost only about 1.4 times as much as 1, which vastly increases the processing power available.


#7

If you compile with SSE2 code generation enabled, then it will work on machines with AVX. If you compile with AVX code generation, it won’t work unless the computer has AVX.

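You can have SSE2 and AVX code in the same binary, but mixing legacy SSE and 256-bit AVX instructions can incur a transition penalty when execution switches back and forth. I haven't done this myself, so I'm not sure how costly it is in practice.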
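As a rough, untested-in-production sketch of what "both in the same binary" could look like: compile a baseline kernel with the default flags, compile a second copy that is allowed to use AVX, and pick one at runtime. This assumes GCC or Clang on x86 (the `target` attribute and `__builtin_cpu_supports` are compiler extensions); elsewhere it simply falls back to the baseline. The names `dotBaseline`, `dotAvx` and `chooseDot` are made up, and in a JUCE project the check could be `SystemStats::hasAVX()` instead.

```cpp
#include <cstddef>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
  #define SKETCH_X86_GNU 1
#else
  #define SKETCH_X86_GNU 0
#endif

// Baseline kernel: built with the project's default flags (e.g. SSE2 on x86-64).
float dotBaseline(const float* a, const float* b, std::size_t n)
{
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}

#if SKETCH_X86_GNU
// Same source, but the compiler may emit AVX instructions for this function only.
// It must never be called on a CPU without AVX.
__attribute__((target("avx")))
float dotAvx(const float* a, const float* b, std::size_t n)
{
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}
#endif

using DotFn = float (*)(const float*, const float*, std::size_t);

// Pick the kernel once (e.g. at start-up) and keep the function pointer.
DotFn chooseDot()
{
#if SKETCH_X86_GNU
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx"))
        return dotAvx;
#endif
    return dotBaseline;
}
```

Choosing once and keeping whole blocks of work on one side of the dispatch also avoids bouncing between the two code paths inside a hot loop.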

Also, why not use the juce::dsp::SIMDRegister class? It will automatically use SSE2, AVX2, or NEON depending on your compile settings.


#8

I’m a stickler for knowing exactly what’s going on under the hood and I don’t quite get how that class works. I have been playing around with it for the last couple of days but can’t wield it well enough yet. It’s a shortcoming on my part.

On the flip side though I did want to learn how to code in a way that takes advantage of this kind of optimisation and it’s much clearer now that I have managed to refactor some of my own DSP classes.