SIMDRegister - feedback and questions

Just trying to port a bit of DSP to the juce::SIMDRegister stuff to make it easier to do an ARM version.

  1. There’s no _mm_set1_ps(...) equivalent in SIMDRegister for __m128 floats. Is this an oversight, or is _mm_load1_ps equally fast for some reason I don’t understand?

  2. Can we have _mm_cvtepi32_ps() and its equivalents added as operations? I’m using it to do some hackery (see below) for a fast log and exp function.

  3. Has anyone found a way to do CPU dispatching for Intel with the SIMDRegister stuff?

  4. The documentation is completely missing for some relatively non-obvious functions, e.g. swapevenodd.

  5. OMG, there’s no division at all! See: Divide by SIMDRegister - #20 by kunz

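    #include <emmintrin.h> // SSE2: __m128, __m128i, _mm_cvtepi32_ps

    // Reinterprets the same 128 bits as either 4 floats or 4 32-bit ints.
    union Uni {
        __m128 asFloat;
        __m128i asInt; // signedness shouldn't matter: the sign bit is clear for positive inputs
    };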
    
    
    /**
     * Approximately calculate log2.
     *
     * This uses the approximation:
     *   union { float f; uint32_t i; } vx = { x }; float y = vx.i;
     *   y *= 1.0 / (1 << 23); return y - 126.94269504f;
     */
    static __m128 log2 (__m128 x)
    {
        static const auto f = _mm_set1_ps (1.0f / float (1 << 23));
    
        Uni xu;
        xu.asFloat = x;
        __m128 y = _mm_cvtepi32_ps (xu.asInt);
        y = _mm_mul_ps (y, f);
    
        return _mm_sub_ps (y, _mm_set1_ps (126.94269504f));
    }
    

Same here. I was able to rewrite our DSP code by adding the division operator, but I haven’t been able to translate our SSE exp approximation to the JUCE SIMD classes so far.

Also, be careful when using the JUCE SIMD classes. You may have to use the clang compiler on Windows as well. The standard VS compiler was not able to strip the wrapper away in our case, and the result was much slower. There is also a forum post about this.


For SIMD, I don’t really think the JUCE SIMD register classes are necessarily the best tool available to us… if you’re on an Apple platform, there is little reason not to use the vDSP library. For everything else, I would recommend MIPP: GitHub - aff3ct/MIPP: MIPP is a portable wrapper for SIMD instructions written in C++11. It supports NEON, SSE, AVX and AVX-512.

It’s completely portable, and features a much richer instruction set than the JUCE SIMD classes.


Do you know how it compares to xsimd?

Well, we need code that runs on both Windows and Mac, which makes using vDSP awkward.

I think my port to SIMDRegister last night was probably a mistake. I did get it working, but it’s a bit of a mess, and your note about Windows failing to inline things properly worries me.

This (GitHub - DLTcollab/sse2neon: A translator from Intel SSE intrinsics to Arm/Aarch64 NEON implementation) looks useful, and very easy! I may try it first and see how it looks 🙂

With MIPP it says the width of the registers is dependent on the architecture. I’m guessing, but can’t find the documentation to confirm it, that on the M1 we will get a 128-bit register holding 4 floats? If the build architecture supports 256-bit registers, will it still build code that’s expecting 4 floats? It’s a bit unclear to me 🙂
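To illustrate the register-width question: NEON registers (as on the M1) are 128 bits wide, so 4 floats, while AVX registers are 256 bits, so 8 floats. The sketch below is not MIPP's code. It assumes a wrapper that picks its width at compile time from the compiler's predefined macros, and the names `kRegisterBits`, `kFloatsPerRegister` and `scaleInPlace` are hypothetical. The point is that code written against such a wrapper should loop in steps of the lane count rather than hard-coding 4.

```cpp
#include <cstddef>

// Hypothetical compile-time width selection, driven by the build flags.
#if defined(__AVX512F__)
constexpr std::size_t kRegisterBits = 512;
#elif defined(__AVX__)
constexpr std::size_t kRegisterBits = 256;
#elif defined(__SSE2__) || defined(__ARM_NEON) || defined(_M_X64) || defined(_M_ARM64)
constexpr std::size_t kRegisterBits = 128;
#else
constexpr std::size_t kRegisterBits = 32;   // scalar fallback: one float at a time
#endif

constexpr std::size_t kFloatsPerRegister = kRegisterBits / (8 * sizeof (float));

// Width-agnostic loop: full register-sized chunks first, then the leftover tail.
inline void scaleInPlace (float* data, std::size_t n, float gain)
{
    std::size_t i = 0;

    for (; i + kFloatsPerRegister <= n; i += kFloatsPerRegister)
        for (std::size_t lane = 0; lane < kFloatsPerRegister; ++lane)
            data[i + lane] *= gain;         // stands in for one vector multiply

    for (; i < n; ++i)
        data[i] *= gain;                    // scalar tail
}
```

Under this assumption, code that expects exactly 4 floats would break when built with AVX enabled, so any hard-coded lane counts would need replacing with the wrapper's own width constant.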