Just trying to port a bit of DSP to the juce::SIMDRegister stuff to make it easier to do an ARM version.
There’s no _mm_set1_ps(...) in SIMDRegister for __m128 sized floats. Is this an oversight or is _mm_load1_ps equally fast for some reason I don’t understand?
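For what it's worth, SIMDRegister does have a scalar broadcast, `SIMDRegister<float>::expand (value)`, and as far as I can tell the SSE backend implements it with `_mm_load1_ps`. The sketch below only checks that the two intrinsics produce the same register contents; it says nothing about speed, which depends on what the compiler does with each. The function name `broadcastsAgree` is made up for illustration, and the non-SSE branch is just a placeholder so the snippet builds elsewhere.

```cpp
#include <cstring>

#if defined(__SSE2__) || defined(_M_X64)
 #include <emmintrin.h>
#endif

// Hypothetical helper: returns true if _mm_set1_ps and _mm_load1_ps
// fill all four lanes with the same bit pattern for the given value.
bool broadcastsAgree (float value)
{
#if defined(__SSE2__) || defined(_M_X64)
    const __m128 viaSet1  = _mm_set1_ps (value);   // broadcast from a scalar
    const __m128 viaLoad1 = _mm_load1_ps (&value); // broadcast from memory

    float a[4], b[4];
    _mm_storeu_ps (a, viaSet1);
    _mm_storeu_ps (b, viaLoad1);

    return std::memcmp (a, b, sizeof (a)) == 0;
#else
    (void) value;
    return true; // no SSE on this target, so there is nothing to compare
#endif
}
```

With optimisation enabled, compilers are free to turn either form into the same shuffle or broadcast instruction, so checking the generated assembly is the only reliable way to answer the speed question.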
Can we have _mm_cvtepi32_ps() and its equivalents added as operations? I’m using it to do some hackery; see below for a fast log and exp function.
Has anyone found a way to do CPU dispatching for Intel with the SIMDRegister stuff?
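On dispatching: as far as I can tell, SIMDRegister picks its native type at compile time from the compiler flags, so runtime dispatch would mean building the same kernel more than once and choosing between the builds at startup. Here is a minimal sketch of that pattern under those assumptions. All names (`GainKernel`, `applyGainBaseline`, `applyGainAvx2`, `cpuHasAvx2`, `chooseGainKernel`) are hypothetical, and the "AVX2" kernel is only a stand-in; in a real project it would live in its own source file compiled with AVX2 enabled.

```cpp
#include <cstddef>

using GainKernel = void (*) (float* data, std::size_t count, float gain);

// Baseline kernel: built with the project's default flags.
void applyGainBaseline (float* data, std::size_t count, float gain)
{
    for (std::size_t i = 0; i < count; ++i)
        data[i] *= gain;
}

// Stand-in for the same kernel built in a separate file with AVX2 enabled.
// The body is identical here so that the sketch runs on any machine.
void applyGainAvx2 (float* data, std::size_t count, float gain)
{
    for (std::size_t i = 0; i < count; ++i)
        data[i] *= gain;
}

// Runtime feature check. Uses the GCC/Clang builtins on x86 and falls
// back to "not supported" everywhere else (other compilers need their
// own CPUID query).
bool cpuHasAvx2()
{
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports ("avx2") != 0;
#else
    return false;
#endif
}

// Pick the kernel once; later calls reuse the cached pointer.
GainKernel chooseGainKernel()
{
    static const GainKernel kernel = cpuHasAvx2() ? applyGainAvx2 : applyGainBaseline;
    return kernel;
}
```

Usage would be `chooseGainKernel() (buffer, numSamples, 0.5f);`. In a JUCE project, `juce::SystemStats::hasAVX2()` could replace the hand-rolled check.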
The documentation is completely missing for some relatively non-obvious functions, e.g. swapevenodd.
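My reading of swapevenodd, going by the name and the shuffle I believe the SSE backend uses, is that it exchanges each even-indexed lane with the odd-indexed lane next to it. Treat that as an assumption until it is checked against the JUCE source. The plain C++ sketch below shows that permutation on an array; `swapEvenOddLanes` is a made-up name and not part of JUCE.

```cpp
#include <array>
#include <cstddef>

// Assumed semantics of swapevenodd: {a0, a1, a2, a3} -> {a1, a0, a3, a2}.
template <std::size_t N>
std::array<float, N> swapEvenOddLanes (std::array<float, N> lanes)
{
    static_assert (N % 2 == 0, "needs an even number of lanes");

    for (std::size_t i = 0; i < N; i += 2)
    {
        const float even = lanes[i];
        lanes[i]     = lanes[i + 1];
        lanes[i + 1] = even;
    }

    return lanes;
}
```

If that reading is right, the obvious use is interleaved complex data, where it turns (re, im) pairs into (im, re) pairs.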
union Uni {
    __m128 asFloat;
    __m128i asInt; //< I don't think it matters that this is signed/unsigned
};
/**
 * Approximately calculate log2.
 *
 * This uses the approximation:
 *   union { float f; uint32_t i; } vx = { x }; float y = vx.i;
 *   y *= 1.0 / (1 << 23); return y - 126.94269504f;
 */
static __m128 log2 (__m128 x)
{
    const static auto f = _mm_set1_ps (1.0f / float (1 << 23));
    Uni xu;
    xu.asFloat = x;
    __m128 y = _mm_cvtepi32_ps (xu.asInt);
    y = _mm_mul_ps (y, f);
    return _mm_sub_ps (y, _mm_set1_ps (126.94269504f));
}
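For anyone wanting to check the trick in isolation, here is a sketch of the same approximation as a scalar reference plus an SSE variant that uses `_mm_castps_si128` instead of the union (reading the inactive member of a union is not strictly defined in C++, although the major compilers accept it). The names `fastLog2Scalar`, `fastLog2Sse` and `fastLog2FirstLane` are made up for this example. Working the formula through for positive, normal inputs gives an absolute error of at most about 0.06. Signedness only matters for negative inputs, where the sign bit is set, and those are outside the domain of log2 anyway.

```cpp
#include <cstdint>
#include <cstring>

#if defined(__SSE2__) || defined(_M_X64)
 #include <emmintrin.h>
 #define FASTLOG2_SKETCH_HAS_SSE2 1
#endif

// Scalar reference: read the float's bit pattern as an integer,
// scale it by 2^-23 and subtract the tuned offset.
inline float fastLog2Scalar (float x)
{
    std::uint32_t bits;
    std::memcpy (&bits, &x, sizeof (bits)); // well-defined type pun

    const float y = static_cast<float> (bits) * (1.0f / float (1 << 23));
    return y - 126.94269504f;
}

#ifdef FASTLOG2_SKETCH_HAS_SSE2
// Same computation on four lanes. _mm_castps_si128 reinterprets the
// register without generating any instructions.
inline __m128 fastLog2Sse (__m128 x)
{
    const __m128 scale = _mm_set1_ps (1.0f / float (1 << 23));
    const __m128 y     = _mm_cvtepi32_ps (_mm_castps_si128 (x));

    return _mm_sub_ps (_mm_mul_ps (y, scale), _mm_set1_ps (126.94269504f));
}
#endif

// Convenience wrapper: uses the SSE path when available, otherwise
// the scalar reference, and returns the result for a single value.
inline float fastLog2FirstLane (float x)
{
#ifdef FASTLOG2_SKETCH_HAS_SSE2
    return _mm_cvtss_f32 (fastLog2Sse (_mm_set1_ps (x)));
#else
    return fastLog2Scalar (x);
#endif
}
```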
Same here. I was able to rewrite our DSP code by adding the division operator, but I wasn’t able to translate our SSE exp approximation with the JUCE SIMD classes so far.
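The exp approximation mentioned here isn't shown, so the following is not that code. It is only a sketch of the inverse of the log2 trick posted above: build the float's bit pattern from the exponent, then reinterpret it. The function name `fastExp2Scalar` and the clamping range are my own choices. In SSE the two key steps would map to `_mm_cvttps_epi32` and `_mm_castsi128_ps`, which is the same float/int conversion that SIMDRegister currently lacks. Working the formula through gives a relative error within roughly 4%.

```cpp
#include <cstdint>
#include <cstring>

// Approximate 2^p by constructing the IEEE-754 bit pattern directly.
inline float fastExp2Scalar (float p)
{
    // Clamp so the constructed exponent field stays finite and non-negative.
    const float clamped = p < -126.0f ? -126.0f
                        : (p > 127.0f ? 127.0f : p);

    // Inverse of the log2 trick: add the offset, scale by 2^23, truncate.
    const auto bits = static_cast<std::uint32_t> (float (1 << 23) * (clamped + 126.94269504f));

    float result;
    std::memcpy (&result, &bits, sizeof (result)); // well-defined type pun
    return result;
}
```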
Also be careful when using the JUCE SIMD classes. You may have to use the Clang compiler on Windows as well. The standard VS compiler was not able to strip the wrapper away in our case, and the result was much slower. There is also a forum post about this.
Well, we need code that runs on both Windows and Mac, which makes using vDSP awkward.
I think my port to SIMDRegister last night was probably a mistake. Although I did get it working, it’s a bit of a mess, and your note about Windows failing to inline things properly worries me.
With MIPP it says the width of the registers is dependent on the architecture. I’m guessing, but I can’t find the documentation to confirm it, that on the M1 we will get a 128-bit register, 4 floats wide? If the build architecture supports 256-bit registers, will it still build code that’s expecting 4 floats? It’s a bit unclear to me.
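On the width question: NEON registers on the M1 are 128 bits, so 4 floats per register is the expected answer there. With wrappers that pick the register type from the build flags, which is how I understand MIPP to work, an AVX build would give 8 floats per register, and code that hard-codes 4 would break. The usual way around that is to write loops against a lane-count constant. The sketch below imitates this with plain C++ and compiler macros; `lanes` and `scaleBlock` are hypothetical names, and the inner loop merely stands in for one vector operation. MIPP, as I understand its README, exposes the equivalent constant as `mipp::N<float>()`.

```cpp
#include <cstddef>

// Lane count chosen at compile time from the target's widest float vector
// unit, mimicking what a width-agnostic SIMD wrapper does.
#if defined(__AVX__)
constexpr std::size_t lanes = 8;  // 256-bit registers
#elif defined(__SSE2__) || defined(_M_X64) || defined(__ARM_NEON)
constexpr std::size_t lanes = 4;  // 128-bit registers (SSE, NEON)
#else
constexpr std::size_t lanes = 1;  // scalar fallback
#endif

// Multiply a block by a gain without assuming a particular register width.
void scaleBlock (float* data, std::size_t count, float gain)
{
    std::size_t i = 0;

    // Main loop: one "register" of `lanes` floats per iteration.
    for (; i + lanes <= count; i += lanes)
        for (std::size_t lane = 0; lane < lanes; ++lane)
            data[i + lane] *= gain;

    // Tail: whatever is left when count is not a multiple of `lanes`.
    for (; i < count; ++i)
        data[i] *= gain;
}
```

Anything that depends on exactly 4 lanes, such as a hand-unrolled 4-tap shuffle, would need either a fixed-width type or a separate code path.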