Divide by SIMDRegister

Is there a way to divide by a SIMDRegister? Can I calculate 1/SIMDReg somehow, or does anybody know of a good workaround?

There’s div in SSE but not in NEON, so that’s probably why it’s not supported in SIMDRegister.
If you’re only building for SSE devices then you could just add the / operators to SIMDRegister.
Alternatively, this will work without changing JUCE code:

SIMDRegister<float> quotient = _mm_div_ps(numerator.value, denominator.value);

thank you, that’s perfect and works nicely!

Genuine question (showing my naiveté): why can’t the SIMDRegister type implement an operator/= which uses _mm_div_ps when built against SSE, and when built against NEON or AVX has a workaround implementation which unpacks the register, performs the division for each value, packs the results back into the register, then returns?

Obviously, I’m assuming, that would be a performance hit, but if you’re building against NEON, trying to use the SIMDRegister type, and need a 1/x division operation, wouldn’t you have to write something just like this anyway?
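To make the unpack/divide/repack idea concrete, here is a minimal sketch. Vec4 and divideFallback are hypothetical stand-ins invented for this example; this is not JUCE's SIMDRegister or its actual fallback code.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-in for a 4-lane float register (not juce::dsp::SIMDRegister).
struct Vec4
{
    std::array<float, 4> lanes;
};

// Scalar fallback: read each lane, divide it, and write the result back.
inline Vec4 divideFallback (const Vec4& numerator, const Vec4& denominator)
{
    Vec4 result {};

    for (std::size_t i = 0; i < result.lanes.size(); ++i)
        result.lanes[i] = numerator.lanes[i] / denominator.lanes[i];

    return result;
}
```

Whether this costs much compared to a native instruction depends on how well the compiler vectorises the loop, so it is worth measuring rather than assuming.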


imho the proper solution would be to support division on SSE platforms and have a compile error on NEON. Trusting a user to know and understand the differences between SIMD platforms kind of defeats the purpose of hiding it behind an API.
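One way such a compile error could be produced is a trait plus static_assert. This is only a sketch: the tags SseTag and NeonA32Tag, the trait HasNativeDivide and the Register type are all made up for illustration and do not exist in JUCE.

```cpp
#include <type_traits>

// Hypothetical platform tags, invented for this example.
struct SseTag {};
struct NeonA32Tag {};

// Trait stating which platforms are assumed to offer a vector divide.
template <typename Platform> struct HasNativeDivide : std::false_type {};
template <> struct HasNativeDivide<SseTag> : std::true_type {};

// Hypothetical 4-lane register tagged with its platform.
template <typename Platform>
struct Register
{
    float lanes[4];
};

template <typename Platform>
Register<Platform> operator/ (const Register<Platform>& l, const Register<Platform>& r)
{
    // Using '/' on an unsupported platform stops the build with this message.
    static_assert (HasNativeDivide<Platform>::value,
                   "division is not available on this SIMD platform");

    Register<Platform> q {};

    for (int i = 0; i < 4; ++i)
        q.lanes[i] = l.lanes[i] / r.lanes[i]; // stand-in for the native instruction

    return q;
}
```

With this, Register<SseTag> values divide normally, while writing a / b for Register<NeonA32Tag> fails to compile instead of silently falling back to something slow.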


That’s fair; I would support that solution.

An SSE-only solution is no longer possible for us because of Apple’s switch to ARM. How can we proceed with rewriting the code without that fundamental operator?

In my opinion it should be there, and be computed without SIMD registers when the processor does not support it.

Any workarounds or other solutions are welcome! Is it possible to calculate the division some other way, maybe via the reciprocal 1 / x?
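A reciprocal can be built from multiplies and subtracts alone with Newton-Raphson refinement, r ← r · (2 − x · r), which roughly squares the relative error on each step. The sketch below is a portable scalar illustration only. The function names are hypothetical, the starting guess uses a commonly quoted integer bit trick, and zero, infinities, NaN and denormals are deliberately not handled. On 32-bit NEON the vrecpeq_f32 and vrecpsq_f32 intrinsics play the roles of the first guess and the (2 − x · r) term.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Rough first guess for 1/x from an integer bit trick.
// Assumes x is a positive, normal, finite IEEE-754 single-precision float.
inline float reciprocalEstimate (float x)
{
    std::uint32_t bits;
    std::memcpy (&bits, &x, sizeof (bits));
    bits = 0x7EF311C7u - bits; // approximately negates the exponent
    float estimate;
    std::memcpy (&estimate, &bits, sizeof (estimate));
    return estimate;
}

// Refines the guess with Newton-Raphson steps: r = r * (2 - x * r).
// Not valid for zero, infinities, NaN or denormals.
inline float reciprocal (float x, int iterations = 3)
{
    const float magnitude = std::fabs (x);
    float r = reciprocalEstimate (magnitude);

    for (int i = 0; i < iterations; ++i)
        r = r * (2.0f - magnitude * r);

    return std::copysign (r, x); // restore the sign of the input
}

// Lane-wise a / b expressed as a * (1 / b), using only multiply and subtract.
inline std::array<float, 4> divideViaReciprocal (const std::array<float, 4>& a,
                                                 const std::array<float, 4>& b)
{
    std::array<float, 4> result {};

    for (std::size_t i = 0; i < result.size(); ++i)
        result[i] = a[i] * reciprocal (b[i]);

    return result;
}
```

The result is an approximation and will not always match a true division bit for bit, so it only suits code that tolerates a small relative error.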

A64 has FDIV and the vdiv intrinsics for floating-point.

How does this help? How can I use this?

The whole SIMDRegister is useless for us without the division operator or some way to calculate the reciprocal (1/x), and it looks to me as if adding this feature by modifying the module is not an easy or maintainable task.

Are there other ways to do this? For example, can the reciprocal be calculated with SIMDRegister and the JUCE helper functions? Or does someone have a completely different solution that works across different CPUs?

Is this something that is already supported in most ARM CPUs?

ARM’s NEON has floating-point division intrinsics such as vdivq_f32.
You can look them up here. I don’t know, though, whether they are ARMv8-only or whether ARMv7 also supports them.

Thanks for the information. 64-bit division seems to be there too. So it looks like almost all processors support this.

@t0m: Can we have the division feature for SIMDRegister? You could throw a compile error if it’s not supported by the CPU, as mentioned above.

v8 only. It’s in the A64 instruction set.

I’d say yes, though juce_neon_SIMDNativeOps seems to use A32 intrinsics only. It’s weird that division is not implemented for SSE or AVX. I don’t know if the fallbacks are selected per operation or for the whole set; if it is the latter, that may be a reason to exclude operations that some sets don’t have, like division in A32.

Take it with a grain of salt, but I recall the reason was that NEON didn’t have division at the time the wrapper was implemented. A SIMD division wrapper backed only by SSE/AVX intrinsics made little sense, since the purpose of these wrappers is to let you use them and forget about which platform you are coding for.

It’s clear that the NEON wrapper was made for A32, but there’s a SIMDFallbackOps struct to handle these cases. There could be a SIMDNativeOps::div that calls SIMDFallbackOps::div for NEON. Many things are available in some sets only, like fma, or 256-bit vectors.
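The per-operation forwarding could look roughly like this. FallbackOps and NativeOpsNoDivide are hypothetical names that only mirror the idea; they are not JUCE's SIMDFallbackOps or SIMDNativeOps, and their signatures are assumptions.

```cpp
#include <cstddef>

// Hypothetical scalar fallback, loosely modelled on the idea of a fallback ops struct.
template <typename T, std::size_t N>
struct FallbackOps
{
    struct Vec { T v[N]; };

    static Vec add (Vec a, Vec b)
    {
        Vec r {};
        for (std::size_t i = 0; i < N; ++i) r.v[i] = a.v[i] + b.v[i];
        return r;
    }

    static Vec div (Vec a, Vec b)
    {
        Vec r {};
        for (std::size_t i = 0; i < N; ++i) r.v[i] = a.v[i] / b.v[i];
        return r;
    }
};

// Hypothetical "native" ops for a target that lacks a vector divide.
struct NativeOpsNoDivide
{
    using Fallback = FallbackOps<float, 4>;
    using Vec = Fallback::Vec;

    // A real implementation would call the platform's add intrinsic here;
    // the fallback is used only to keep the sketch portable.
    static Vec add (Vec a, Vec b) { return Fallback::add (a, b); }

    // No native divide on this target, so forward this one operation to the fallback.
    static Vec div (Vec a, Vec b) { return Fallback::div (a, b); }
};
```

The point of the design is that only the missing operation pays the scalar cost, while everything else on that target can stay native.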


So, it’s time to add the division feature?

I was able to overload the division operator for the data types I needed without changing the JUCE library code.
This way I can use all the features of the SIMDRegister struct plus division. I haven’t been able to test the ARM version yet, but we will see soon whether it works.

Here is the code:

#pragma once

#include "../JuceLibraryCode/JuceHeader.h"

using vec4 = juce::dsp::SIMDRegister<float>;
using vec2 = juce::dsp::SIMDRegister<double>;

#if defined(__i386__) || defined(__amd64__) || defined(_M_X64) || defined(_X86_) || defined(_M_IX86)
inline vec4 operator / (const vec4 &l, const vec4 &r)
{
    return _mm_div_ps(l.value, r.value);
}

inline vec2 operator / (const vec2 &l, const vec2 &r)
{
    // Two packed doubles need the _pd intrinsic, not the packed-float _ps one.
    return _mm_div_pd(l.value, r.value);
}

#elif defined(_M_ARM64) || defined (__arm64__) || defined (__aarch64__)
inline vec4 operator / (const vec4 &l, const vec4 &r)
{
    return vdivq_f32(l.value, r.value);
}

inline vec2 operator / (const vec2 &l, const vec2 &r)
{
    return vdivq_f64(l.value, r.value);
}

#else
 #error "SIMD register support not implemented for this platform"
#endif

I still hope the division operator will be added at some point. Any input is welcome.

edit: fixed ARM specific code


I think for ARM it should be vdivq_f32 and vdivq_f64. The plain vdiv_f32 and vdiv_f64 work on 64-bit vectors (float32x2_t, float64x1_t). Also, they are only available on A64, so the guard should be:

#elif defined(_M_ARM64) || defined (__arm64__) || defined (__aarch64__)

Thanks for the fixes!
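For anyone who wants to try the intrinsics outside JUCE, here is a standalone sketch of the same platform switch working on plain arrays. The function names divide4 and divide2 and the EXAMPLE_* macros are made up for this example, and a scalar branch is added as an assumption so it still builds on other targets.

```cpp
#if defined(__SSE2__) || defined(_M_X64)
 #include <emmintrin.h>
 #define EXAMPLE_HAS_SSE2 1
#elif defined(__aarch64__) || defined(_M_ARM64)
 #include <arm_neon.h>
 #define EXAMPLE_HAS_A64_NEON 1
#endif

// Divides four floats lane by lane: out[i] = num[i] / den[i].
inline void divide4 (const float* num, const float* den, float* out)
{
#if defined(EXAMPLE_HAS_SSE2)
    _mm_storeu_ps (out, _mm_div_ps (_mm_loadu_ps (num), _mm_loadu_ps (den)));
#elif defined(EXAMPLE_HAS_A64_NEON)
    vst1q_f32 (out, vdivq_f32 (vld1q_f32 (num), vld1q_f32 (den)));
#else
    for (int i = 0; i < 4; ++i) // scalar path for any other target
        out[i] = num[i] / den[i];
#endif
}

// Divides two doubles lane by lane: out[i] = num[i] / den[i].
inline void divide2 (const double* num, const double* den, double* out)
{
#if defined(EXAMPLE_HAS_SSE2)
    _mm_storeu_pd (out, _mm_div_pd (_mm_loadu_pd (num), _mm_loadu_pd (den)));
#elif defined(EXAMPLE_HAS_A64_NEON)
    vst1q_f64 (out, vdivq_f64 (vld1q_f64 (num), vld1q_f64 (den)));
#else
    for (int i = 0; i < 2; ++i) // scalar path for any other target
        out[i] = num[i] / den[i];
#endif
}
```

Unaligned loads and stores are used so the arrays need no special alignment.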