SIMD runtime dispatch avoiding Auto-Vectorization

Hello Devs,

I have a project that heavily relies on SIMD vectorization/inlining (using something similar to the JUCE SIMDRegister abstraction), including AVX instructions.

Since AVX is not available on older processors or on Apple Silicon, I currently ship separate binaries for different architectures. I’d like to evaluate the possibility of refactoring the code so that a single binary dispatches at runtime to the correct implementation for the machine it is running on.

I’m currently trying to get this working on Windows/Visual Studio first; I’ll move on to Xcode if I succeed.

The problem (please correct me if I’m wrong) is that in Visual Studio I need to specify /arch:AVX in order to access the AVX instructions, but this has the side effect of being taken into account by the Auto-Vectorizer, as stated by Microsoft:

The Auto-Vectorizer analyzes loops in your code, and uses the vector registers and instructions on the target computer to execute them, if it can. This can improve the performance of your code. The compiler targets the SSE2, AVX, and AVX2 instructions in Intel or AMD processors, or the NEON instructions on ARM processors, according to the /arch switch.

Now, if I dispatch at runtime to the SSE version but the compiler has internally vectorized a scalar (FPU) loop using AVX because I set /arch:AVX to enable AVX support, this clearly won’t work on a machine without AVX.

My question is: how can I enable AVX support at compile time while having the Auto-Vectorizer only use the minimum supported instruction set (e.g. just SSE in this case)?

Thanks in advance,

It’s actually quite easy. With MSVC, you can always call the AVX and SSE intrinsics directly, no matter which /arch flag you passed. For Clang and GCC, there is __attribute__((target(arch))), which has to be added to any function that uses an architecture-specific intrinsic, with arch being a string like "avx" or "avx2".

In our in-house vector wrapper we use this macro to make that portable:

Example usage:
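A macro of this kind, together with a function that uses it, might look like the following sketch. `SIMD_TARGET` and `scaleAvx` are hypothetical names chosen for illustration, not the in-house wrapper’s actual code, and the x86 guard is an added assumption so the file still compiles on ARM.

```cpp
#include <cstddef>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>

// Hypothetical portability macro: on GCC/Clang it enables the given
// instruction set for a single function; on MSVC it expands to nothing,
// because MSVC accepts the intrinsics regardless of the /arch flag.
#if defined(__GNUC__) || defined(__clang__)
  #define SIMD_TARGET(arch) __attribute__((target(arch)))
#else
  #define SIMD_TARGET(arch)
#endif

// Example usage: multiply n floats by gain, eight at a time with AVX.
// Only call this after a runtime check has confirmed AVX support.
SIMD_TARGET("avx")
void scaleAvx(float* data, std::size_t n, float gain) {
    const __m256 g = _mm256_set1_ps(gain);
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8)
        _mm256_storeu_ps(data + i, _mm256_mul_ps(_mm256_loadu_ps(data + i), g));
    for (; i < n; ++i)  // leftover elements that do not fill a full vector
        data[i] *= gain;
}
#endif
```

The rest of the translation unit is still compiled for the baseline instruction set, so only the annotated function may contain AVX instructions.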

Of course, when doing things like that, it’s now your job to write proper dispatching code that only takes a supported code path at runtime.
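A minimal sketch of such dispatching code, assuming GCC or Clang on x86, where `__builtin_cpu_supports` is available. The names `sumAvx`, `sumScalar`, `chooseSum` and `sum` are hypothetical. On other compilers this sketch simply falls back to the scalar path; with MSVC a CPUID-based check would be needed instead, which is not shown here.

```cpp
#include <cstddef>

#if (defined(__x86_64__) || defined(__i386__)) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>
#define HAVE_AVX_PATH 1

// AVX implementation: only this function is compiled with AVX enabled.
__attribute__((target("avx")))
float sumAvx(const float* x, std::size_t n) {
    __m256 acc = _mm256_setzero_ps();
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(x + i));
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    float s = 0.0f;
    for (float v : lanes) s += v;      // horizontal sum of the 8 lanes
    for (; i < n; ++i) s += x[i];      // leftover elements
    return s;
}
#else
#define HAVE_AVX_PATH 0
#endif

// Baseline implementation, safe on every machine.
float sumScalar(const float* x, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i) s += x[i];
    return s;
}

using SumFn = float (*)(const float*, std::size_t);

// Query the CPU and return the best implementation it supports.
SumFn chooseSum() {
#if HAVE_AVX_PATH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx")) return sumAvx;
#endif
    return sumScalar;
}

// Public entry point: the choice is made once, on the first call.
float sum(const float* x, std::size_t n) {
    static const SumFn impl = chooseSum();
    return impl(x, n);
}
```

Dispatching through a function pointer at a coarse level (a whole buffer rather than a single operation) keeps the cost of the indirect call small compared to the work done per call.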

Thanks a lot for the reply!

Unfortunately, the solution you suggest is not that easy to implement in my case, as I’m using Clang in both Visual Studio and Xcode.

As you probably already know, the problem is inlining. Everything in my code is wrapped in a “PackedType” class, and its methods need to be inlined to avoid call overhead. But if I call any of the inlined PackedType methods from a method of another class (which includes having a PackedType as a member variable of another class, since that implicitly calls the default constructor), then I need to annotate that method with __attribute__((target(arch))) as well, or it won’t compile. That forces me to refactor a lot of additional code and makes it a mess, because I use PackedType in many of my math solvers and utility classes.
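To illustrate how the annotation spreads, here is a small sketch assuming GCC or Clang on x86. `PackedFloat8`, `GainSolver` and `TARGET_AVX` are hypothetical stand-ins, not my actual classes.

```cpp
#include <cstddef>

#if (defined(__x86_64__) || defined(__i386__)) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>

#define TARGET_AVX __attribute__((target("avx")))

// Hypothetical stand-in for the PackedType wrapper: every method that
// touches an AVX intrinsic needs the target attribute.
struct PackedFloat8 {
    __m256 v;

    TARGET_AVX static PackedFloat8 load(const float* p) {
        return PackedFloat8{_mm256_loadu_ps(p)};
    }
    TARGET_AVX PackedFloat8 operator*(const PackedFloat8& o) const {
        return PackedFloat8{_mm256_mul_ps(v, o.v)};
    }
    TARGET_AVX void store(float* p) const { _mm256_storeu_ps(p, v); }
};

// A class that merely uses the wrapper ends up annotated too.
struct GainSolver {
    float gain;

    // process() calls an AVX intrinsic directly, so removing TARGET_AVX
    // here is a compile error. If the wrapper methods were force-inlined,
    // the same would apply to every method that calls them.
    TARGET_AVX void process(float* block8) const {
        const PackedFloat8 g{_mm256_set1_ps(gain)};
        (PackedFloat8::load(block8) * g).store(block8);
    }
};
#endif
```

Every solver or utility class built on top of the wrapper would need the same treatment, which is the refactoring burden I’m worried about.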