SIMD runtime dispatch avoiding Auto-Vectorization

fberti · October 10, 2023, 10:21am

Hello Devs,

I have a project that heavily relies on SIMD vectorization/inlining (using something similar to the JUCE SIMDRegister abstraction), including AVX instructions.

Since AVX is not available on older processors or on Apple Silicon, I currently ship different binaries for different architectures. I’d like to evaluate the possibility of refactoring the code so that a single binary dispatches at runtime to the correct implementation, depending on what the running machine supports.

I’m currently trying to solve this on Windows/Visual Studio; if I succeed, I’ll proceed with Xcode.

The problem (please correct me if I’m wrong) is that in Visual Studio I need to specify /arch:avx in order to access the AVX instructions, but this has the side effect of being taken into account by the Auto-Vectorizer, as stated by Microsoft:

The Auto-Vectorizer analyzes loops in your code, and uses the vector registers and instructions on the target computer to execute them, if it can. This can improve the performance of your code. The compiler targets the SSE2, AVX, and AVX2 instructions in Intel or AMD processors, or the NEON instructions on ARM processors, according to the /arch switch.

Now, if I need to dispatch at runtime to the SSE version, but the compiler has internally vectorized a scalar (FPU) loop using AVX because I set /arch:avx to enable AVX support, this clearly won’t work on a machine without AVX.

My question is: how can I enable AVX support at compile time while having the Auto-Vectorizer only vectorize using the minimum supported instruction set (e.g. just SSE in this case)?

Thanks in advance,
Federico

PluginPenguin · October 10, 2023, 11:44am

It’s actually quite easy. With MSVC, you can always call all AVX and SSE intrinsics manually, no matter which /arch flag you passed. For Clang and GCC, there is __attribute__ ((target (arch))), which has to be added to any function that uses an architecture-specific intrinsic, with arch being a string like "avx" or "avx2".

In our in-house vector wrapper we use this macro to make that portable:

github.com

sonible/VCTR/blob/d60bacb088c3121322dcca9dae23be20aaf15c5b/include/vctr/Miscellaneous/CompilerSpecificAttributes.h#L28


      
          
          #define VCTR_TO_STRING(s) #s
          
          #if VCTR_MSVC
          #define VCTR_TARGET(arch)
          #else
          #define VCTR_TARGET(arch) __attribute__ ((target (arch)))
          #endif
          
          // VCTR_FORCEDINLINE enforces full inlining in release builds. In debug builds we want to preserve the possibility
          // to debug, so it does not affect inlining there
          #if VCTR_DEBUG
          #define VCTR_FORCEDINLINE
          #else
          #if VCTR_MSVC
          #define VCTR_FORCEDINLINE __forceinline
          #else

Example usage:

github.com

sonible/VCTR/blob/d60bacb088c3121322dcca9dae23be20aaf15c5b/include/vctr/Expressions/BasicMath/Subtract.h#L131


      
              src.prepareAVXEvaluation();
              singleSIMD.avx = Expression::AVX::broadcast (single);
          }
          
          VCTR_FORCEDINLINE VCTR_TARGET ("avx") AVXRegister<value_type> getAVX (size_t i) const
          requires (archX64 && has::getAVX<SrcType> && Expression::allElementTypesSame && Expression::CommonElement::isRealFloat)
          {
              return Expression::AVX::sub (singleSIMD.avx, src.getAVX (i));
          }
          
          VCTR_FORCEDINLINE VCTR_TARGET ("avx2") AVXRegister<value_type> getAVX (size_t i) const
          requires (archX64 && has::getAVX<SrcType> && Expression::allElementTypesSame && Expression::CommonElement::isInt)
          {
              return Expression::AVX::sub (singleSIMD.avx, src.getAVX (i));
          }
          
          // SSE Implementation
          void prepareSSEEvaluation() const
          requires has::prepareSSEEvaluation<SrcType>
          {
              src.prepareSSEEvaluation();

Of course, when doing things like that, it’s now your job to provide proper dispatching code that only takes the supported code path at runtime.
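For readers who want to see what such dispatching code might look like, here is a minimal sketch. It is not taken from VCTR or JUCE: `EXAMPLE_TARGET`, `addScalar`, `addSSE`, `addAVX`, `cpuHasSSE`, `cpuHasAVX`, `selectAdd` and `add` are hypothetical names chosen for illustration. It assumes an x86 build with GCC or Clang (using `__builtin_cpu_supports`) or with plain MSVC (using `__cpuid` and `_xgetbv`). Clang in a Visual Studio toolchain takes the builtin branch here, which is an untested assumption and may need a different detection routine.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__x86_64__) || defined(_M_X64) || defined(__i386__) || defined(_M_IX86)
  #define EXAMPLE_X86 1
  #include <immintrin.h>
  #if defined(_MSC_VER)
    #include <intrin.h>
  #endif
#else
  #define EXAMPLE_X86 0
#endif

// Hypothetical macro: the target attribute on GCC/Clang, nothing on plain MSVC,
// where intrinsics can be called without a per-function opt-in.
#if defined(__GNUC__) || defined(__clang__)
  #define EXAMPLE_TARGET(arch) __attribute__ ((target (arch)))
#else
  #define EXAMPLE_TARGET(arch)
#endif

// Fallback path, compiled with the baseline flags of this translation unit.
inline void addScalar (const float* a, const float* b, float* out, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}

#if EXAMPLE_X86
// SSE path: 4 floats per iteration, scalar loop for the remainder.
EXAMPLE_TARGET ("sse")
inline void addSSE (const float* a, const float* b, float* out, std::size_t n)
{
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps (out + i, _mm_add_ps (_mm_loadu_ps (a + i), _mm_loadu_ps (b + i)));
    for (; i < n; ++i)
        out[i] = a[i] + b[i];
}

// AVX path: 8 floats per iteration, scalar loop for the remainder.
EXAMPLE_TARGET ("avx")
inline void addAVX (const float* a, const float* b, float* out, std::size_t n)
{
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8)
        _mm256_storeu_ps (out + i, _mm256_add_ps (_mm256_loadu_ps (a + i), _mm256_loadu_ps (b + i)));
    for (; i < n; ++i)
        out[i] = a[i] + b[i];
}

#if defined(_MSC_VER) && ! defined(__clang__)
inline bool cpuHasSSE()
{
    int regs[4];
    __cpuid (regs, 1);
    return (regs[3] & (1 << 25)) != 0;           // CPUID.1:EDX bit 25 = SSE
}

inline bool cpuHasAVX()
{
    int regs[4];
    __cpuid (regs, 1);
    const bool osxsave = (regs[2] & (1 << 27)) != 0;   // OS uses XSAVE/XGETBV
    const bool avx     = (regs[2] & (1 << 28)) != 0;   // CPU implements AVX
    return osxsave && avx && (_xgetbv (0) & 0x6) == 0x6; // OS saves XMM and YMM state
}
#else
inline bool cpuHasSSE() { __builtin_cpu_init(); return __builtin_cpu_supports ("sse") != 0; }
inline bool cpuHasAVX() { __builtin_cpu_init(); return __builtin_cpu_supports ("avx") != 0; }
#endif
#endif // EXAMPLE_X86

using AddFn = void (*) (const float*, const float*, float*, std::size_t);

// Picks the widest implementation the running CPU reports as usable.
inline AddFn selectAdd()
{
#if EXAMPLE_X86
    if (cpuHasAVX()) return addAVX;
    if (cpuHasSSE()) return addSSE;
#endif
    return addScalar;
}

// Public entry point: the selection runs once, later calls reuse the pointer.
inline void add (const float* a, const float* b, float* out, std::size_t n)
{
    assert (a != nullptr && b != nullptr && out != nullptr);
    static const AddFn fn = selectAdd();
    fn (a, b, out, n);
}
```

Callers only ever use `add (a, b, out, n)`. Because the AVX code is confined to one annotated function and the rest of the file is built with baseline flags, the compiler has no license to emit AVX anywhere outside `addAVX` on GCC and Clang. Dispatching once per buffer rather than once per sample keeps the indirect call cheap.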

fberti · October 10, 2023, 1:58pm

Thanks a lot for the reply!

Unfortunately, the solution you suggest is not that easy to implement in my case, as I’m using Clang in both Visual Studio and Xcode.

As you probably already know, the problem is inlining. Everything in my code is wrapped in a “PackedType” class, and its methods need to be inlined to avoid additional overhead. But if I call any of the inlined PackedType methods from a method of another class (which includes having a PackedType as a member variable of another class, since that implicitly calls the default constructor), then I need to annotate that method with __attribute__ ((target (arch))) as well, or it won’t compile. This forces me to refactor a lot of additional code and makes it a mess, as I’m using PackedType in a lot of my math solvers and utility classes.
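The propagation described above can be reproduced with a small sketch. All names here (`PackedFloat8`, `GainSolver`, `squareIfAVX`, the `EXAMPLE_*` macros) are hypothetical stand-ins and not the poster's actual classes. It assumes GCC or Clang on x86 with a runtime that provides `__builtin_cpu_supports`.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

#define EXAMPLE_AVX_TARGET __attribute__ ((target ("avx")))
#define EXAMPLE_INLINE     inline __attribute__ ((always_inline))

// Hypothetical stand-in for a PackedType-like wrapper around one AVX register.
struct PackedFloat8
{
    __m256 value;

    static EXAMPLE_INLINE EXAMPLE_AVX_TARGET PackedFloat8 load (const float* p)
    {
        return PackedFloat8 { _mm256_loadu_ps (p) };
    }

    EXAMPLE_INLINE EXAMPLE_AVX_TARGET void store (float* p) const
    {
        _mm256_storeu_ps (p, value);
    }

    EXAMPLE_INLINE EXAMPLE_AVX_TARGET PackedFloat8 operator* (const PackedFloat8& other) const
    {
        return PackedFloat8 { _mm256_mul_ps (value, other.value) };
    }
};

// Hypothetical solver class that uses the wrapper.
struct GainSolver
{
    // The wrapper methods are force-inlined AVX code, so the method they are
    // inlined into has to carry the same target attribute.
    EXAMPLE_AVX_TARGET void squareAVX (const float* in, float* out) const
    {
        const PackedFloat8 x = PackedFloat8::load (in);
        (x * x).store (out);
    }

    // Dropping the attribute is rejected by GCC and Clang in a file built
    // without AVX flags, because the inlined callee needs a feature that the
    // caller is not compiled for:
    //
    // void squareBroken (const float* in, float* out) const
    // {
    //     PackedFloat8::load (in).store (out);
    // }
};

// Unannotated entry point: it may call (but not inline) the AVX method, and
// does so only when the CPU reports AVX. Returns false if nothing was written.
inline bool squareIfAVX (const float* in, float* out)
{
    assert (in != nullptr && out != nullptr);
    __builtin_cpu_init();
    if (! __builtin_cpu_supports ("avx"))
        return false;

    GainSolver {}.squareAVX (in, out);
    return true;
}
#else
// Non-x86 builds: there is no AVX path to take.
inline bool squareIfAVX (const float*, float*) { return false; }
#endif
```

The attribute has to follow the inlined code upward until it reaches a function that is called rather than inlined, which is `squareAVX` in this sketch. Every class method that touches the wrapper sits below that boundary, which is why the annotation spreads through solver and utility code.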
