How to support different SIMD architectures?

Hi,
I would like to use SIMD. It should work on different architectures. Like SSE, AVX2 and AVX512.
If I understand correctly from this forum, JUCE checks the architecture at compile time. But I want my plugin to run on different kinds of architectures, depending on the machine the user has. Can that be done?

The SystemStats class has checks for those types of things. It pulls from both compile-time defines and run-time cpuid checks, etc.

Yes, I can check at run-time which code to run. But the compiler also needs to know how to compile each version of the implementation. How do I do that?

Firstly, it’s probably not worth it. Every x86 CPU released in roughly the last decade supports AVX2, while only workstation/server-class CPUs support AVX512.

If you do want to support AVX512 you just need to use the AVX512 intrinsics in immintrin.h (MSVC will compile them regardless of /arch; GCC and Clang need the matching -m flag or a target attribute on the function). There are a few variants of AVX512, so you need to be careful which one you use.

If you want to use JUCE’s SIMD helper classes you can compile separate .cpp files for each variant you care about and provide the appropriate -m or /arch compiler options (check the compiler man page/docs for exact syntax).
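
That per-file approach might look something like this on the command line (file names are hypothetical; check your compiler’s docs for the exact spellings):

```shell
# One translation unit per instruction set, then link them all together.
g++ -c -O2 -msse2    dsp_sse2.cpp   -o dsp_sse2.o
g++ -c -O2 -mavx2    dsp_avx2.cpp   -o dsp_avx2.o
g++ -c -O2 -mavx512f dsp_avx512.cpp -o dsp_avx512.o
g++ main.cpp dsp_sse2.o dsp_avx2.o dsp_avx512.o -o plugin
```

With MSVC the equivalents are /arch:AVX2 and /arch:AVX512; on x64 there is no flag for SSE2 because it’s part of the baseline.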

If you want to research this more, it’s called “function multiversioning” and there are some pitfalls. It’s expensive to dispatch on each method, so you should really only separately optimize big chunks of work to use different SIMD instructions, and when you do that, you’ll need to benchmark a bunch by hand to find out if it’s actually faster.

Since this is only meaningful for performance it’s kind of a waste of time to think about it before having a baseline to benchmark against.

Can you explain this a bit more? I do not understand what you mean by this.
Are these compiler options written in a code file?

I did try this, but as soon as two translation units (.cpp files) tried to link to the same standard library it all fell apart: the linker optimised away the non-AVX copy of the library, which caused the non-AVX code path in the plugin to call into the AVX version of the library, and that crashes on CPUs without AVX.
The only reliable method that I’ve found to support more than one architecture in the same plugin is to firewall the AVX code inside its own dll separate from the rest of the plugin.

I tried this a bit after I wrote my earlier reply, and I hadn’t thought it through: it’s a violation of the ODR. A workaround is to use __attribute__((always_inline)) on all the callees, write a wrapper that calls those callees, and use add_library(wrapper_avx OBJECT wrapper.cpp) and target_compile_options(wrapper_avx PUBLIC -mavx) in CMake (with variants for SSE, AVX512, and -mno-avx -mno-sse3 etc. for the scalar fallback). It will work as long as you can force the callees to be inlined. Without forced inlining you get silent ODR violations and one implementation gets picked by the linker.

If you can’t force that (I didn’t try seriously with juce::dsp::SIMDRegister, but I rather doubt you can make it work), I think there’s an incantation with linker arguments to rename symbols, or a linker script, but I didn’t try it. I thought it would be easier at first, given that you can compile the object code for the SIMD class multiple times with different arch flags and link against a wrapper that’s only partially implemented until you link all the objects together, but I was wrong: that’s not possible without some additional, non-portable hackery.
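
The symbol-renaming idea can be sketched with GNU binutils, though I haven’t battle-tested it (file names are hypothetical):

```shell
# Compile the same TU twice with different arch flags...
g++ -c -O2 -mavx2 simd_impl.cpp -o simd_impl_avx2.o
g++ -c -O2        simd_impl.cpp -o simd_impl_scalar.o
# ...then prefix every symbol in the AVX2 object so the copies can coexist.
objcopy --prefix-symbols=avx2_ simd_impl_avx2.o
```

Caveat: --prefix-symbols renames undefined references too (e.g. calls into libc), so the prefixed object can no longer link against the normal library names without further fix-ups — exactly the kind of non-portable hackery I mean.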

Regardless, trying to convert library code that uses compile time specialization into selecting at runtime with no code changes is pretty hard, and I don’t think it’s worth it for this use case in the first place.

I’ve used the technique of compiling the same module multiple times with different arch arguments to the compiler. It does work; you then just need to dispatch to the proper variant based on cpuid. I used this for many functions, typically processing between 8x8 and 64x64 pixels. It would typically be 400–800% faster than plain C without SIMD, verified through micro-benchmarks on a wide variety of processors, including different cache-pattern simulations. So it is definitely worth it if you have a bottleneck in those routines.


Do you have a quick example? I tried this naively on Linux and the linker selected the wrong implementations without forced inlining.

Did you create a separate .cpp file for each version? Each .cpp file can pull in the actual implementation, and you can use preprocessor macros to vary the function names. Doing this, each one is a fully separate module as far as the compiler is concerned.

  • routines_avx.cpp
    • #define ROUTINE_TYPE AVX
    • #include "routines.inl"
  • routines_sse.cpp
    • #define ROUTINE_TYPE SSE
    • #include "routines.inl"
  • routines.inl
    • extern "C" void fast_routine_##ROUTINE_TYPE(…) { /* implementation */ }

Something like that.

Would this work for MSVC compiler (using Visual Studio Code) too?

I used one source file, but it shouldn’t matter, since this is the linker’s behaviour and I’m generating multiple object files.

// header.h
#pragma once
#include <stdio.h>
#if defined FOO
  void f() { printf("foo\n"); }
#elif defined BAR
  void f() { printf("bar\n"); }
#endif

And try to wrap it like so

extern void foo();
extern void bar();

#if defined FOO
  #include "header.h"
  void foo() { f(); }
#elif defined BAR
  #include "header.h"
  void bar() { f(); }
#else
  int main() {
    foo();
    bar();
  }
#endif

Compiling it manually

cc main.cpp -DFOO -c -o foo.o
cc main.cpp -DBAR -c -o bar.o
cc main.cpp foo.o bar.o 

Fails with the expected error:

/usr/bin/ld: bar.o: in function `f()':
main.cpp:(.text+0x0): multiple definition of `f()'; foo.o:main.cpp:(.text+0x0): first defined here
collect2: error: ld returned 1 exit status

The quickest way to resolve this is to mark the functions static, and that works fine, but it’s not appropriate for this case (using JUCE’s SIMD helpers means using classes, not file-static free functions).

If you change the code to do something like this

// header.h
#pragma once
#include <stdio.h>
class C {
public:
  C() = default;
  ~C() = default;
#if defined FOO
  void f() { printf("foo\n"); }
#elif defined BAR
  void f() { printf("bar\n"); }
#endif
};

// main.cpp
extern void foo();
extern void bar();

#if defined FOO
  #include "header.h"
  void foo() { C c; c.f(); }
#elif defined BAR
  #include "header.h"
  void bar() { C c; c.f(); }
#else
  int main() {
    foo();
    bar();
  }
#endif

And compile the exact same way, you don’t get a linker error but the program prints:

foo
foo

Because the implementation of C::f that gets linked is the first one the linker finds, not the one that should have been called by bar(). Member functions defined inside the class body are implicitly inline, so the duplicate definitions are silently merged instead of producing a multiple-definition error. Splitting into separate source files doesn’t fix this. Of course, renaming the functions based on the defines would be fine, but then it’s not the same problem.

Look up “dynamic dispatch” for different strategies on leveraging different instruction sets at runtime. There are some other options besides those that Jeff and Caustik mention.

There are other challenges to solve besides that though. If you want to dynamically support different channel counts, then you need to consider different interleaving strategies and how you instantiate all the different objects to support each scenario. The avec library does some of this but doesn’t implement dynamic dispatch.

I’m actually writing my own library to do this within JUCE:

These classes help manage SIMD processing in a paradigm where we favour runtime choices over compile time choices. Platform architecture is chosen at compile time (e.g. x86, x64, ARM). Optimal instruction sets are determined at run time according to CPU & code capabilities. This allows a single binary per platform architecture. e.g. for x64, we detect AVX512, AVX & SSE and use the best available.

My intent is to release this to the community once I’ve got it into shape, but it’s a bit of an on-again, off-again thing 🙂 So far, I’ve got the SIMD management and interleaving all working, but I need to sort out the dynamic dispatch and a recommended build process.


You can’t just merge two different versions of the same symbol into an executable. A symbol ultimately just translates to a memory address, and each SIMD implementation has its own memory address, so you need multiple symbols (function names).

If you are concerned about the inefficiency of making a function call each time, and so you want the compiler to treat the functions as inline, you’ll need to make the caller split by SIMD type instead. There isn’t going to be any getting around this.

You can’t just merge two different versions of the same symbol into an executable

I’ve shown above how you can and the problems with doing it naively. You need to tell the linker they are different, or force the compiler to not emit the symbols at all.

If you are concerned about the inefficiency of making a function call each time, and so you want the compiler to treat the functions as inline

I’m concerned with OP’s original question, which is how to convert compile time specialization into runtime checks. I thought it would be possible using some build time hacks, but it isn’t without a linker script, static functions, or forced inlining.

For example with forced inlining, it’s not about performance, but guaranteeing that the symbols don’t show up in the object files before being linked.

I don’t really think any of this is a good idea, which is why in my first comment I said it’s not really worth it to go down this road.

I don’t really think any of this is a good idea, which is why in my first comment I said it’s not really worth it to go down this road.

It depends on your use case. If you’re writing audio plugins and most of your users have consumer PCs, then it may not be worth the bother. If you’re working on a synth with heaps of voices, or a host, or if you support power users with Xeon CPUs, you may have a different view!

1 Like

I have the feeling all these solutions aim to use one piece of source code to generate the different versions automatically. That is not my goal.

Yes, I planned to write about 3 separate versions. One for SSE2, one for AVX2 and (probably) one for AVX512.

I can check at run-time which version to call. I will not have a lot of functions that need to be dispatched. Just one or two complete algorithms. And I will use compiler intrinsics for it.

But I assume the compiler needs to know which architecture it is compiling for, so that it can use the correct SIMD registers and do some optimization. And I do not know how to do that. There is a compiler option to specify the architecture for the whole application, but I have pieces of code that would need different compiler parameters, right? Can I specify this per file, or per piece of code?

You can specify it per file; the files should link together just fine as long as they have different function names. I’ve not used any per-piece-of-code techniques; some compilers have pragmas or function attributes for that sort of thing, but I’ve never researched it deeply.