Compiling and distributing plugins/apps for several CPU architectures

Hey everyone!
I talked to @Fabian at ADC’17 about how to support several CPU architectures and optimize for each of them. Especially when using SIMD optimizations, not every architecture has the same SIMD-register width. Setting different compile flags makes it easy to compile for a specific architecture, but I was not sure how to distribute only one file (e.g. a .dll or .vst) which holds all of these optimizations for each architecture.

So @Fabian thought we should discuss this here in the forum to address a bigger audience, as this would be interesting for almost everyone.

Well, a brief summary of one approach: template your audio-processing routines on the widest instruction set supported. Then use some compiler macros or other build-system shenanigans to compile a subset of source files multiple times with different compiler flags, passing a separate template parameter for each. Finally, use a run-time switch, as far away from your audio-processing code as possible, to pick the correct template type.
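A minimal sketch of that idea, with hypothetical ISA tag types and a stubbed-out gain routine standing in for real intrinsics-based code (`Scalar`/`Sse`/`Avx`, `applyGain`, and `DetectedIsa` are all illustrative names, not from any library):

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical tags describing the widest instruction set available.
// In a real plug-in each would map to a differently-compiled code path.
struct Scalar { static constexpr std::size_t width = 1; };
struct Sse    { static constexpr std::size_t width = 4; };
struct Avx    { static constexpr std::size_t width = 8; };

// The audio routine is templated on the ISA tag. Here it just applies
// gain; a real implementation would use intrinsics chosen by the tag
// (e.g. processing Isa::width samples per iteration).
template <typename Isa>
void applyGain(float* data, std::size_t n, float gain)
{
    for (std::size_t i = 0; i < n; ++i)
        data[i] *= gain;
}

// The run-time switch, kept far away from the audio code: the CPU is
// detected once and the matching instantiation is picked here.
enum class DetectedIsa { scalar, sse, avx };

void processBlock(float* data, std::size_t n, float gain, DetectedIsa isa)
{
    switch (isa)
    {
        case DetectedIsa::avx: applyGain<Avx>(data, n, gain);    break;
        case DetectedIsa::sse: applyGain<Sse>(data, n, gain);    break;
        default:               applyGain<Scalar>(data, n, gain); break;
    }
}
```

The important property is that all three instantiations exist in the binary, and only the dispatcher needs to know which one the current CPU can run.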

+1 for this; unfortunately, the described approach - compiling source files multiple times with different parameters - is the only “official” way to do something like this.

I use a similar pattern, templating DSP and vector-graphics routines on “isa” types that provide compile-time functions and types matching a specific instruction-set permutation. At the entry points, I use a sort of visitor pattern to invoke the correct template based on the detected features.
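That visitor-style dispatch could look roughly like this; the `SseIsa`/`AvxIsa` tags and the `dispatch` helper are hypothetical stand-ins for a real feature-detecting dispatcher:

```cpp
#include <cassert>
#include <string>

// Hypothetical "isa" tag types, each describing one instruction-set
// permutation via compile-time members.
struct SseIsa { static constexpr const char* name = "sse"; };
struct AvxIsa { static constexpr const char* name = "avx"; };

// A routine templated on the isa type. Real code would call the tag's
// intrinsic aliases; here it just reports which path was taken.
struct RenderRoutine
{
    template <typename Isa>
    std::string operator()(Isa) const { return Isa::name; }
};

// Visitor-style dispatcher at the entry point: features are detected
// once (stubbed as a bool here) and the functor is invoked with the
// matching isa type, so the templated body is resolved at compile time.
template <typename Functor>
std::string dispatch(Functor&& f, bool hasAvx)
{
    return hasAvx ? f(AvxIsa{}) : f(SseIsa{});
}
```

The nice part is that any routine written as such a functor gets all its instruction-set variants instantiated automatically, without per-routine dispatch boilerplate.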

My system avoids any build-system setup, and the compiler basically generates optimal code for all permutations of FMA-aware processors (the ISA provides templated aliases of intrinsics for any T), AVX, SSE etc. without having to split the code into separate files, which makes the process much less painful.
It did require a custom-made SIMD library, though, and it is at the mercy of the compiler: some, like Clang, really don’t work well with mixing different vector instruction-set targets inside the same translation unit, so it needs some massaging plus a few attributes here and there.

Here’s an example of such a templated routine:
https://bitbucket.org/Mayae/signalizer/src/master/Source/Vectorscope/VectorscopeRendering.cpp#VectorscopeRendering.cpp-456

Needless to say, this also “bloats” your binary with one copy of the code for every combination of CPU features you want to support.

+1 to what Tom wrote. Most compilers support some form of enabling/disabling machine-target options for specific functions, but support for this is limited and sometimes a bit unpredictable (what happens when these functions are inlined, for example).
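For reference, the per-function mechanism being alluded to is the GCC/Clang `target` attribute (MSVC has no direct equivalent). A hedged sketch, guarded so it only applies on x86, using "sse2" since that is the x86-64 baseline and always safe to call; higher targets like "avx2" work the same way but must be gated behind a run-time CPU check:

```cpp
#include <cassert>

// GCC and Clang allow overriding the machine target for a single
// function via __attribute__((target(...))). Guarded here so the
// attribute is only emitted when compiling for x86.
#if defined(__x86_64__) || defined(__i386__)
__attribute__((target("sse2")))
#endif
float dotProduct(const float* a, const float* b, int n)
{
    float acc = 0.0f;
    for (int i = 0; i < n; ++i)   // the compiler may auto-vectorise this
        acc += a[i] * b[i];
    return acc;
}

// The caveat from the post applies: if such a function is inlined into
// a caller with a different target, behaviour differs between compilers,
// which is one reason separate translation units tend to be more robust.
```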

I find it more reliable to simply compile the processing part of the plug-in (basically everything reachable from processBlock) several times with different target flags. Each time you compile, you also change a template parameter of your top-level entry class (for example via a pre-processor define passed on the compiler command line) so that you don’t get any symbol clashes when linking everything together. You then figure out which CPU your plug-in is running on when prepareToPlay is called and instantiate the processing part with the correct template parameter.
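A sketch of what one such multiply-compiled translation unit could look like. The `PROCESSOR_ISA` define, the `IsaSse2`/`IsaAvx2` tags, and the `Processor` class are all hypothetical names chosen for illustration; the build system would compile this file once per target, e.g. with `-msse2 -DPROCESSOR_ISA=IsaSse2` and again with `-mavx2 -DPROCESSOR_ISA=IsaAvx2`:

```cpp
#include <cassert>

// Hypothetical ISA tags; each compilation of this file targets one of them.
struct IsaSse2 { static constexpr int simdWidth = 4; };
struct IsaAvx2 { static constexpr int simdWidth = 8; };

// The compiler command line supplies -DPROCESSOR_ISA=... per compile,
// so each object file contributes a distinct instantiation and there
// are no symbol clashes at link time.
#ifndef PROCESSOR_ISA
#define PROCESSOR_ISA IsaSse2
#endif

// Top-level entry class templated on the ISA, as described above.
template <typename Isa>
struct Processor
{
    int blockAlignment() const { return Isa::simdWidth; }

    void processBlock(float* data, int n) const
    {
        for (int i = 0; i < n; ++i)   // real code: intrinsics for Isa
            data[i] *= 0.5f;
    }
};

// Explicit instantiation: the one variant this compilation emits.
template struct Processor<PROCESSOR_ISA>;
```

At run time, the detection done around prepareToPlay then constructs `Processor<IsaAvx2>` or `Processor<IsaSse2>` behind a common interface and everything downstream stays oblivious.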