+1 for this. Unfortunately, the described approach (compiling source files multiple times with different parameters) is the only “official” way to do something like this.
I use a similar pattern, templating DSP and vector graphics routines with “isa” types that provide compile-time functions and types matching a specific instruction set permutation. At entry points, I use a sort of visitor pattern to invoke the correct template instantiation based on the detected CPU features.
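A minimal sketch of this pattern, not the actual library: every name here (`scalar_isa`, `sse2_isa`, `scale_offset`, `dispatch`, `detect_isa`) is hypothetical and chosen for illustration. It assumes GCC or Clang for the `__builtin_cpu_supports` check and deliberately stops at SSE2, which is baseline on x86-64, so no per-function target attributes are needed. On other platforms it falls back to the scalar path.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Hypothetical "isa" types: each exposes a vector type, its lane count,
// and the handful of operations a kernel needs.
struct scalar_isa {
    using vec = float;
    static constexpr std::size_t lanes = 1;
    static vec load(const float* p) { return *p; }
    static void store(float* p, vec v) { *p = v; }
    static vec broadcast(float x) { return x; }
    static vec mul_add(vec a, vec b, vec c) { return a * b + c; }
};

#if defined(__SSE2__)
struct sse2_isa {
    using vec = __m128;
    static constexpr std::size_t lanes = 4;
    static vec load(const float* p) { return _mm_loadu_ps(p); }
    static void store(float* p, vec v) { _mm_storeu_ps(p, v); }
    static vec broadcast(float x) { return _mm_set1_ps(x); }
    // SSE2 has no fused multiply-add, so this is a multiply followed by an add.
    static vec mul_add(vec a, vec b, vec c) { return _mm_add_ps(_mm_mul_ps(a, b), c); }
};
#endif

// The routine is written once against the isa interface:
// out[i] = in[i] * gain + offset.
template <typename Isa>
void scale_offset(const float* in, float* out, std::size_t n, float gain, float offset) {
    const auto g = Isa::broadcast(gain);
    const auto o = Isa::broadcast(offset);
    std::size_t i = 0;
    for (; i + Isa::lanes <= n; i += Isa::lanes)
        Isa::store(out + i, Isa::mul_add(Isa::load(in + i), g, o));
    for (; i < n; ++i)  // scalar tail for the leftover elements
        out[i] = in[i] * gain + offset;
}

enum class isa_level { scalar, sse2 };

// Runtime feature detection. With SSE2 already enabled at compile time the
// check is redundant; it only shows where wider sets would be probed.
inline isa_level detect_isa() {
#if defined(__SSE2__) && (defined(__GNUC__) || defined(__clang__))
    if (__builtin_cpu_supports("sse2")) return isa_level::sse2;
#endif
    return isa_level::scalar;
}

// Visitor-style dispatch: hand the matching isa type to a generic callable.
template <typename Visitor>
void dispatch(isa_level level, Visitor&& v) {
    switch (level) {
#if defined(__SSE2__)
    case isa_level::sse2: v(sse2_isa{}); return;
#endif
    default: v(scalar_isa{}); return;
    }
}

// Entry point: picks the instantiation once, then runs the templated routine.
inline void scale_offset_dispatch(const std::vector<float>& in, std::vector<float>& out,
                                  float gain, float offset) {
    assert(in.size() == out.size());
    dispatch(detect_isa(), [&](auto isa) {
        scale_offset<decltype(isa)>(in.data(), out.data(), in.size(), gain, offset);
    });
}
```

Calling `scale_offset_dispatch(in, out, 2.0f, 1.0f)` fills `out` with `2 * in[i] + 1` through whichever instantiation the detection selects. Adding an AVX or FMA variant would mean one more isa struct and one more `case`, which is also where the compiler-specific target attributes come into play.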
My system avoids any extra build-system setup: the compiler basically generates optimal code for every permutation of FMA, AVX, SSE, etc. (the ISA type provides templated aliases of the intrinsics for any T), without having to split the code up into separate files and such, which makes the process much less painful.
But it required a custom-made SIMD library, and it is at the mercy of the compiler. Some, like Clang, really don’t work well with mixing different vector instruction set targets inside the same translation unit, so it requires some massaging plus some attributes here and there.
Here’s an example of such a templated routine:
Needless to say, this also “bloats” your binary, since the code is duplicated for every combination of CPU features you want to support.