SIMDRegister: is it worth it?

Hello, I was curious whether anyone uses SIMDRegister for audio processing. Do you have to interleave the samples? Is the performance worth converting an entire project over? Does the data have to be in a size that’s a power of 2?

If anyone has some guidance on how to bring a big project’s processing load down using SIMD, that would really help… or is this not the path?

Thanks!


It depends on what you do. If you only play back audio (an audio player, a sampler…) and/or are heavily memory-bound (e.g. audio files, big wavetables), you’re better off improving memory access patterns to avoid as many cache misses as possible. On the other hand, anything that can be parallelized and involves computation (even if it’s just a few multiplications and accumulations) can take advantage of it, especially if it happens in every voice or across a lot of samples.

If you have a big project you want to optimize, first profile what takes the most CPU and optimize that. From there you can optimize as much as you want as long as you have time for it and you consider the time spent worth it.

About SIMDRegister: if it’s an iOS/Android app, I’d suggest going for NEON assembly instead of SIMD intrinsics, as compilers still seem to do a pretty bad job with NEON intrinsics while they do a good job with AVX/SSE.

Hi! If you have tight inner loops that do a lot of number crunching, SIMDRegister offers a fairly simple way to gain some improvements in a relatively cross-platform-safe manner. I’ve recently achieved some speedup for direct convolution. Note that you need to profile this in a release build and make sure that the memory you’re accessing is properly aligned. Multiply-adds with block sizes of 2^n like 32 or 64 are well suited for trying it out. It won’t give you a magical performance increase by orders of magnitude, but it’s definitely worth a shot. The more you can do while staying in the SIMD domain, the better. If you need to copy between SIMD and non-SIMD registers a lot, the overhead will probably outweigh the performance gains.
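To make the multiply-add idea concrete, here’s a minimal sketch using juce::dsp::SIMDRegister. The function name is mine, and it assumes both pointers are SIMD-aligned and that numSamples is a multiple of the register width (real code needs a scalar tail for the remainder):

```cpp
#include <juce_dsp/juce_dsp.h>

// out[i] += in[i] * gain, one full SIMD register at a time.
// Assumes SIMD-aligned pointers and numSamples being a multiple
// of the register width.
void multiplyAdd (float* out, const float* in, float gain, int numSamples)
{
    using Reg = juce::dsp::SIMDRegister<float>;
    constexpr auto width = (int) Reg::SIMDNumElements;

    const auto g = Reg::expand (gain); // broadcast gain to all lanes

    for (int i = 0; i < numSamples; i += width)
    {
        auto acc = Reg::fromRawArray (out + i);      // aligned load
        acc += Reg::fromRawArray (in + i) * g;       // fused multiply-add
        acc.copyToRawArray (out + i);                // aligned store
    }
}
```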

Author of SIMDRegister here: I’d just like to add that before embarking on SIMDRegister, check that your compiler isn’t already auto-vectorizing the tight loops in your code. In my experience, the compiler does a good job auto-vectorizing even moderately complex loops (but see my note at the end of this post).

To ensure that the compiler is allowed to even auto-vectorize your code, be sure that:

  1. You are building in release mode (i.e. at least optimisation level -O3)
  2. You have “Relax IEEE compliance” enabled in the Projucer. This is absolutely crucial, especially on arm/arm64, as SIMD instructions do not have the same denormal/round-to-zero (arm) and/or multiply-accumulate rounding (x86/arm) behaviour as normal IEEE compliant floating point instructions. Hence, the compiler is not allowed to replace your loops with SIMD instructions if it needs to ensure IEEE compliance.

    Note if you are compiling with -Ofast then “Relax IEEE compliance” will be enabled regardless of the Projucer setting.
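As an aside, if you’re compiling from the command line, clang can report which loops it vectorized (and which it couldn’t, and why). A sketch of such an invocation (the file name is made up):

```
clang++ -O3 -ffast-math -Rpass=loop-vectorize -Rpass-missed=loop-vectorize -c MyProcessor.cpp
```

-Rpass=loop-vectorize prints a remark for each vectorized loop, and -Rpass-missed=loop-vectorize tells you where vectorization failed; GCC has a similar -fopt-info-vec option.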

To check if your code is being auto-vectorized, you need to look at the assembly. I recommend doing this, even when using SIMDRegister, as you will often have unexpected results.

I highly recommend godbolt.org for this: it has the advantage that you can change the compiler, compiler version and compiler options on the fly, and it neatly colour-highlights which parts of your source code correspond to which assembly.
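If you want something small to paste in first, a plain gain loop is a good litmus test. With -O3 and relaxed IEEE maths you should see packed instructions in the hot loop (e.g. mulps/vmulps on x86, or fmul on a v0.4s-style vector register on arm64):

```cpp
// Trivially vectorizable: compile with -O3 -ffast-math and look for
// packed multiplies in the generated assembly.
void applyGain (float* buffer, int numSamples, float gain)
{
    for (int i = 0; i < numSamples; ++i)
        buffer[i] *= gain;
}
```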

If your project is too big to fit into godbolt then you can also use Xcode’s Assembly view:

[screenshot: Xcode’s Assembly view]
Unfortunately, it doesn’t do any colour highlighting. Update: also see the post below on strange behaviour when trying to view assembly in recent Xcode versions.

What you will often see is that the compiler generates two versions of your code: one uses SIMD and the other doesn’t. This is because the compiler does not know if your audio buffer pointers are SIMD-aligned. Hence, the compiler creates a non-SIMD “preamble” that runs until the buffer pointers are SIMD-aligned, then the core of the algorithm (which uses SIMD), and then a non-SIMD epilogue to finish up any remaining elements that didn’t fit into a full SIMD register.

If your loop involves multiple buffer pointers then, depending on your exact algorithm, the compiler may need to create multiple versions of the preambles/epilogues (e.g. only the first buffer is SIMD-aligned but the second isn’t, etc.). If this gets too complicated, the compiler will give up and just emit two completely separate versions of your code (one with SIMD and the other without).

To avoid this, you can use C++20’s std::assume_aligned (earlier compilers may have __builtin_assume_aligned) to tell the compiler that you know the buffer pointers will be SIMD-aligned. This is recommended even for simple loops, as it avoids at least one conditional (the alignment check) and has the extra benefit of reducing your code size by getting rid of the preambles/epilogues.
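A minimal sketch of what that looks like (the 16-byte alignment value is an assumption matching SSE/NEON register sizes; use 32 for AVX):

```cpp
#include <memory> // std::assume_aligned (C++20)

void applyGainAligned (float* buffer, int numSamples, float gain)
{
    // Promise the compiler the pointer is 16-byte aligned, so it can
    // skip the alignment preamble. If the pointer is not actually
    // aligned, the behaviour is undefined.
    float* aligned = std::assume_aligned<16> (buffer);

    for (int i = 0; i < numSamples; ++i)
        aligned[i] *= gain;
}
```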

As mentioned, the compiler is pretty good at auto-vectorizing even moderately complex loops. However, here are a few examples, where I have seen the compiler fail at auto-vectorizing:

  1. The compiler needs to follow the “as-if” rule when optimizing: it may transform your code into something completely different (by re-ordering loops, for example), but the outcome of your program must be the same “as if” the compiler had done no optimizations (unless you’ve written undefined behaviour).

    This means, however, that certain transformations are off-limits to the compiler - for example, heap memory allocations - since allocating memory is an observably different program outcome than not allocating it.

    For example, a compiler is not allowed to auto-vectorize a direct convolution. This is because, as you move through your array, most of the time the kernel and the array will not be SIMD-aligned with respect to each other. However, by storing N copies of your kernel (where N is the number of elements in a SIMD vector), with each copy shifted by one element, you can always find a kernel copy that is SIMD-aligned with your input vector. Yet even a hypothetical god compiler would not be allowed to do this transformation, as it would need to allocate memory for the extra copies of your kernel. Here, you would need to program the convolution by hand using SIMDRegister.
  2. Conditionals in your loop: as long as you only have simple ternary-like statements, the compiler will replace them with SIMD masking tricks (i.e. no branching). Funnily, I’ve often seen MSVC use SIMD masking code for a ternary statement (i.e. using the ? operator) and not use it when the same code is written with an if statement (see the example after this list). If the conditional gets even slightly too complex, compilers will usually fall back to traditional branching (sometimes inhibiting auto-vectorization), as it’s hard for a compiler to reason about the trade-off between computing both result values and selecting via bit-mask vs. computing only the required result via branching - the performance of the latter depends highly on how often the condition is true (because of branch predictors). Hence, if you know of a way to convert your conditional into a statement with bitmasks and no branch, and the compiler isn’t doing it for you, then you will likely need to write it manually via SIMDRegister.
  3. Complex templating and template meta-programming (like expression templates) somehow confuse the compiler. I’m not entirely sure why, but I’ve seen this a lot.
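To illustrate point 2, here’s a sketch of the kind of ternary that typically maps to a SIMD compare-and-select, next to the if-based version that may not get the same treatment (function names are mine; the exact behaviour depends on the compiler):

```cpp
// Often auto-vectorized: compiles to a SIMD compare + blend/select.
void halfWaveRectify (float* out, const float* in, int numSamples)
{
    for (int i = 0; i < numSamples; ++i)
        out[i] = (in[i] > 0.0f) ? in[i] : 0.0f;
}

// Same semantics written with an if; some compilers fall back to a
// branch here, which can inhibit vectorization.
void halfWaveRectifyIf (float* out, const float* in, int numSamples)
{
    for (int i = 0; i < numSamples; ++i)
    {
        if (in[i] > 0.0f)
            out[i] = in[i];
        else
            out[i] = 0.0f;
    }
}
```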

Hi, thanks for chiming in! Good points - I didn’t know about assume_aligned. In my SIMDRegister convolution implementation, I actually implemented alignment checks (and some other conditions) in a place where they don’t affect performance, and then select either the non-SIMD or the SIMD implementation to use. I’m always a bit reluctant to embrace the newest C++ standard; my projects are currently on C++17.

If I find some more time to spend on this I’ll definitely check out the generated assembly. Compiler Explorer is great, but often I need to see what’s going on in my production code, which is fairly large.
It’s very interesting to observe how minor changes can lead to massive differences in the generated assembly. I recently tried to figure out the best way to do “max”, “abs”, “clamp” and such things (results were a little inconclusive, with a tendency not to prefer the STL). :wink:
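For anyone wanting to reproduce that kind of comparison, here are two clamp variants worth dropping into Compiler Explorer (function names are mine). With optimizations on, both usually compile down to packed min/max instructions, but details like std::clamp taking and returning references can change the codegen:

```cpp
#include <algorithm>

// STL version: std::clamp works with const references internally.
void clampStl (float* buffer, int numSamples)
{
    for (int i = 0; i < numSamples; ++i)
        buffer[i] = std::clamp (buffer[i], -1.0f, 1.0f);
}

// Manual version: explicit min/max composition.
void clampManual (float* buffer, int numSamples)
{
    for (int i = 0; i < numSamples; ++i)
        buffer[i] = std::min (std::max (buffer[i], -1.0f), 1.0f);
}
```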

If auto-vectorisation can yield good results then I’d definitely prefer it over a manual solution using SIMDRegister; usually the compiler is smarter than I am when it comes to low-level things like these. Writing the code in a way that provides the compiler with the information and guarantees it needs is sometimes a challenge of its own.


Just a quick update: it seems that Xcode’s behaviour when selecting “Assembly” as seen in the image below is a bit strange with recent Xcode versions (confirmed in Xcode 13 and 14).

[screenshot: selecting the Assembly view in Xcode]
If the current build configuration is Debug then using the above feature will show assembly for the target processor. If the current configuration is Release then the above feature will show LLVM IR - which is not useful for finding out if the compiler did auto-vectorization or not.

Strangely, this does not seem to depend on the build settings of the current configuration. In fact, I made sure that all build settings of my debug and release builds were identical. I would expect link-time optimization to influence this, but it doesn’t.

So, to check if your code is being auto-vectorized in Xcode: ensure that the debug build settings that affect auto-vectorization (i.e. optimization level, IEEE math compliance) are the same as the release build’s (e.g. -Ofast), make sure that the current build configuration is Debug, and then use the above feature.

Does anybody have any other ideas what could be affecting this?


Thanks for the response - very detailed and extremely insightful. I’d also like to add that I’m a fan of your work: farbot helped me learn the concept of a mutex as well as scoping and locking with JUCE. It’s a really cool repo!

I am in the middle of a software crunch, 20 days out from releasing a two-year project with three 3D oscillators and three samplers with granular reconstruction. I have seen great success in terms of CPU optimization with vector operations and was curious about translating that to signal-processing functions. My goal was essentially to get the performance improvement you see when copying a buffer, but for calculations across the entire synthesizer.

Your comment saved me time. The first thing I was going to do was to template every function and then use your SIMD wrapper as the argument. However, I won’t be able to test this hypothesis until after this launch… but that seems like a no-go. I also know from my small-scale tests comparing floats with SIMD variables that the increase was negligible in a loop of 10,000 iterations. I typed a longer response, then reread your comment a few times and realized we were saying the same thing: you had told me to write my functions out manually to guarantee results. Which leads me to believe that processing samples with SIMD variables must be done 2 or 4 at a time to see increased performance.

I have recently run small-scale tests with loops using smaller variable types such as char, short and int within while loops and for loops, hoping to increase performance. It seems that for normal loops the prefix operator ++i is the most stable and marginally faster than most alternatives, with ++i in a while loop being sometimes faster yet less consistent. I imagine if this were 10 years ago, char and short would be the clear favorites in these tests. However, this isn’t the case, and I assume it’s because of compiler optimizations… which also seems to be the case with the template tests, as well as my initial question.

It seems the bottleneck for predictability will be the loop itself, or am I wrong?

I think that to get the most out of the SIMD variable class you would have to process two or four values per iteration rather than one. So a half- or quarter-length loop, with an exit for the samples left over when the count isn’t a multiple of 2/4.
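Something like this rough sketch is what I have in mind, using juce::dsp::SIMDRegister (names are mine; the SIMD part assumes aligned memory):

```cpp
#include <juce_dsp/juce_dsp.h>

void applyGainSimd (float* buffer, int numSamples, float gain)
{
    using Reg = juce::dsp::SIMDRegister<float>;
    constexpr auto width = (int) Reg::SIMDNumElements;

    const auto g = Reg::expand (gain);
    int i = 0;

    // Main loop: one full register's worth of samples per iteration.
    // fromRawArray/copyToRawArray require SIMD-aligned memory.
    for (; i + width <= numSamples; i += width)
        (Reg::fromRawArray (buffer + i) * g).copyToRawArray (buffer + i);

    // Scalar tail: the "exit" for samples that don't fill a register.
    for (; i < numSamples; ++i)
        buffer[i] *= gain;
}
```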

How do you recommend distinguishing between the 2-way and 4-way functions? Do we need to?


Or… am I completely missing the ball here? I kind of feel like what I said is wrong, because by that school of thought we could just unroll loops…