Any tricks to speed up debug builds that make heavy use of SIMDRegister?

The JUCE SIMDRegister is great: it makes it super easy to write SIMD code and to use the JUCE processors with SIMD types instead of floats. However, after reworking big parts of our oversampled signal chain (which has some FIR filters in the oversampled path) to use SIMD, we see a great speedup in release builds and a massive slowdown in debug builds. The profiler shows that the hotspot is SIMDRegister's multiplication operator, which is not inlined as it is in release builds but instead causes three nested function calls for each multiplication. This is expected, I know, but the slowdown is so heavy that the plugin no longer meets its realtime constraints in a debug build, which makes development pretty hard.

So I’m wondering if there are any tricks to guide the compiler to always fully optimise the calls to the SIMDRegister operators, no matter which optimisation level is chosen. Some fancy attributes wrapped around the SIMDRegister implementation? I’m also thinking about solutions like moving e.g. FIR::Filter::processSingleSample into a separate TU and setting compiler flags on that file so it is always compiled with full optimisation.
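For what it’s worth, here is a rough sketch of the attribute idea, assuming GCC (Clang largely ignores the optimize attribute, and MSVC would need #pragma optimize or per-file flags instead); the macro and function names are just placeholders, not anything from juce_dsp:

```cpp
#include <juce_dsp/juce_dsp.h>

// GCC-only: raise the optimisation level for a single hot function even in a
// debug (-O0) build, so the SIMDRegister operators called inside it get inlined.
#if defined (__GNUC__) && ! defined (__clang__)
 #define FORCE_OPTIMISE __attribute__ ((optimize ("O3")))
#else
 #define FORCE_OPTIMISE
#endif

using Vec = juce::dsp::SIMDRegister<float>;

FORCE_OPTIMISE
static void multiplyAccumulate (Vec* acc, const Vec* src, const float* coeffs, size_t num)
{
    for (size_t i = 0; i < num; ++i)
        acc[i] += src[i] * coeffs[i];   // hot loop: one SIMD multiply-add per tap
}
```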

I already asked for help on Stack Overflow, but I’d be happy to hear your thoughts on this too; I guess I’m not the first one in the JUCE universe facing this problem :wink:

I had some time to try a few things today, and what really helped as a first proof of concept was to add a second cpp file to juce_dsp, specify -O3 for that file, and move all the internals of the process call, along with the processSingleSample function, into that TU. With these changes the plugin is able to play back in realtime again, even in debug builds.
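In case it helps anyone reading along, this is roughly what such a split could look like. The names are made up and the body is a hand-rolled direct-form FIR, not the actual juce_dsp internals; the .cpp is the file that gets the per-file -O3 flag:

```cpp
// OptimisedFir.h -- declaration visible to the unoptimised debug code
#pragma once
#include <juce_dsp/juce_dsp.h>

juce::dsp::SIMDRegister<float> processFirSample (juce::dsp::SIMDRegister<float> input,
                                                 const float* coefficients,
                                                 juce::dsp::SIMDRegister<float>* state,
                                                 size_t order);

// OptimisedFir.cpp -- compiled with -O3 even in debug configurations, so the
// SIMDRegister operators below are fully inlined
#include "OptimisedFir.h"

juce::dsp::SIMDRegister<float> processFirSample (juce::dsp::SIMDRegister<float> input,
                                                 const float* coefficients,
                                                 juce::dsp::SIMDRegister<float>* state,
                                                 size_t order)
{
    auto acc = input * coefficients[0];

    // accumulate the delayed samples and shift the delay line in one pass
    for (size_t i = order; i > 0; --i)
    {
        acc += state[i - 1] * coefficients[i];

        if (i > 1)
            state[i - 1] = state[i - 2];
    }

    state[0] = input;
    return acc;
}
```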

Would you consider adding a solution like that to the dsp module?


I also had some problems a few years back where even the standard Microsoft compiler could not remove the SIMDRegister wrapper in release builds, and it ran a lot slower on Windows. I switched to clang/LLVM on Windows because of this. Is this still a problem today?

Edit: These days I don’t optimize at this level anymore and let the compiler do its work.

MSVC 2019 has no problems optimising the SIMDRegister calls as expected in a release build. We used it quite a bit in our latest release, which does some heavy processing on oversampled audio, and switching to SIMD made a noticeable impact on CPU usage, especially when processing 5.1 audio.

This is indeed the kind of optimisation that would be pretty much impossible for the compiler to figure out on its own, given that you need to re-order the samples from the usual per-channel representation into the interleaved SIMD representation. So I’m happy that the SIMDRegister class is there, especially with Apple M1 machines out there, since it makes it fairly easy to maintain a cross-architecture compatible implementation.
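For anyone wondering what that re-ordering looks like, here is a minimal sketch (not taken from our code base) that packs one sample from each channel into the lanes of a SIMDRegister, assuming the channel count fits into the register width:

```cpp
#include <juce_dsp/juce_dsp.h>

using Vec = juce::dsp::SIMDRegister<float>;

// Interleave a per-channel AudioBuffer into SIMD registers: register n holds
// sample n of every channel, one channel per lane. Unused lanes stay zero.
static void interleaveToSimd (const juce::AudioBuffer<float>& in, Vec* out)
{
    jassert ((size_t) in.getNumChannels() <= Vec::SIMDNumElements);

    for (int n = 0; n < in.getNumSamples(); ++n)
    {
        alignas (alignof (Vec)) float lanes[Vec::SIMDNumElements] = {};

        for (int ch = 0; ch < in.getNumChannels(); ++ch)
            lanes[(size_t) ch] = in.getSample (ch, n);

        out[n] = Vec::fromRawArray (lanes);   // requires the aligned buffer above
    }
}
```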
