SIMDRegister usage in Debug

I understand that, because it’s heavily templated, Debug mode can be a CPU hog, but I’ve got to a point where my plugin eats over 100% CPU in Debug mode while the Release version uses about 2-5%.

So I can’t actually debug my code now. (I can generate symbols in Release mode, but it takes 10x the time to compile and some DEBUG features won’t work, which makes life harder.) Is there a way to force SSE optimization in Debug mode, or is the only possible route to use intrinsics instead of the SIMDRegister class?

What helps a lot is enabling inlining for DEBUG builds for these classes. Unfortunately this means patching the JUCE code, as forcedinline is disabled for debug builds. I copied the SIMDRegister code and added my own forcedinline, which works regardless of build type. In my opinion that’s the way it should be in the first place, as release builds tend to inline all short methods anyway. Without this, every single operation that involves SIMDRegister gets compiled to a jump to a subroutine with a lot of overhead.
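
To illustrate the idea, here is a minimal sketch of a force-inline macro that does not depend on the build type. `MY_FORCE_INLINE`, `Float4` and `multiplyAdd` are made-up names for illustration, not JUCE's, and the struct uses plain scalar maths as a stand-in for the real SIMD wrapper:

```cpp
// Hypothetical macro (not JUCE's forcedinline): asks for inlining in every
// configuration instead of falling back to plain `inline` in debug.
#if defined(_MSC_VER)
  #define MY_FORCE_INLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__)
  #define MY_FORCE_INLINE inline __attribute__((always_inline))
#else
  #define MY_FORCE_INLINE inline
#endif

// Hypothetical stand-in for a SIMD wrapper class: four floats, scalar maths.
struct Float4
{
    float v[4];

    MY_FORCE_INLINE Float4 operator+ (const Float4& o) const noexcept
    {
        return { { v[0] + o.v[0], v[1] + o.v[1], v[2] + o.v[2], v[3] + o.v[3] } };
    }

    MY_FORCE_INLINE Float4 operator* (const Float4& o) const noexcept
    {
        return { { v[0] * o.v[0], v[1] * o.v[1], v[2] * o.v[2], v[3] * o.v[3] } };
    }
};

// Computes a + b * c lane by lane using the force-inlined operators.
MY_FORCE_INLINE Float4 multiplyAdd (const Float4& a, const Float4& b, const Float4& c) noexcept
{
    return a + b * c;
}
```

With GCC and Clang the `always_inline` attribute is honoured even at `-O0`. As far as I know, MSVC ignores `__forceinline` when compiling with `/Ob0`, which is the default for debug builds, so inline expansion (`/Ob1`) has to be enabled there as well.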

I’m experiencing at least a 30x slowdown in Debug on Windows: a stereo delay line that takes less than ~2 percent in Release takes over ~60 in Debug.
I could cope before by enabling inlining in VS, but that no longer seems to help in VS 2022.

Is there any way this can be reduced properly? I’m finding it frustrating and unusable as it stands.

We moved the parts of the JUCE code that slowed down execution the most (especially filters) to a separate translation unit that’s always compiled with full optimisation flags, even in debug builds. In release builds we still use the header-only version to get the maximum possible inlining. For us this works well, but you obviously need to modify JUCE for that, and you need a way to set compile flags for a single TU in a JUCE module. This is no big deal in CMake, but might involve quite a bit of work when using the Projucer. So maybe not the kind of answer you are looking for.
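
A rough sketch of the source-side arrangement, shown as a single file for brevity. Every name here (`MYDSP_HEADER_ONLY`, `MYDSP_DEFINE`, `OnePole`) is hypothetical and not part of JUCE, and the build-system part (giving the one .cpp file its own optimisation flags) is not shown:

```cpp
#include <cassert>
#include <cstddef>

// ---- "header" part ---------------------------------------------------
// Hypothetical switch: 1 = header-only (release), 0 = out of line (debug).
#ifndef MYDSP_HEADER_ONLY
  #define MYDSP_HEADER_ONLY 0
#endif

#if MYDSP_HEADER_ONLY
  #define MYDSP_DEFINE inline
#else
  #define MYDSP_DEFINE
#endif

// Hypothetical one-pole low-pass, standing in for a real filter class.
struct OnePole
{
    float coeff = 0.5f;
    float state = 0.0f;

    // Works on a whole block per call, so when the definition is out of line
    // the call overhead is paid once per block, not once per sample.
    void processBlock (float* data, std::size_t numSamples) noexcept;
};

// ---- in debug builds, this part would live in its own .cpp file, ----
// ---- the one translation unit that gets the optimisation flags   ----
MYDSP_DEFINE void OnePole::processBlock (float* data, std::size_t numSamples) noexcept
{
    assert (data != nullptr || numSamples == 0);

    for (std::size_t i = 0; i < numSamples; ++i)
    {
        state += coeff * (data[i] - state);   // y[n] = y[n-1] + a * (x[n] - y[n-1])
        data[i] = state;
    }
}
```

The point of the design is that the hot loop sits behind a single function boundary, so only that one file needs to be optimised while the rest of the plugin stays fully debuggable.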

Thanks. I’m guessing there’s not many Windows devs in the Juce team. :grin:

edit - in all seriousness though, for me, the SIMD code has gone from very cool, to very rubbish to use.

Hi. Is there any way someone could share what they did to the SIMD code to speed up debug? I can’t get it much faster, and it’s hindering my usage of it quite a lot in debug.

In these results I’m trying to wrap a SIMD ‘pos’ between zero and ‘buff_wrap’.
This is a short profile showing the large number of hits the mask code takes…

Can anyone help? I can’t seem to get it to force inline.
I’m on VS 2022, which seemed to react differently in debug than other VSs.

Have you tried working with other compiler optimisation levels? Another suggestion (which I think has been made) is to set full optimisation for certain cpp files and try and work with that.

Or…you could just bite the bullet and make release your primary running mode, reverting back to debug only when the wheels fall off and you need to.

The games industry has had this debug/release disparity as an issue since forever.

I’ve been in the computer game industry for years, on many many platforms, and I’ve never seen a difference like this before! It’s quite astonishing how slow it is in debug.

I’m going to try the LLVM (clang-cl) option and see if it makes a difference.

Here’s some disassembly of just one line of code, in debug mode with inlining enabled.
Unfortunately, for some reason I can’t use breakpoints with optimisation on in debug, so I can’t show the fast version…

This line

		pos = ((pos + buff_wrap) & mask) + (pos & (~mask));

becomes this code…

00007FFDA8D56E4A  mov         rax,qword ptr [this]  
00007FFDA8D56E52  movups      xmm0,xmmword ptr [rax+90h]  
00007FFDA8D56E59  movdqa      xmmword ptr [rsp+0C20h],xmm0  
00007FFDA8D56E62  mov         rax,qword ptr [rsp+300h]  
00007FFDA8D56E6A  mov         qword ptr [rsp+20h],rax  
00007FFDA8D56E6F  movaps      xmm0,xmmword ptr [rsp+0C20h]  
00007FFDA8D56E77  movaps      xmmword ptr [rsp+0C40h],xmm0  
00007FFDA8D56E7F  movaps      xmm0,xmmword ptr [pos]  
00007FFDA8D56E87  movaps      xmmword ptr [rsp+0C30h],xmm0  
00007FFDA8D56E8F  movaps      xmm0,xmmword ptr [rsp+0C30h]  
00007FFDA8D56E97  addpd       xmm0,xmmword ptr [rsp+0C40h]  
00007FFDA8D56EA0  movaps      xmmword ptr [rsp+0C50h],xmm0  
00007FFDA8D56EA8  movaps      xmm0,xmmword ptr [rsp+0C50h]  
00007FFDA8D56EB0  movaps      xmmword ptr [rsp+0C60h],xmm0  
00007FFDA8D56EB8  movaps      xmm0,xmmword ptr [rsp+0C60h]  
00007FFDA8D56EC0  movaps      xmmword ptr [rsp+0C70h],xmm0  
00007FFDA8D56EC8  movaps      xmm0,xmmword ptr [rsp+0C70h]  
00007FFDA8D56ED0  movaps      xmmword ptr [rsp+1C00h],xmm0  
00007FFDA8D56ED8  lea         rax,[rsp+1C00h]  
00007FFDA8D56EE0  mov         qword ptr [rsp+20h],rax  
00007FFDA8D56EE5  lea         rax,[rsp+0C90h]  
00007FFDA8D56EED  lea         rcx,[mask]  
00007FFDA8D56EF5  mov         rdi,rax  
00007FFDA8D56EF8  mov         rsi,rcx  
00007FFDA8D56EFB  mov         ecx,10h  
00007FFDA8D56F00  rep movs    byte ptr [rdi],byte ptr [rsi]  
00007FFDA8D56F02  lea         rax,[rsp+0CA0h]  
00007FFDA8D56F0A  lea         rcx,[rsp+0C90h]  
00007FFDA8D56F12  mov         rdi,rax  
00007FFDA8D56F15  mov         rsi,rcx  
00007FFDA8D56F18  mov         ecx,10h  
00007FFDA8D56F1D  rep movs    byte ptr [rdi],byte ptr [rsi]  
00007FFDA8D56F1F  mov         rax,qword ptr [rsp+308h]  
00007FFDA8D56F27  mov         qword ptr [rsp+28h],rax  
00007FFDA8D56F2C  movdqa      xmm0,xmmword ptr [rsp+0CA0h]  
00007FFDA8D56F35  movdqa      xmmword ptr [rsp+0CB0h],xmm0  
00007FFDA8D56F3E  movdqa      xmm0,xmmword ptr [rsp+0CB0h]  
00007FFDA8D56F47  movdqa      xmmword ptr [rsp+1C10h],xmm0  
00007FFDA8D56F50  movaps      xmm0,xmmword ptr [rsp+1C10h]  
00007FFDA8D56F58  movaps      xmmword ptr [rsp+0CD0h],xmm0  
00007FFDA8D56F60  movaps      xmm0,xmmword ptr [rsp+0CD0h]  
00007FFDA8D56F68  movaps      xmmword ptr [rsp+0CF0h],xmm0  
00007FFDA8D56F70  mov         rax,qword ptr [rsp+20h]  
00007FFDA8D56F75  movups      xmm0,xmmword ptr [rax]  
00007FFDA8D56F78  movups      xmmword ptr [rsp+0CE0h],xmm0  
00007FFDA8D56F80  movaps      xmm0,xmmword ptr [rsp+0CE0h]  
00007FFDA8D56F88  andps       xmm0,xmmword ptr [rsp+0CF0h]  
00007FFDA8D56F90  movaps      xmmword ptr [rsp+0D00h],xmm0  
00007FFDA8D56F98  movaps      xmm0,xmmword ptr [rsp+0D00h]  
00007FFDA8D56FA0  movaps      xmmword ptr [rsp+0D10h],xmm0  
00007FFDA8D56FA8  movaps      xmm0,xmmword ptr [rsp+0D10h]  
00007FFDA8D56FB0  movaps      xmmword ptr [rsp+0D20h],xmm0  
00007FFDA8D56FB8  movaps      xmm0,xmmword ptr [rsp+0D20h]  
00007FFDA8D56FC0  movaps      xmmword ptr [rsp+1C20h],xmm0  
00007FFDA8D56FC8  lea         rax,[rsp+1C20h]  
00007FFDA8D56FD0  mov         qword ptr [rsp+28h],rax  
00007FFDA8D56FD5  movdqa      xmm0,xmmword ptr [mask]  
00007FFDA8D56FDE  movdqa      xmmword ptr [rsp+0D60h],xmm0  
00007FFDA8D56FE7  movdqa      xmm0,xmmword ptr [juce::dsp::SIMDNativeOps<unsigned __int64>::kAllBitsSet (07FFDA9B91C90h)]  
00007FFDA8D56FEF  movdqa      xmmword ptr [rsp+0D40h],xmm0  
00007FFDA8D56FF8  movdqa      xmm0,xmmword ptr [rsp+0D40h]  
00007FFDA8D57001  movdqa      xmmword ptr [rsp+0D50h],xmm0  
00007FFDA8D5700A  movdqa      xmm0,xmmword ptr [rsp+0D50h]  
00007FFDA8D57013  movdqa      xmmword ptr [rsp+0D70h],xmm0  
00007FFDA8D5701C  movdqa      xmm0,xmmword ptr [rsp+0D60h]  
00007FFDA8D57025  pandn       xmm0,xmmword ptr [rsp+0D70h]  
00007FFDA8D5702E  movdqa      xmmword ptr [rsp+0D80h],xmm0  
00007FFDA8D57037  movdqa      xmm0,xmmword ptr [rsp+0D80h]  
00007FFDA8D57040  movdqa      xmmword ptr [rsp+0D90h],xmm0  
00007FFDA8D57049  movdqa      xmm0,xmmword ptr [rsp+0D90h]  
00007FFDA8D57052  movdqa      xmmword ptr [rsp+0DA0h],xmm0  
00007FFDA8D5705B  movdqa      xmm0,xmmword ptr [rsp+0DA0h]  
00007FFDA8D57064  movdqa      xmmword ptr [rsp+1C30h],xmm0  
00007FFDA8D5706D  lea         rax,[rsp+1C30h]  
00007FFDA8D57075  mov         qword ptr [rsp+0F0h],rax  
00007FFDA8D5707D  mov         rax,qword ptr [rsp+0F0h]  
00007FFDA8D57085  movups      xmm0,xmmword ptr [rax]  
00007FFDA8D57088  movdqa      xmmword ptr [rsp+0DC0h],xmm0  
00007FFDA8D57091  movdqa      xmm0,xmmword ptr [rsp+0DC0h]  
00007FFDA8D5709A  movdqa      xmmword ptr [rsp+0DD0h],xmm0  
00007FFDA8D570A3  movdqa      xmm0,xmmword ptr [rsp+0DD0h]  
00007FFDA8D570AC  movdqa      xmmword ptr [rsp+1E40h],xmm0  
00007FFDA8D570B5  movaps      xmm0,xmmword ptr [rsp+1E40h]  
00007FFDA8D570BD  movaps      xmmword ptr [rsp+0DF0h],xmm0  
00007FFDA8D570C5  movaps      xmm0,xmmword ptr [rsp+0DF0h]  
00007FFDA8D570CD  movaps      xmmword ptr [rsp+0E10h],xmm0  
00007FFDA8D570D5  movaps      xmm0,xmmword ptr [pos]  
00007FFDA8D570DD  movaps      xmmword ptr [rsp+0E00h],xmm0  
00007FFDA8D570E5  movaps      xmm0,xmmword ptr [rsp+0E00h]  
00007FFDA8D570ED  andps       xmm0,xmmword ptr [rsp+0E10h]  
00007FFDA8D570F5  movaps      xmmword ptr [rsp+0E20h],xmm0  
00007FFDA8D570FD  movaps      xmm0,xmmword ptr [rsp+0E20h]  
00007FFDA8D57105  movaps      xmmword ptr [rsp+0E30h],xmm0  
00007FFDA8D5710D  movaps      xmm0,xmmword ptr [rsp+0E30h]  
00007FFDA8D57115  movaps      xmmword ptr [rsp+0E40h],xmm0  
00007FFDA8D5711D  movaps      xmm0,xmmword ptr [rsp+0E40h]  
00007FFDA8D57125  movaps      xmmword ptr [rsp+1CC0h],xmm0  
00007FFDA8D5712D  lea         rax,[rsp+1CC0h]  
00007FFDA8D57135  mov         qword ptr [rsp+0F8h],rax  
00007FFDA8D5713D  lea         rax,[rsp+0E50h]  
00007FFDA8D57145  mov         rdi,rax  
00007FFDA8D57148  mov         rsi,qword ptr [rsp+0F8h]  
00007FFDA8D57150  mov         ecx,10h  
00007FFDA8D57155  rep movs    byte ptr [rdi],byte ptr [rsi]  
00007FFDA8D57157  lea         rax,[rsp+0E60h]  
00007FFDA8D5715F  lea         rcx,[rsp+0E50h]  
00007FFDA8D57167  mov         rdi,rax  
00007FFDA8D5716A  mov         rsi,rcx  
00007FFDA8D5716D  mov         ecx,10h  
00007FFDA8D57172  rep movs    byte ptr [rdi],byte ptr [rsi]  
00007FFDA8D57174  movaps      xmm0,xmmword ptr [rsp+0E60h]  
00007FFDA8D5717C  movaps      xmmword ptr [rsp+0E80h],xmm0  
00007FFDA8D57184  mov         rax,qword ptr [rsp+28h]  
00007FFDA8D57189  movups      xmm0,xmmword ptr [rax]  
00007FFDA8D5718C  movups      xmmword ptr [rsp+0E70h],xmm0  
00007FFDA8D57194  movaps      xmm0,xmmword ptr [rsp+0E70h]  
00007FFDA8D5719C  addpd       xmm0,xmmword ptr [rsp+0E80h]  
00007FFDA8D571A5  movaps      xmmword ptr [rsp+0E90h],xmm0  
00007FFDA8D571AD  movaps      xmm0,xmmword ptr [rsp+0E90h]  
00007FFDA8D571B5  movaps      xmmword ptr [rsp+0EA0h],xmm0  
00007FFDA8D571BD  movaps      xmm0,xmmword ptr [rsp+0EA0h]  
00007FFDA8D571C5  movaps      xmmword ptr [rsp+0EB0h],xmm0  
00007FFDA8D571CD  movaps      xmm0,xmmword ptr [rsp+0EB0h]  
00007FFDA8D571D5  movaps      xmmword ptr [rsp+1E30h],xmm0  
00007FFDA8D571DD  lea         rax,[rsp+1E30h]  
00007FFDA8D571E5  mov         qword ptr [rsp+100h],rax  
00007FFDA8D571ED  mov         rax,qword ptr [rsp+100h]  
00007FFDA8D571F5  movups      xmm0,xmmword ptr [rax]  
00007FFDA8D571F8  movdqa      xmmword ptr [pos],xmm0
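
For comparison, here is a rough sketch of how the same wrap might look with raw SSE2 intrinsics instead of SIMDRegister. It assumes the mask marks lanes where pos is below zero (the thread doesn't show how the mask is built), works on two doubles per register as the addpd above suggests, and uses a made-up function name. A scalar fallback is included for targets without SSE2:

```cpp
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
  #include <emmintrin.h>
  #define WRAP_HAS_SSE2 1
#else
  #define WRAP_HAS_SSE2 0
#endif

// Hypothetical helper: wraps two double positions in place.
// Assumption: a lane needs wrapping when its value is below zero,
// in which case buffWrap is added to it.
inline void wrapNegative2 (double* pos, double buffWrap) noexcept
{
#if WRAP_HAS_SSE2
    const __m128d p    = _mm_loadu_pd (pos);
    const __m128d mask = _mm_cmplt_pd (p, _mm_setzero_pd());         // all bits set where p < 0
    const __m128d add  = _mm_and_pd (mask, _mm_set1_pd (buffWrap));  // buffWrap or 0.0 per lane
    _mm_storeu_pd (pos, _mm_add_pd (p, add));
#else
    for (int i = 0; i < 2; ++i)
        if (pos[i] < 0.0)
            pos[i] += buffWrap;
#endif
}
```

Adding `buffWrap & mask` to pos should give the same result as the select-style expression in the original line for ordinary finite values, with one AND and one ADD. An unoptimised build will probably still spill each intermediate to the stack, but there are no wrapper objects to copy around.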

[edit] I just used optimisation O1 and I could stop and look at the code, although it’s probably an unfair comparison because it needs to take into account the surrounding code.
Anyway, I could work with that for now, thanks for the suggestion, @Nitsuj70

OK, I’m using the LLVM toolset installed with the VS 2022 installer; it was easy to select LLVM (clang-cl) in Visual Studio.
LLVM in debug brought it down to about 46 percent, which is still terrible but workable for me right now.

Unfortunately it seems the Projucer can only select the standard MS toolsets, which is a shame.

I’m doing a lot of SIMD work currently myself for processing synthesiser voices. It seems to be a fact of life that you’ll be taking a big hit in performance when working in Debug build due to the lack of inlining etc which is pretty important for SIMD vector libraries.

To help with this I created a ‘Develop’ build to go along with Debug and Release. Develop build is set to fully inline but still include debug information. If things really go wrong it’s easy to fall back to ‘Debug’.

I’m fully clang as well, but I’ve recently transitioned from Xcode to using VS Code and CMake which, once you’ve spent the time getting it all working, makes adding new configs pretty easy.

I’ve used the Juce SIMD library for a few projects and it’s always been acceptable in Debug. It’s just that some code is really slow in Debug. Cripplingly so.

I don’t want to use CMake. :grin:
And just setting in-lining doesn’t help any more, it seems. Well not for me anyhow. At least with the MS toolset.

Is it really that difficult to add a clang option to the Projucer?

I hear you. I’ve got my own SIMD library that I’m using but whilst I’m getting a 50% boost over scalar (wavetable scanning) in release, it’s 3-4x slower in debug. It only starts clawing speed back at -O2 (clang).

There’s this: Clang/LLVM compilation on W10 x64

Don’t know if it’s of any use.

Your own SIMD library? Very cool. I was thinking of trying the GLM library; I’ve used it before for its swizzle-enabled vector stuff for GLSL in C++. I really liked it a few years ago, but I want to use Juce’s library.

I didn’t know about GLM but that looks really good. I looked at a lot of SIMD libraries but in the end I needed something lightweight and only supporting float32x4 and int32x4. So it’s a few vector classes wrapping intrinsics and a whole bunch of functions for fast math and utility functions (interpolation, clamping etc). I use the ‘simde’ library to provide compatibility with NEON on ARM.

Anyhow, all good fun. The Juce SIMDRegister stuff looks solid.