[DSP module discussion] New class SIMDRegister

dsp_module

#1

Vectorization is a huge topic in DSP and plug-in development, and it is also something very tricky to get right if you want significant improvements in speed. It can require a lot of work on the developer's side, even though the auto-vectorization of modern compilers, plus a few basic rules to follow during development, already help a lot in getting good performance out of audio algorithms.

I have to confess I don’t know that topic very well myself, and it’s only thanks to @fabian + @jules that I was able to successfully code a fast convolution algorithm in the dsp::Convolution class, using the base code in JUCE which already helps developers vectorize their code where possible. For example, I used the functions provided by the FloatVectorOperations class, and I also made good use of AudioBuffer<float>, which automatically aligns the audio data to the register sizes to allow vectorization. I also had to reorganize some FFT bins on the fly to be able to get the “four times” speed-up.

But thanks to @fabian, there is now also a new way to handle vectorization and to optimize your code with SSE/AVX operations when available: the new SIMDRegister class, available since JUCE 5.1 in the DSP module.

The beauty of this class is that it is a type, in the same sense that float or double are types. That means that, whenever the context is compatible with SIMDRegister, your usual operations (addition, multiplication etc.) are performed automatically on the vectorized registers. And since it is templated, you can use it with floats, doubles, or even complex numbers, provided the target machine supports the right instruction set.
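To illustrate the "it's a type" idea without pulling in JUCE, here is a minimal, portable sketch (Float4 and applyGain are made-up names, standing in for dsp::SIMDRegister<float>): once the operators are overloaded, you write the same expressions you would write with plain floats, and each operator maps to a single SIMD instruction in the real class.

```cpp
#include <cassert>

// Hypothetical 4-wide float type standing in for dsp::SIMDRegister<float>.
struct Float4
{
    float v[4];

    friend Float4 operator+ (Float4 a, Float4 b)
    {
        for (int i = 0; i < 4; ++i) a.v[i] += b.v[i];  // one SIMD add in the real class
        return a;
    }

    friend Float4 operator* (Float4 a, Float4 b)
    {
        for (int i = 0; i < 4; ++i) a.v[i] *= b.v[i];  // one SIMD multiply
        return a;
    }
};

// Same expression you would write with floats, but 4 values at a time.
inline Float4 applyGain (Float4 x, Float4 gain)
{
    return x * gain + x;
}
```

The point is that generic DSP code templated on the sample type can be instantiated with such a type and become vectorized "for free".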

For more information about how to use it, the best thing to do is to have a look at the demo applications, such as the DSP module demo app, or at some of the new classes from the DSP module. You’ll see that the filtering classes are compatible with SIMDRegister, which means you can process 4 channels of filtering for the cost of one, thanks to that class!

And I think you can’t imagine how difficult it was to get the IIR/FIR/StateVariable classes right during development, since we wanted to provide their new functionality in the DSP module while also using all the machinery around them (the new Processor API, the wrapper + context concepts, the Duplicator concept, templating everything, and of course the SIMDRegister class). It was especially difficult for @fabian and @jules, since they were the ones who provided the application of the new Processor concepts :slight_smile: (more about them later)

But anyway, once you understand how SIMDRegister works, you can use it to add SIMD acceleration to your code quite quickly, but of course only when the context allows it, which means only when you have operations that can be performed in parallel. For example, I recently coded a very simple filter class from scratch, and thanks to SIMDRegister I was able to run 32 of them in parallel, processed in blocks of four. At the end of the day, it’s one of the classes from the DSP module I use the most nowadays, AudioBlock still being the first.
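To make the "in parallel, processed in blocks of four" idea concrete, here is a rough, JUCE-free sketch (the OnePole4 name and layout are mine, not from the library) of four independent one-pole lowpass filters stepped in lockstep, which is essentially what a SIMDRegister-based filter does with four channels at once:

```cpp
#include <cassert>

// Hypothetical example: 4 independent one-pole lowpass filters advanced together,
// the way a SIMDRegister<float>-based filter processes 4 channels at once.
struct OnePole4
{
    float state[4] {};
    float coeff[4] { 0.1f, 0.1f, 0.1f, 0.1f };

    // Processes one sample per channel, in place.
    // y[n] = y[n-1] + coeff * (x[n] - y[n-1])
    void processSample (float* x)
    {
        for (int c = 0; c < 4; ++c)   // a single SIMD op in the vectorized version
            x[c] = state[c] += coeff[c] * (x[c] - state[c]);
    }
};
```

With a SIMD type in place of the four scalar lanes, the per-channel loop collapses into one multiply-add per sample for all four filters.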

Tell me what you think of that class, and whether you have already used it!


#2

One thing that would be MASSIVELY beneficial for using these classes would be a way to abstract a loop for processing data which doesn’t rely on surrounding samples, for the times when the vector you’re iterating over isn’t an exact multiple of the size of a SIMD register (i.e. an odd number of samples).

This would mean that you could write a single loop function which can use both SIMD and non-SIMD paths. Having to write my loops twice for SIMD and non-SIMD is currently the most tedious bit of SIMD coding.

My first instinct for approaching this would be some sort of templated functor, but I haven’t given it much serious thought yet.
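A templated functor along those lines might look like this rough, JUCE-free sketch (processBuffer, Gain and Float4 are hypothetical names, not library API): a single kernel with overloaded call operators is invoked on 4-wide chunks for the body of the loop and on plain floats for the remainder, so the loop is only written once.

```cpp
#include <cassert>
#include <cstddef>

// Minimal 4-wide stand-in for a SIMD vector type.
struct Float4 { float v[4]; };

// The loop is written once: the same kernel handles both the SIMD-width body
// and the scalar tail when n is not a multiple of 4.
template <typename Kernel>
void processBuffer (float* data, std::size_t n, Kernel kernel)
{
    std::size_t i = 0;

    for (; i + 4 <= n; i += 4)                 // vectorizable body
    {
        Float4 chunk;
        for (int j = 0; j < 4; ++j) chunk.v[j] = data[i + j];
        chunk = kernel (chunk);
        for (int j = 0; j < 4; ++j) data[i + j] = chunk.v[j];
    }

    for (; i < n; ++i)                         // scalar tail for the leftover samples
        data[i] = kernel (data[i]);
}

// Example kernel: works on Float4 and float alike.
struct Gain
{
    float g;
    float  operator() (float x)  const { return x * g; }
    Float4 operator() (Float4 x) const { for (auto& s : x.v) s *= g; return x; }
};
```

With real SIMD types, the Float4 overload would be a single vector multiply; the scalar overload covers the odd samples at the end.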


#3

I finally tried using the new SIMDRegister classes. Before that I was using simdpp, until I found that MSVC 2017 optimized the code so aggressively that it stopped working. I like the quite clear and understandable design of these JUCE classes, however I wanted to start a discussion about one point. Ivan, you wrote in the code:

Note that using SIMDRegister without enabling optimizations will result
in code with very poor performance.

I definitely am seeing that for debug builds, and in fact the performance seems to be really bad (on OSX). Looking at the assembly I see a lot of glue code being produced, and I wonder why that is and whether it could maybe be improved somehow. Right now the penalty is quite high; it gets hard to run real-time code, which is of course what these classes are for.


#4

Could this maybe be improved by using some more forcedinline in the SIMDRegister class? It’s used in the subclasses for the different platforms, but not for the operators etc. If I read correctly, debug builds won’t inline anything unless it is marked forcedinline.


#5

Answering myself… forcedinline is disabled for JUCE_DEBUG builds. For this case it would be great to have an inline attribute that can be used in debug builds as well.


#6

Hi @pflugshaupt, JUCE’s SIMDRegister class makes heavy use of templates (as do many other modern SIMD libraries) and will therefore always create a lot of glue code in debug mode. There is simply no way around this. I don’t think forcedinline would be enough.

When comparing with other SIMD libraries, note that those libraries usually come pre-compiled, i.e. the SIMD library itself was compiled in release mode. Obviously, you can still use such a library even if your app is in debug mode, hence the big performance difference.

There is no easy way of solving this as JUCE is just source code (no pre-compiled libs). Maybe one day we could have a feature where you could switch between debug and release on a per-module basis.


#7

Hi fabian. I don’t want to offend, but I disagree on multiple points. Templated code does not necessarily create a lot of glue code in debug mode; rather, the opposite is true. The compiler is forced to evaluate and instantiate templated C++ regardless of optimization settings, and therefore it usually results in “inlined” code even in debug builds.

The SIMD libraries I’ve seen (simdpp, vcl, boost.simd) are all template based, so they don’t get pre-compiled. Having them in a separate translation unit would be terrible and would kill performance, as inlining is crucial for intrinsics.

That’s the whole point of my argument. With this kind of code, one line of C++ eventually boils down to one CPU instruction, and that’s why forcing inlines even in debug builds would be crucial to get reasonable performance. Right now, with SIMDRegister in a debug build, each statement gets turned into a static, non-inlined call that then executes a single SIMD instruction. In between there’s a lot of glue coming from the compiler not being allowed to optimize away variables, and you end up with something that performs worse than a non-SIMD implementation.

In the meantime I copied the JUCE classes locally and forced inlining for debug builds using inline __attribute__ ((always_inline)), and it does help get rid of all those jumps.
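A portable version of that workaround might look like the following sketch (the MY_FORCEINLINE name is made up; this mirrors how macros like BOOST_FORCEINLINE are typically defined across compilers):

```cpp
#include <cassert>

// Hypothetical force-inline macro that stays active in debug builds,
// unlike JUCE's forcedinline, which is a no-op when JUCE_DEBUG is set.
#if defined (_MSC_VER)
 #define MY_FORCEINLINE __forceinline
#elif defined (__GNUC__) || defined (__clang__)
 #define MY_FORCEINLINE inline __attribute__ ((always_inline))
#else
 #define MY_FORCEINLINE inline
#endif

MY_FORCEINLINE float addGain (float x, float g)
{
    return x * g;   // with always_inline, the call overhead disappears even at -O0
}
```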

Being able to switch to release for single modules would not help at all, because again… intrinsics desperately need to be inlined and cannot be compiled into a separate module.


#8

I don’t think you can say in general whether performance will decrease or increase in debug mode when using templated code. It really depends on the code: even when everything is inlined, templated code may produce a lot of temporary variables with trivial constructors, which are only optimised away in release mode.

But sure, there are many cases where templated code will move the evaluation of some code from run-time to compile-time.

I only have experience with boost.simd, and the last time I tested it, it had terrible performance when used in debug mode. Some of the SIMD libraries use #pragma GCC optimize ("O3") (and the equivalent for other compilers) inside their code to turn on the optimiser.
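For reference, the GCC pragma mentioned above can be scoped with push_options/pop_options so only a region of code is optimised; a small sketch (guarded, since Clang and MSVC do not honour these pragmas and would only warn):

```cpp
#include <cassert>

// Turn the optimiser on for just this region, even in an -O0 build.
// Only GCC honours these pragmas, hence the guard.
#if defined (__GNUC__) && ! defined (__clang__)
 #pragma GCC push_options
 #pragma GCC optimize ("O3")
#endif

inline float sumSquares (const float* data, int n)
{
    float acc = 0.0f;
    for (int i = 0; i < n; ++i)   // a candidate loop for auto-vectorization
        acc += data[i] * data[i];
    return acc;
}

#if defined (__GNUC__) && ! defined (__clang__)
 #pragma GCC pop_options
#endif
```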

Yes, I understand how inlining works.

I’m happy to change this, but it seems that performance still wouldn’t be where you would want it to be. So I still think that forcedinline is not enough?

I was referring to being able to switch the optimisation of the module that uses the SIMD code, and of the SIMD module itself.


#9

Thanks for the detailed answer. I see your point about the modules. That would make sense of course, but I’m not using modules for my code, so it wouldn’t help me.

About the speed gain from inlining: I don’t expect the same speed as a release build, I would just like to be able to use SIMD code in a debug build that can still run in real time, and I find this can currently get tricky, so any speed gain would be welcome. I solved the issue for myself by forcing the inline, so I’m all good now; I’m just trying to help others.

The use of forcedinline in some of the SIMDRegister classes makes me wonder whether this was perhaps intended by the authors. Unfortunately, forcedinline currently does nothing in JUCE debug builds, which I find a bit weird. What is the intended use of this macro? In a build with inline optimization, I imagine the compiler will inline all the functions currently marked forcedinline anyway.

I looked at how the very similar BOOST_FORCEINLINE is defined, and it indeed appears to be intended for debug builds and builds with no optimization, so that the most fundamental code doesn’t kill performance too much while debugging.


#10

Yeah, I think you are right that forcedinline is really only useful in debug mode. I’ll discuss with the JUCE team how we might change this (we may not want to inline all the code that was previously marked with JUCE’s forcedinline).