I’m currently building a large scale Audio Mixer/Matrix with many input and output channels.
So the DSP algorithm is rather simple (“copy this buffer with this gain here, this one there, etc.”), but for many channels and connections.
When profiling my CPU load, I see that the load alternates between two cores (maybe every 2-3 seconds, some heat-prevention thing?) while the rest of them are all chilling at 0%.
(Interestingly, my old MacBook Pro 2012 (OS 10.13) seemed to spread the load across all of its 4 cores and didn’t show this alternating pattern, but I’m not 100% sure on that one.)
I looked through the forum for information about multithreading in JUCE, but most of the cases concern priorities, like separating the GUI and audio threads, or loading things in the background.
In my case the question is: is there any benefit to, and feasible way of, splitting my real-time (all equally high-priority) channel calculations across different threads?
E.g. let thread/core 1 calculate channels 1-32 while thread 2 does 33-64, and so forth…
Is this in any way feasible? If so, does anyone have ideas / starting points / tips for how to do it?
For example, I don’t understand how I would synchronize the individual run() functions of the threads with my IO callback.
Also, my targeted buffer size is only 32 samples, so very short buffers (I read that such small buffer sizes are bad for multithreading, since they increase the relative overhead a lot?).
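To make the idea concrete, here’s roughly the kind of partitioning I have in mind. This is a plain C++ sketch with made-up names (`processRange`, `processPartitioned`), not actual JUCE API, and it naively spawns threads per call, which is presumably part of the overhead problem:

```cpp
#include <thread>
#include <vector>

constexpr int bufferSize = 32;   // illustrative, matches my 32-sample target

// one gain-scaled copy per channel in [first, last)
void processRange (int first, int last,
                   const std::vector<std::vector<float>>& in,
                   std::vector<std::vector<float>>& out,
                   const std::vector<float>& gains)
{
    for (int ch = first; ch < last; ++ch)
        for (int i = 0; i < bufferSize; ++i)
            out[ch][i] = in[ch][i] * gains[ch];
}

// split the channel range across two threads and join before returning,
// i.e. thread 1 handles the first half, thread 2 the second half.
// NOTE: spawning threads inside every callback adds exactly the overhead
// I'm worried about -- real code would reuse persistent worker threads.
void processPartitioned (const std::vector<std::vector<float>>& in,
                         std::vector<std::vector<float>>& out,
                         const std::vector<float>& gains)
{
    const int mid = (int) in.size() / 2;
    std::thread t1 (processRange, 0,   mid,               std::cref (in), std::ref (out), std::cref (gains));
    std::thread t2 (processRange, mid, (int) in.size(),   std::cref (in), std::ref (out), std::cref (gains));
    t1.join();
    t2.join();
}
```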
Sorry for all those questions at once (this post needs a refactor).
Thanks in advance for any more insights into this !
PS: the multithreadDemo in the DemoRunner crashes after some time for me. Maybe it’s just me, but I wanted to report it.
First of all, look at Intel IPP, IMO.
Thanks for the fast reply,
So as far as I understand it, it’s a library from Intel for managing multimedia tasks on Intel CPUs, giving you deeper control of what runs where and how, is that correct?
For a C++ beginner like me who is glad his little JUCE app is running well, it seems like a daunting task to move to / work with this API/library and leave my safe JUCE world.
But if there is no way to do it within the JUCE API, this seems like a very powerful tool.
Will definitely have a look at it, thank you very much!
There is IPP, which is a SIMD library.
Regarding threading, you can also look at Intel Threading Building Blocks (TBB), but it doesn’t work nicely in a realtime environment, so you will need to set up your worker threads manually.
But SIMD can already give you a big boost if you don’t use it yet.
If you are just copying/summing audio around, it probably isn’t worth multithreading because of the threading overhead, the syncing of the threads, and all the memory accesses involved. (Of course I can’t say that with 100% certainty; you would have to implement the multithreading first and see whether it actually helps.)
Thank you both for the explanation, will have a further look into SIMD! Since I’m using JUCE AudioBuffer methods I’m probably already using SIMD under the hood, but I’ll see if I can apply it on a larger scale there.
After some more digging I came up on this stackoverflow thread about parallel loops:
The top answer suggests a std::for_each with the parallel execution policy set (C++17).
Maybe this is useful for me (and for someone else stumbling across this in the future).
Will have a look at it as well!
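For reference, here’s roughly what such a parallel per-channel loop does, hand-rolled with std::async instead of std::for_each(std::execution::par, …) (the parallel STL needs a backend like TBB on some toolchains). All names here are made up for illustration:

```cpp
#include <algorithm>
#include <future>
#include <vector>

// apply a per-channel gain; each async task handles one chunk of channels
void applyGains (std::vector<std::vector<float>>& channels,
                 const std::vector<float>& gains,
                 std::size_t chunkSize = 16)
{
    std::vector<std::future<void>> tasks;

    for (std::size_t first = 0; first < channels.size(); first += chunkSize)
    {
        const std::size_t last = std::min (first + chunkSize, channels.size());

        tasks.push_back (std::async (std::launch::async, [&channels, &gains, first, last]
        {
            for (std::size_t ch = first; ch < last; ++ch)
                for (auto& sample : channels[ch])
                    sample *= gains[ch];
        }));
    }

    for (auto& t : tasks)
        t.get();   // wait for all chunks to finish
}
```

Note this still pays the task-spawning cost on every call, which is the overhead concern raised earlier in the thread.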
From what you describe, your DSP code only mixes the channels (multiply and sum). If that’s the case, any modern CPU can handle it on a single core without breaking a sweat, even more so if you process those channels using SIMD intrinsics, where you can do 4 operations in a single instruction. What I’d do is: implement your mixer, profile it, and start optimizing hotspots.
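To illustrate the “4 operations in one instruction” point, here is a minimal, x86-only sketch of a mix-bus inner loop using SSE intrinsics (the function name is hypothetical; unaligned loads are used for simplicity, and `numSamples` is assumed to be a multiple of 4):

```cpp
#include <immintrin.h>   // SSE intrinsics (x86/x64 only)

// out += in * gain, processing 4 floats per instruction with _mm_mul_ps --
// the core of a mix bus, sketched by hand
void mixAdd4 (float* out, const float* in, float gain, int numSamples)
{
    const __m128 g = _mm_set1_ps (gain);              // broadcast gain to 4 lanes

    for (int i = 0; i < numSamples; i += 4)
    {
        __m128 src = _mm_loadu_ps (in + i);           // load 4 input samples
        __m128 dst = _mm_loadu_ps (out + i);          // load 4 output samples
        dst = _mm_add_ps (dst, _mm_mul_ps (src, g));  // 4 multiplies, 4 adds
        _mm_storeu_ps (out + i, dst);                 // store 4 results
    }
}
```

In JUCE you’d normally reach for the buffer methods (or FloatVectorOperations) instead of writing this yourself; the intrinsics are just to show what happens per instruction.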
Multithreading audio is a pain in the ass and usually has more disadvantages than advantages, so I’d say it only makes sense to go multithreaded if you are building a whole DAW with tons of processing modules.
If you end up with a good multithreaded solution, I’ll be glad to hear how you did it though!
I tried to accelerate some block-wise audio rendering a few years ago, without much in-depth knowledge back then, using OpenMP pragmas. The result was much slower execution. What I guess was the problem is that all the thread creation/destruction is hidden away from you. These solutions are aimed at much bigger problem sizes than our usual audio block sizes; in those contexts, the overhead of spawning a new thread might be negligible compared to the computational boost. And while I haven’t taken an in-depth look at C++17 execution policies, I guess those implementations might suffer from the same problem.
If you really want to do multithreaded, realtime-safe audio rendering, you have to manage the threading, including synchronization, explicitly, and you should not use convenience libraries that hide the existence of threads.
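A minimal sketch of what “managing the threading explicitly” could look like: a persistent worker that the audio callback signals via atomics, with a crude spin-wait for completion. This is illustrative only (the `Worker` type and its fixed 32-sample buffer are invented for the example); a real implementation would also set thread priorities and use something gentler than a pure spin, e.g. a semaphore:

```cpp
#include <atomic>
#include <thread>
#include <vector>

// A persistent worker that the audio callback signals via atomics --
// no thread creation per block, no locks on the audio thread.
struct Worker
{
    std::atomic<bool> start { false }, done { true }, quit { false };
    std::vector<float> buffer = std::vector<float> (32, 1.0f);  // this worker's share
    std::thread thread { [this] { run(); } };

    void run()
    {
        while (! quit.load (std::memory_order_acquire))
        {
            if (start.exchange (false, std::memory_order_acquire))
            {
                for (auto& s : buffer)      // process this worker's channels
                    s *= 0.5f;

                done.store (true, std::memory_order_release);
            }
        }
    }

    // called from the audio callback: kick the worker...
    void kick()
    {
        done.store (false, std::memory_order_relaxed);
        start.store (true, std::memory_order_release);
    }

    // ...then, after processing our own share, spin until it has finished
    void waitForCompletion()
    {
        while (! done.load (std::memory_order_acquire)) {}
    }

    ~Worker()
    {
        quit.store (true, std::memory_order_release);
        thread.join();
    }
};
```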
On the other hand, convenient wrappers for SIMD operations like Intel IPP or the more technical juce::SIMDRegister class are a good thing. That is an easier kind of parallelism to handle, as those operations run on a single thread / a single CPU core and are just a parallelized way for your CPU to access the data.