Hello, I have successfully converted a large project so it can safely be processed in parallel. I currently have 6 generators, 3 oscillators, and 3 samplers, and I was having success processing the subsequent channels using a thread in the process block.
However, I’ve noticed that at higher sample rates and smaller buffer sizes, the idle CPU usage is higher. I was curious whether anyone knows how to spin up multiple realtime threads (or a thread pool) as a member of the class, to be accessed in the process block.
I am currently using std::thread to launch a function and join it in the process block. However, I feel there has to be a better way than creating a thread every block, which costs roughly 100,000 nanoseconds.
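For reference, a common way to avoid paying the thread-creation cost every block is to keep one persistent worker alive and wake it per block. Below is only a minimal standard-library sketch of that idea (the class name and structure are mine, not from any JUCE API, and waking via a condition variable is still a system call, so this is not strictly real-time-safe):

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>

// A persistent worker created once (e.g. in the constructor or prepareToPlay)
// and woken for each block, instead of constructing and joining a std::thread
// inside processBlock. Usage per block: startJob(...) then waitForJob().
class BlockWorker
{
public:
    BlockWorker() : worker ([this] { run(); }) {}

    ~BlockWorker()
    {
        {
            std::lock_guard<std::mutex> lock (m);
            shouldExit = true;
        }
        cv.notify_one();
        worker.join();
    }

    // Hand the worker a job for this block (call waitForJob before the next one).
    void startJob (std::function<void()> job)
    {
        {
            std::lock_guard<std::mutex> lock (m);
            pending  = std::move (job);
            jobReady = true;
        }
        cv.notify_one(); // still a system call, but far cheaper than thread creation
    }

    // Block until the worker has finished the current job.
    void waitForJob()
    {
        std::unique_lock<std::mutex> lock (m);
        cv.wait (lock, [this] { return ! jobReady; });
    }

private:
    void run()
    {
        for (;;)
        {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock (m);
                cv.wait (lock, [this] { return jobReady || shouldExit; });
                if (shouldExit)
                    return;
                job = std::move (pending);
            }
            job(); // do the actual per-block work outside the lock
            {
                std::lock_guard<std::mutex> lock (m);
                jobReady = false;
            }
            cv.notify_one(); // wake the audio thread waiting in waitForJob
        }
    }

    std::mutex m;
    std::condition_variable cv;
    std::function<void()> pending;
    bool jobReady = false, shouldExit = false;
    std::thread worker; // declared last so the other members are initialized first
};
```

The trade-off is that the wake/wait handshake still involves the scheduler, which is part of why this pattern gets worse at small buffer sizes.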
I’ve been looking into the JUCE ThreadPool, but it seems like a surefire way to blow my foot off. The documentation says that if it can’t create a thread at the requested priority it won’t run, but I don’t want to make the priority too low either.
How do I decide on the right priority without blowing my foot off? Is there a better method? Would you recommend a thread pool?
Thanks,
P.S.
“If you had one shot… to multithread all the audio you ever wanted, would you prioritize it… or would you let it slip, to the efficiency cores?” - a misquoted rap lyric
This is more likely to harm your performance than help it.
It is not possible to correctly use multiple realtime threads that communicate with each other without thread synchronization mechanisms, which are likely to eliminate any performance gain from the multithreading, not to mention making the code much more complex to reason about.
After testing many different threading options, I am forced to agree. That said, I did find that multithreading is viable at roughly 256 samples and above, with a 2048-sample buffer giving the best results, provided the threads were used to populate buffers independently of each other. If anyone is comfortable with allocation in processBlock… (*yikes)
I think ARM is pushing for this kind of parallelism with C++20, but I don’t know. It was honestly fun and a fresh take on programming for me to try, and I even found some ordinary threading issues along the way. Maybe if I could “guarantee” that each thread’s work would take longer than 100 microseconds it would be worth it, but otherwise I don’t think that’s possible without some sort of latency or sample-synchronization madness that would end up increasing CPU usage, like you mentioned.
I ran into that with a pitch algorithm running in the threads: CPU usage increased dramatically while waiting for the longer processes to finish.
Side note question on threads… I couldn’t find a “priority” setting on std::thread. Do you know whether JUCE threads are green threads, or whether VST3 threads are considered VM green threads?
I got a daemon thread to work, but when I recursively printed from it, it didn’t seem… fast… before it broke. lol
Thank you for saying something, otherwise I would have been pulling my hair out and convincing myself that if I just… think… harder, I will figure it out.
It’s definitely a pain to manage and may not be as efficient with very small buffer sizes, but I’ve had success with multi-threading for synth voice processing.
I use a custom thread pool class (built on juce::Thread) with atomic operations to check the remaining ‘jobs’, but all jobs are the same callback type… one per synth voice… so the processing time is very similar, because the signal chain of each voice is identical.
There’s no allocation in processBlock: the threads are pre-allocated and then sleep until notified to wake on each callback. Whilst I’ve not seen it happen, if a juce::Thread ‘startThread’ call fails, I simply won’t add that thread to my pool.
There is of course thread synchronization in terms of waiting for all threads to complete on each processBlock.
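For anyone curious what that shape looks like, here is a minimal standard-library sketch of the pattern described above: pre-allocated threads that sleep between callbacks, one job per voice, and the audio thread waiting for the remaining-jobs count to reach zero. It uses std::thread and condition variables in place of juce::Thread’s wait()/notify(); all names are illustrative, and this is not the poster’s actual code:

```cpp
#include <algorithm>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// One pre-allocated thread per voice. Workers sleep on a condition variable
// between audio callbacks; renderAllVoices() wakes them all, then blocks
// until the count of remaining jobs reaches zero.
class VoicePool
{
public:
    explicit VoicePool (std::vector<std::function<void()>> voiceJobs)
        : jobs (std::move (voiceJobs)), pending (jobs.size(), 0)
    {
        for (size_t i = 0; i < jobs.size(); ++i)
            workers.emplace_back ([this, i] { workerLoop (i); });
    }

    ~VoicePool()
    {
        {
            std::lock_guard<std::mutex> lock (m);
            exiting = true;
        }
        wake.notify_all();
        for (auto& t : workers)
            t.join();
    }

    // Audio thread: kick off every voice job, then block until all are done.
    void renderAllVoices()
    {
        std::unique_lock<std::mutex> lock (m);
        remaining = (int) jobs.size();
        std::fill (pending.begin(), pending.end(), char (1));
        wake.notify_all();
        done.wait (lock, [this] { return remaining == 0; });
    }

private:
    void workerLoop (size_t index)
    {
        for (;;)
        {
            {
                std::unique_lock<std::mutex> lock (m);
                wake.wait (lock, [&] { return pending[index] != 0 || exiting; });
                if (exiting)
                    return;
                pending[index] = 0;
            }
            jobs[index](); // render this voice's block, outside the lock
            std::lock_guard<std::mutex> lock (m);
            if (--remaining == 0)
                done.notify_one(); // last voice finished: wake the audio thread
        }
    }

    std::vector<std::function<void()>> jobs;
    std::mutex m;
    std::condition_variable wake, done;
    std::vector<char> pending; // per-voice "job ready" flags, guarded by m
    int remaining = 0;
    bool exiting = false;
    std::vector<std::thread> workers; // declared last: members above must init first
};
```

As discussed later in the thread, the notify/wait handshake per callback is a system call, which is the main cost of this design at small buffer sizes.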
Performance has been great with Intel and AMD CPUs, but now I’m getting issues on Apple silicon with the E/P core split and threads getting demoted to E-cores. That is going to require making sure threads join the same audio workgroup (something in Apple’s audio APIs I only became aware of recently; see the discussion on the forum about that).
The latest JUCE thread API updates with ‘startRealtimeThread’ and RealtimeOptions don’t help with workgroups or multiple real-time threads. In fact it’s partly worse: the default priority requested in 7.0.3 is ‘high’, and on an M1 Mac I see threads dropping to E-cores when my standalone app loses focus to another app window, unless I force the priority to ‘highest’.
I’ve also had it suggested that I look into running my synth voices, or at least synth layers, as separate processes… but that is too much for my brain right now (some DAWs run track/plugin instances in separate processes; Reaper is one I tested, and it works well even on Apple silicon).
If you’re notifying/waking all your worker threads during each process block, I’m surprised that gives you good performance, because that is a system call.
I have to be honest: I didn’t consider too hard what is going on below the surface, system-call-wise. I use juce::Thread::wait(-1) to wait after task completion and juce::Thread::notify() to start all the threads back up to process a task list on each audio callback, and it’s been working fine with multiple polyphonic synth layers (Hyperion synth).
Profiling shows parallel processing of the various audio processing blocks in patches working as expected, and the usage-monitor calculation in the main audio callback shows plenty of spare time remaining, in line with what DAWs report as the audio load.
I only started seeing issues recently with Apple silicon thread demotion from P- to E-cores. It hurts especially on the M1 Max, but should be resolved with the audio-workgroups API, in theory.
At least my plugin has the option to run the audio processing single-threaded (or with reduced threads) if need be, but then the audio load is significantly worse on those systems that don’t have the thread-core assignment demotion problems.
If you’re notifying/waking all your worker threads during each process block, I’m surprised that gives you good performance, because that is a system call.
<— how would you do it differently?
This works pretty well, and this is how a multi-core DAW would do it as well, or am I missing something?
Multi threading done by the DAW is a bit different. Personally, I would avoid multiple audio threads in a plugin; to me it seems like asking for trouble.
How is this different in a DAW?
The only difference to me is that only the DAW is doing it, as opposed to possibly multiple plugins, but I don’t see how a DAW would implement it differently.
No, that is exactly the difference. If every plugin tries to spin up 10 of its own realtime threads, the performance of the entire DAW will be degraded.
You said:
If you’re notifying/waking all your worker threads during each process block, I’m surprised that gives you good performance, because that is a system call.
then you said:
Multi threading done by the DAW is a bit different
You’re not answering the question here.
I agree that if every plugin does it, it will be a mess (I’m not even sure you’re using audio workgroups), but that was not the point.
I’m no expert on writing DAWs, but the DAW is in control of the entire audio graph, so I’d imagine there are some sophisticated things it can do to schedule and keep all signal paths synchronized. In a plugin, you’re just a node in the DAW’s audio graph, so the best thing you can do is keep your processBlock’s execution time as deterministic as possible, which is difficult when using multiple threads.
It sounds like @wavesequencer has tested this and it seems to work, so that’s great. But I still would not recommend this to anyone, I still think the best answer to the original question of “how can I speed up a synth using multithreading” is “multithreading isn’t the best answer”.
There is no magic/sophisticated thing a DAW does that you can’t do in your plugin.
You can be tricked if the DAW processes you with a different buffer size than the driver’s audio latency, but you can probably detect it.
Yes, this is hard, tricky, and error-prone.
But it is possible.
Again, this is hard, tricky, and error-prone.
But if there are only a few plugins, it may be worth it.
Still, the best way is to do this in cooperation with the DAW; see the CLAP multithreading extension.
But given the number of cores on modern systems, if your plugin really requires it, then it is worth it.
First off, a disclaimer: I’m a relative newb to developing synth plugins. I’ve only made one so far, and I don’t have deep knowledge of the inner workings of OSs/schedulers; I’ve just gone down this route by testing various options and seeing what appeared to work. I would not be surprised if I’ve done something ‘bad’ in my implementation, but testing has shown good, reliable results (until the issue with P/E-core thread demotion on Apple silicon).
For a complex plugin like mine, which offers up to 16 simultaneous layers of 32-voice polyphonic synth patches with pretty much unlimited patch architecture/number of processing blocks, it seems essential to offer multi-threading (and the performance is amazing on many-core processors). But it’s also right to offer a low- or single-threaded mode, so the user can choose based on how they want to use it: alongside many other plugins, or as the main plugin.
For a more straightforward single-layer/few-oscillator synth or effect, the benefits of multi-threading are probably outweighed by the complexity and potential downsides for most developers.
I think the CLAP hosting approach is great (from what I’ve initially understood, it gives the DAW more control over how to allocate individual plugin voices across cores/threads, also using a thread pool), and I’m waiting for JUCE to support it out of the box so I can include it easily. In Hyperion I am doing pretty much what I understand CLAP will enable in terms of treating each voice as a separate thread, except that currently I have to manage it rather than the DAW. Without that capability, running my plugin in a DAW in single-threaded mode causes a very unbalanced load on modern CPUs with many cores, which can actually lead to worse performance for other plugins being processed on the same core/main audio callback thread than if multi-threaded processing is enabled.
Going forward, I will probably provide multi-threaded processing of voices as an option in any further synths I develop; it’s an option that can work well and can easily be turned on or off. I need to figure out the Mac audio workgroups thing ASAP, but I really hope JUCE can bring that support soon, because I really don’t want to have to write and manage platform-specific hacks.
I just saw today a flag that enables pthreads on Mac in JUCE. I don’t know its default, though:
-pthread: This flag is needed on Linux and macOS to enable multi-threading support in JUCE.
I can’t comment on that, but whatever thread API you use, you definitely want to be joining your threads to the ‘audio workgroup’ of the plugin processor’s main audio callback if you intend to properly support multi-threaded audio on Apple silicon. I didn’t enable or set any flags in the JUCE API to start multiple threads; I just create them (juce::Thread class) and run them.
If you have a look at other posts I made on the forum recently, they show how I handled the audio workgroup API (for standalone and AU plugins). If you use that together with JUCE’s new thread priority mechanism, then macOS will select the appropriate core(s) and prioritize threads as it sees fit, in theory providing the most reliable performance. This appears to have solved the issues I mentioned before about real-time audio threads having their priority dropped unexpectedly. (There is still no clear solution for VST3 on Apple silicon, though, as there is no VST3 API function to query the audio workgroup of the plugin’s main audio callback, at least not yet as far as I know.)