How to use multithreading so each synth voice renders on a different CPU thread/core?

We now live in an era of 64-core processors, yet individual core speed is not increasing significantly. This means that if we want to really push the limits of what's available, we need to use multithreading.

I am working on a synth where the complexity/quality of the rendering is scalable to an extent, and if I want to push it up, it would be nice to be able to switch it so the voices all render on different cores. The parent synthesiser itself could take yet another core or share with one of the others. So a 6-voice synth might use 6-7 cores.

I am wondering how I would do this in the context of an MPE synthesiser. Even more complex than that, the next question would be: Is it possible to synchronize the processing of these cores sample-by-sample so that I can feedback outputs from one voice into the others for crosstalk effects?

Let's say my rendering right now is done in PluginProcessor.cpp as:

void AudioPlugInAudioProcessor::processBlock (AudioBuffer<float>& buffer, MidiBuffer& midiMessages) {
	const ScopedLock renderLock (lock);
	ScopedNoDenormals noDenormals;
	buffer.clear();
	mMpeSynth.renderNextBlockCustom (buffer, midiMessages, 0, buffer.getNumSamples());
}

Then in my MPE Synthesiser I have the following:

void renderNextBlockCustom (AudioBuffer<float>& outputAudio, const MidiBuffer& inputMidi, int startSample, int numSamples) {
	MPESynthesiser::renderNextBlock (outputAudio, inputMidi, startSample, numSamples);
	//custom block based processing
	//...
}

void renderNextSubBlock (AudioBuffer<float>& buffer, int startSample, int numSamples) override {
	renderOneSample (buffer, startSample, numSamples);
	//to force sample by sample processing in renderNextBlock
}

void renderOneSample (AudioBuffer<float>& buffer, int startSample, int numSamples) {
	const juce::ScopedLock sl (voicesLock);

	for (auto i = 0; i != numSamples; ++i) {
		const auto sampleIndex = startSample + i;

		for (auto* voice : voices) {
			if (voice->isActive()) {
				MPESynthesiserVoiceInherited* voiceInherited = (MPESynthesiserVoiceInherited*) voice;
				//sample by sample retrieval of outputs from the voices
				//put output from other voices back into each voice
				voiceInherited->renderNextBlock (buffer, sampleIndex, 1);
			}
		}
	}
}

I believe that works to get the sample-by-sample output of each voice, or each voice's internal values, into the synth and put them back into each other voice so they're shared. (I tested it and had it working with that approach in principle.)

With that general architecture, how would I go about starting to specify if a voice goes to a given core? Where do I create the threads and tell each voice to get its own?

I see from this thread some ideas for how it might work, but that's over my head a bit. I understand from it that I would need to start as many threads as I wanted in prepareToPlay(), like this:

const int numThreads = 4;
OwnedArray<TimeSliceThread> threads;
for (int i = 0; i < numThreads; ++i) {
    threads.add (new TimeSliceThread ("Worker " + String (i)))->startThread();
}

But the other stuff discussed in that thread seems very specific to what that person was asking about, and I'm not sure how to generalize a solution to my synth. I've never used the Thread classes in JUCE or coded anything with control over the threading, which work goes to which thread, or how many threads there are.

Are there any basic points or example code you could provide that might help me understand how to do this?

Even if I can't synchronize the cores for the sample-to-sample voice feedback, I'd still be happy, at least to start, to be able to force each voice onto a different thread.

Thanks for any guidance.

I made one synth with a thread-pool, i.e. it spawned 4 threads which awaited commands from the 'main' thread. So stuff to research would be std::thread, non-blocking FIFOs for communicating with the 'workers', and condition variables to signal threads when work is available.
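Roughly this shape for one worker (a bare-bones sketch with made-up names; a real-time version would swap the mutex-guarded queue for a lock-free FIFO so the audio thread never blocks while pushing jobs, but the wait/notify pattern is the same):

// Bare-bones worker: sleeps until the 'main' thread hands it a job.
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

class Worker
{
public:
    Worker() : thread ([this] { run(); }) {}

    ~Worker()
    {
        {
            std::lock_guard<std::mutex> lock (mutex);
            shouldExit = true;
        }
        wake.notify_one();
        thread.join();
    }

    // Called from the 'main' thread to hand the worker a job.
    void push (std::function<void()> job)
    {
        {
            std::lock_guard<std::mutex> lock (mutex);
            jobs.push (std::move (job));
        }
        wake.notify_one();   // signal the worker that work is available
    }

private:
    void run()
    {
        for (;;)
        {
            std::function<void()> job;

            {
                std::unique_lock<std::mutex> lock (mutex);
                wake.wait (lock, [this] { return shouldExit || ! jobs.empty(); });

                if (shouldExit && jobs.empty())
                    return;

                job = std::move (jobs.front());
                jobs.pop();
            }

            job();   // e.g. render one voice into its own buffer
        }
    }

    std::mutex mutex;
    std::condition_variable wake;
    std::queue<std::function<void()>> jobs;
    bool shouldExit = false;
    std::thread thread;   // declared last so the other members exist before run() starts
};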

Wouldn't the overhead of scheduling & synchronizing between cores basically eliminate any speed gains you get from parallelism?

And what kind of synth algorithm necessitates this? Just because we have 64-core machines doesn't necessarily mean that computers are incapable of performing many tasks using just one or two of their cores…

I would be very surprised if you managed to write DSP code for a single synth voice instance that uses 80% of a CPU core all by itself. (If a single synth voice is using that much CPU by itself, I'd take that more as an indicator that you need to write better DSP code rather than "my computer's hardware can't handle this".)

Let's presume hypothetically that I am doing something that needs all that processing power. Some methods of real-world physics modeling still far exceed anything a single core can do. This is a hobby of mine, so I don't mind things I make being impractical for other users. Things I make are mainly for me and my interests. Parallelism would help because if each core needs to run at 80% to render a voice, then 6 cores in parallel can render 6 voices. Even if there is an extra burden of CPU to synchronize and schedule, I can't imagine it would cost that much.

However, I don't know how difficult it would be to synchronize the cores/threads, as I've never done this before.

So the question is basically: How do I get 6 threads started and assign each voice to one of them in my PluginProcessor or MPESynthesiser sections?

Even if each voice runs through each sample block in its own manner, in a non-sample-synchronized fashion, I would be happy to figure that out to start. If I can make that happen, there's no reason I couldn't run it sample-by-sample to maintain sync. The block will still have to be read out through my PluginProcessor and/or MPESynthesiser as a unit either way and passed off to each voice/thread, so I actually don't think the synchronization will be an issue. As long as each thread does its task, there's probably no difference whether the synth gives each thread the whole block at a time or sample by sample.

I think hypothetically the PluginProcessor/MPESynth could give the MIDI/input buffer per block to each voice/thread, they could each output an audio buffer to give back to the PluginProcessor/MPESynth, and then I could sum those together to get the final output. If the threads are all working on their own copies of the input/output buffer, then there would be no conflicts from them trying to write/read the same data simultaneously. I would just need some way to confirm when they are all "done" in the PluginProcessor/MPESynth to sum the result and give them each the next block.
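In rough pseudo-JUCE, the hand-off I'm imagining would be something like this, sitting inside the processor/synth (voiceThreads, startRendering and getRenderedBuffer are hypothetical placeholders; I haven't tested any of it):

// Each worker renders into its own AudioBuffer, a counter tracks how many are
// still busy, and the audio thread waits for the last one before summing.
std::atomic<int>    voicesStillRendering { 0 };
juce::WaitableEvent allVoicesDone;

void startVoiceJobs (const juce::MidiBuffer& midi, int numSamples)
{
    voicesStillRendering.store ((int) voiceThreads.size());
    allVoicesDone.reset();

    for (auto* voiceThread : voiceThreads)                  // hypothetical worker objects
        voiceThread->startRendering (midi, numSamples);     // each fills its own buffer
}

void voiceFinished()                                        // called by each worker when done
{
    if (voicesStillRendering.fetch_sub (1) == 1)            // last one out signals the event
        allVoicesDone.signal();
}

void sumVoices (juce::AudioBuffer<float>& output)
{
    allVoicesDone.wait();                                   // block until every voice reports done

    for (auto* voiceThread : voiceThreads)                  // assumes matching channel counts
        for (int ch = 0; ch < output.getNumChannels(); ++ch)
            output.addFrom (ch, 0, voiceThread->getRenderedBuffer(),
                            ch, 0, output.getNumSamples());
}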

JUCE obviously has Thread classes built in to give you the freedom to create threads, so I presume it is possible to do so for this as well. It's just hard to find any basic information on how to use them or make them work.

Any help would be appreciated. Thanks

The Juce thread class is amazing and helpful and wonderful, BUT it has the unfortunate effect of causing confusion for newer developers. I think that people see a "Thread class" and begin to make the assumption that a class "lives on" or is "tied to" a thread.

You cannot "assign a synth voice to a thread". You can create a thread, and have that thread call a method of your synth voice.

The reason this is an important distinction is that any work you want to offload to your additional threads, you must offload manually. The thread will be spinning in a run() loop, and to get it to call a method of your synth voice, you have to send a message to the thread to tell it "hey, time to run this method".

What this means is, any communication from the main synth to the voices now becomes a huge pain, and any communication from the voices back to the synth now becomes a huge pain. There are many edge cases you will run into:

  • When the processor has some more work for a voice to do, maybe that voice thread is already completing another job
  • When the processor needs to sum all the voices together and output the aggregate audio, maybe some of them aren't done yet
  • and many, many more…

Take a look at the basic way that the Juce synth's processNextBlock function works: https://github.com/juce-framework/JUCE/blob/90e8da0cfb54ac593cdbed74c3d0c9b09bad3a9f/modules/juce_audio_basics/synthesisers/juce_Synthesiser.cpp#L157

What it does is break the audio buffer into small chunks in between each midi message, so that the midi stays synchronous with the audio. But what ends up happening is this:

parent logic → get a few samples from voices → parent logic → get a few samples from voices → …

Now, imagine that every single time you want to communicate the "parent logic" to the voices, and every time you want to get a few samples from the voices, you have to deal with synchronous callbacks, thread waiting, possibly FIFOing… each of which you must address individually for every single one of your voice threads, at every single one of the moments where you want to communicate parent → voices or voices → parent.

One possible solution for the parent → voice communication bottleneck would be to have the midi handling logic replicated in each thread as well, so that they actually don't depend on one central "parent"… but again, that adds 6x the overhead for the "parent" logic being done that many times, plus possible inconsistencies, plus now info about the synth's current state is harder to access from outside… etc etc etc

All of this goes to say, I think this is opening a very very deep rabbit hole, chasing an architectural decision that I believe is ultimately a bad one and that I suspect will actually worsen the performance of your app.

Think about what you just suggested here:

  • Processor "sends" audio to a worker thread (even though that's not really how it works)
  • Processor waits while worker thread processes audio
  • Worker thread sends audio back to processor

This scenario achieves nothing except making things more complicated and error-prone, because the processor would still have to wait for the worker thread to process and then "send back" its audio. In fact, this would probably take more time than just doing it in the audio thread, for several reasons:

  • Transit time (i.e. the worker thread telling the processor "ok I'm done now, take the audio back", plus this will probably add a few copying operations)
  • The main audio thread Juce sets up for you is run by the OS as a high-priority thread, so the OS will be nice to it and give it more time and, if necessary, interrupt other threads so that the audio thread can do its work.

In 99.99% of cases, the best practice is to do all of your audio rendering in the audio thread that Juce provides for you.

Thanks for indulging me Ben. I appreciate your thoughts and insight. I'm learning from what you said.

That helps me understand more how threads work. As you described, they are basically just "run" loops that I would have to instruct and juggle very carefully.

But in my hypothetical situation, where say I have each voice using 80% of a CPU core, and I load 6 instances running in parallel inside my DAW, what is the difference there?

Every instance must remain synchronized. They don't need to be sample-synchronized, because they're each working on a buffer, but the buffers all need to be done by the end of the block so they can be put together by the DAW into a final sound buffer that works.

So the DAW is managing all 6 of these synth instances to ensure that happens, right?

So there must be some way to practically juggle multiple audio threads and still have them synchronized at least buffer to buffer.

If I have an array of 6 threads within my main synth, then couldn't I just take my piece here:

void renderOneSample (AudioBuffer<float>& buffer, int startSample, int numSamples) {
	const juce::ScopedLock sl (voicesLock);

	for (auto i = 0; i != numSamples; ++i) {
		const auto sampleIndex = startSample + i;

		for (auto* voice : voices) {
			if (voice->isActive()) {
				MPESynthesiserVoiceInherited* voiceInherited = (MPESynthesiserVoiceInherited*) voice;
				//sample by sample retrieval of outputs from the voices
				//put output from other voices back into each voice
				voiceInherited->renderNextBlock (buffer, sampleIndex, 1);
			}
		}
	}
}

And run that voiceInherited->renderNextBlock(buffer, sampleIndex, 1); line via a different thread for each voice? I.e. thread 1 for voice 1, thread 2 for voice 2, etc.?

As long as each thread is allocated to a different CPU core and all cores are fast enough to keep up with each of the voices, there would be no reason to expect them to desynchronize or fall behind.

Since they're working in parallel, I still think I'd have to let them each work on their own copy of the buffer and then sum the audio from all the threads/voices at the end before the next buffer is called.

I think I can't directly control how the threads are distributed to cores by Windows or my DAW. But in general I believe Windows/DAWs will split up high-load threads among various CPUs. If I load two standalone synth or VST plugin instances that are using 80% of a core each, they will automatically be split to two different cores so they can run smoothly. So I think Windows or my DAW would automatically do the same with my threads.

What do you think? Is that making a bit more sense? I don't mind this failing or creating other weird behaviors. Either way it is worthwhile for me to try, because the alternative of loading a new synth instance for every voice just so each voice can get its own thread is very inefficient as well. So either way it's a bit of a pain in the ass. I'd rather see if I can create the threads myself and make them cooperate.

Yes, and it takes millions of lines of code to make a product like Ableton Live that can juggle multiple signal chains each getting their own processing thread and working reliably in real time. It is by no stretch of the imagination an easy, or even reasonably achievable task.

Like I said before, that's not how it works. Each thread object will have a run loop constantly spinning that might look something like this, in pseudocode:

void run() override
{
    while (! threadShouldExit())
    {
        if (parentNeedsANewSample)
            calculateNextSampleValue();

        if (parentWantsToDoSomeMidiLogic)
            reactToAMidiEvent();
    }
}

So the "voice thread" itself might put some samples into a buffer owned by the SynthVoice object during its calculateNextSampleValue() function, and then the actual synth can do:

for (auto* voice : voices) {
    if (voice->isActive()) {
        if (voice->hasASampleReady()) {
            grabNextSampleFromVoiceBuffer();
        } else {
            // uh oh! now we're in trouble!
        }
    }
}

So you can see how keeping things in sync between what samples the thread is working on and which ones the processor is grabbing can be quite a headache.

Actually, the opposite is true. You can never make any assumptions about CPU clock speeds, or thread execution times. That will only ever get you into trouble.

You can't really assume that. The DAW won't do anything at all with your threads, and the OS may decide to schedule your threads on the same core as some other running processes, depending on the workload of your computer and the way the kernel's scheduler is written.

One of the reasons I am warning you against doing this is that multithreading is very unpredictable. There are many scenarios in which it can result in worse (and like, much much worse) performance than if you had just written it the easy way.

I have to question again why it's so absolutely necessary that every single synth voice gets its own core. What kind of algorithm are you running, exactly?

I think it is much, much easier to write a more efficient & optimized audio rendering algorithm than it is to write a multithreaded synthesizer that is stable enough for live usage.

FYI: there are interesting threads about that on this forum (search for "thread affinity").

For instance:

it takes millions of lines of code to make a product like Ableton Live that can juggle multiple signal chains each getting their own processing thread and working reliably in real time. It is by no stretch of the imagination an easy, or even reasonably achievable task.

I was afraid you would say that. It occurred to me as well when I started thinking about the similarity.

Thinking about this more, I am starting to understand more why buffers exist. Getting each thread sample-by-sample synced would essentially mean running them all on a one sample buffer with a sample rate timer and hoping they can all keep up. I can see why that would be harder than using a buffer and letting each render on its own.

If you are curious what types of synthesis far exceed the capacity of any single core and why this is of interest to me, you can read about finite element modeling. Depending on the equations required to model the physics you are simulating and the complexity/detail of the mesh, it is actually very easy to exceed what a 4-5 GHz core can do, no matter who is programming it.

For example, some researchers modeled a snare drum in this manner and built a visualizer for it, but it could in no way run in real time, and it's probably still an oversimplification. This is essentially my hobby, and it is always costly unless you use a vast oversimplification of your engineering principles to describe reality (and then that won't give the same outcomes).

I understand there may be challenges or glitches, but I would still at least like to be able to set up some threads and try running the different voices through them on a normal buffer system (no sample-by-sample sync). I'd be curious to at least see what happens.

I spent some time playing with the Thread class, but I am having trouble implementing what I am imagining from your guidance so far. I am good enough at DSP for what I like to do but not very good at C++ syntax.

I can see essentially three spots I could send the task to a custom thread to run:

Assign thread the task at the PluginProcessor level:

void AudioPlugInAudioProcessor::processBlock (AudioBuffer<float>& buffer, MidiBuffer& midiMessages) {
	mMpeSynth.renderNextBlockCustom (buffer, midiMessages, 0, buffer.getNumSamples());
}

Assign thread the task at the MPESynthesiser at the block or sub-block level:

void renderNextBlockCustom(AudioBuffer<float>& outputAudio, const MidiBuffer& inputMidi, int startSample, int numSamples) {
	MPESynthesiser::renderNextBlock(outputAudio, inputMidi, startSample, numSamples);
}

void renderNextSubBlock(AudioBuffer<float>& buffer, int startSample, int numSamples) override {
	...
}

I would imagine the task would be simplest if I tackle it from the highest level to start (PluginProcessor).

Let's say I just want to start by running a single custom thread in my PluginProcessor which will handle the buffer-based processing of mMpeSynth.renderNextBlockCustom(buffer, midiMessages, 0, buffer.getNumSamples());. No breaking up into voices yet or multiple threads. I just want to have a thread running at that level and assign it to run() that task.

How would I do that?

I thought I could create a class based on Thread and pass in the buffer and synth pointer to it. Then I could specify under its run() override that I want it to perform the mMpeSynth.renderNextBlockCustom(buffer, midiMessages, 0, buffer.getNumSamples()); task.

If I can do that, I can start to see how the threads get managed and make multiple to start specifying which voice each should process to see what happens.

Here is my attempt at creating a Thread inherited class that would take a pointer to my MPE Synthesiser so I can run the mMpeSynth.renderNextBlockCustom within it:

class ThreadInherited : public Thread {

public:
	ThreadInherited (const String& threadName, MPESynthesiserInherited* mMPESynthIn, size_t threadStackSize = 0) : Thread (threadName) {
		mMpeSynthPtr = mMPESynthIn;
	}

	AudioBuffer<float> returnLastBuffer() {
		return outputAudioBuffer;
	}

	void inputNewBuffer (AudioBuffer<float> outputAudio, const MidiBuffer& inputMidi, int startSampleIn, int numSamplesIn) {
		outputAudioBuffer = outputAudio;
		inputMidiBuffer = inputMidi;
		startSample = startSampleIn;
		numSamples = numSamplesIn;
		newBufferToProcess = true;
	}

	void run() override {
		while (!threadShouldExit()) {
			if (newBufferToProcess) {
				mMpeSynthPtr->renderNextBlockCustom (outputAudioBuffer, inputMidiBuffer, 0, outputAudioBuffer.getNumSamples());
				newBufferToProcess = false;
			}
		}
	}

private:
	MPESynthesiserInherited* mMpeSynthPtr;
	AudioBuffer<float> outputAudioBuffer;
	MidiBuffer inputMidiBuffer;
	int startSample = 0;
	int numSamples = 512;
	bool newBufferToProcess = false;

};

Then in my PluginProcessor under private: I would want to create ThreadInherited testThread("test", &mMpeSynth); and run startThread on initialization.
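I.e. roughly this (just how I picture the declarations, using the names from my code above; I haven't verified the details):

// In PluginProcessor.h (sketch; mMpeSynth is declared before testThread so the
// pointer I pass in already exists when the thread object is constructed):
private:
	MPESynthesiserInherited mMpeSynth;
	ThreadInherited testThread { "test", &mMpeSynth };

// and in the PluginProcessor constructor:
//	testThread.startThread();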

My best thought for feeding it blocks would then be:

void AudioPlugInAudioProcessor::processBlock (AudioBuffer<float>& buffer, MidiBuffer& midiMessages) {
	const ScopedLock renderLock (lock);
	ScopedNoDenormals noDenormals;
	buffer.clear();

	AudioBuffer<float> outputBuffer = testThread.returnLastBuffer();
	testThread.inputNewBuffer (buffer, midiMessages, 0, buffer.getNumSamples());
	buffer = outputBuffer;
}

The idea is that every time processBlock is called (i.e. a new buffer comes through), my testThread will return the last processed block it ran, and then a new buffer will get inserted into it to run. testThread only runs calculations when there is an unfinished buffer to work through, then rests when there is nothing to do.

Would this make sense in principle just for this first task (implementing a single custom thread)? It is at least not giving me any syntax errors now.

Thanks again. I really am starting to understand the hazards and risks you describe of running multiple threads but I am also still curious to understand how it would work even in the simplest manner so I can see what happens.

I appreciate any further guidance you can provide.

I can't provide any guidance as I've not done this myself, but theoretically I would second all that Ben said. Of course, if you have the time and the will, you can give it a try; it's always an open question whether

the overhead of scheduling & synchronizing between cores basically eliminate any speed gains you get from parallelism

As a reference, Pianoteq has a multicore option, and Pianoteq's modelling is probably heavy, so it would be comparable. The multicore mode raises CPU usage quite a bit, but it clearly handles polyphony much better when pushed (it overloads much less). So the question is tricky: you may get away with the overhead of threading, even if it's not so small, if your processing is so heavy that it would overload a single core anyway. The main thing to minimize is your worker threads' idle time.

OK, so I'm certainly by no means an expert on this synthesis technique. But reading through this wiki page, it seems that there are some distinct phases to the processing:

  • subdivide the large system into small finite elements to compose your mesh
  • manipulate / process each of the individual elements
  • aggregate the mesh back into a single output

Just spitballing here, but could that first phase of analysis & mesh creation happen just once instead of multiple times for every single voice? Have you already experimented with optimizations like this?

Are there any examples you can point to of this synthesis technique being used in real time? If the answer is "no", there's probably a reason for that.

The basic idea is beginning to be there, yes. But the details will need a lot of attention. Just looking at your processBlock code:

  • why are you using a ScopedLock? You're interested in performant audio and you're writing locks in your rendering code…?
  • in two places you are copying AudioBuffer objects using the = operator, which may trigger buffer resizing and maybe even allocations. I would use the AudioBuffer's copy methods instead (see the sketch after this list).
  • What happens if you call testThread.returnLastBuffer() and it's not done yet, or it hasn't processed any buffers yet?
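On the second point, something roughly like this is what I mean in place of the buffer = outputBuffer; assignment:

// Assumes buffer and outputBuffer already have matching channel/sample counts,
// sized once in prepareToPlay, so nothing resizes or allocates here:
for (int ch = 0; ch < buffer.getNumChannels(); ++ch)
    buffer.copyFrom (ch, 0, outputBuffer, ch, 0, buffer.getNumSamples());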

Even if you manage to get the multithreading up and running, if you are not very very careful with the implementation details like this (move/copy semantics, allocations, etc) then you will end up with code that is much, much slower than if you had not attempted multithreading. (Not to mention crashes, race conditions, data corruption…)

To get this right, you will have to become an expert at very low level C++, and it might take literally years to get your code working to a point that it's stable enough to use as a plugin.

Edit: another issue I noticed: in your ThreadInherited class, you're using outputAudioBuffer to work from in your rendering, but you also directly write incoming data to that buffer in the inputNewBuffer() function. This will cause data races. You likely need to set up some kind of audio FIFO, and in inputNewBuffer it would add samples to the FIFO and in run(), it can check if the FIFO has enough samples and then read from it.
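To sketch the rough shape of that, using juce::AbstractFifo (mono, capacity arbitrary, not a drop-in solution):

// inputNewBuffer() would push into a pre-allocated ring buffer via pushSamples(),
// and run() would pull via popSamples(), so the two threads never touch the same
// region of the ring buffer at the same time.
juce::AbstractFifo fifo { 8192 };
juce::AudioBuffer<float> ringBuffer { 1, 8192 };    // allocated once, never resized

void pushSamples (const float* input, int numSamples)
{
    int start1, size1, start2, size2;
    fifo.prepareToWrite (numSamples, start1, size1, start2, size2);

    if (size1 > 0) ringBuffer.copyFrom (0, start1, input, size1);
    if (size2 > 0) ringBuffer.copyFrom (0, start2, input + size1, size2);

    fifo.finishedWrite (size1 + size2);
}

void popSamples (float* output, int numSamples)     // call only when fifo.getNumReady() >= numSamples
{
    int start1, size1, start2, size2;
    fifo.prepareToRead (numSamples, start1, size1, start2, size2);

    if (size1 > 0) juce::FloatVectorOperations::copy (output,         ringBuffer.getReadPointer (0, start1), size1);
    if (size2 > 0) juce::FloatVectorOperations::copy (output + size1, ringBuffer.getReadPointer (0, start2), size2);

    fifo.finishedRead (size1 + size2);
}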

Disk-streaming sample playback. I measured the performance with 1-8 cores; 4 cores seemed to be the sweet spot, and performed better than one. At least on my PC.

I would be very surprised if you managed to write DSP code for a single synth voice instance that uses 80% of a CPU core all by itself.

In my use-case it is more about the latency of accessing a file on disk, and about not blocking the audio thread while you do so.
