How do DAWs like Cubase handle threading when there are numerous instances of a synth?

I am trying to figure out some performance issues in a complex synth I have built.

At maximum settings, it takes about 4-5 ms to render 1024 samples through it (CPU + GPU processing time together as it uses GPU also).

In theory then, as long as I do not run out of GPU cores (and parallel coordination capacity) and CPU threads/cores, then I should be able to run as many instances as I want, right?

I mean, if each synth runs on its own thread and as a parallel GPU process (hypothetically without hitting any GPU parallelism bottlenecks), and each only processes for 4-5 ms, then one should be able to run as many instances as there are cores (10, 32, or however many the machine has), right?

I have read, however, that when you run synths or multiple tracks through an audio bus, this typically forces them onto a single-threaded pathway. So if you have three synths going to one bus, are they all processed on the same thread? Or are the threads still separate for the synths, then passed into a different mix thread?

What about the mix bus? Does that converge the threads? Or is that treated as a separate thread to mix the results of the other threads?

Are DAW synth threads typically stable over the sequential buffers? Or do the threads they are assigned to change randomly?

I have done a bit of research and found:

    #include <iostream>
    #include <syncstream> // C++20, for std::osyncstream
    #include <thread>

    std::thread::id this_id = std::this_thread::get_id();
    std::osyncstream(std::cout) << "thread " << this_id << " sleeping...\n";

I am thinking perhaps the only way to know is to run this get_id() in the audio processing of the synth, debug it out to the synth's GUI, and then just see what I get from one instance to another as the DAW plays back.

Is that a reasonable way to see what is going on?

Essentially, I am just trying to narrow down where my performance bottleneck from stacking 3-5+ synths in the project is. Perhaps it is a GPU bottleneck, but that is tough to evaluate. I’d like to be sure they are not running in series or sharing threads on the CPU/DAW processing somehow first.

Thanks for any thoughts.

4 ms of rendering time is 2x the deadline for a 1024-sample buffer at 48 kHz and not realistic. If your synth takes this long to process, it needs to be optimized. Keep in mind that a DAW may have dozens, hundreds, or thousands of processes it needs to fit into a single processing window, and if a single process takes multiple milliseconds, it's going to be a problem child.

The GPU is also not at play from the perspective of the host; the plugin (if it uses the GPU for rendering) should be more concerned with latency and synchronization. The vast majority of audio processing does not happen on the GPU.

Multithreaded audio rendering is a complex topic, but in general you cannot assume that the processing happens on the same thread between processing calls, just that multiple processing calls don’t happen concurrently. You also can’t reason about which thread gets delegated which process during rendering, only that the order is stable.


Essentially the answer to the question is "it depends", but it also doesn't matter. You cannot reason about the rendering algorithm enough to estimate how many instances of the plugin can run in a single host, because in the worst case it can all happen on a single thread (and single-threaded hosts exist). You do not get a new thread for each instance of the synth; the host will use a thread pool and a complex algorithm to spread the workload, and in the best case render as much as possible within the time window of the audio callback. But if your synth takes multiple milliseconds to render a single buffer, you have bigger problems. It should take a small fraction of that time to render a single buffer.

Okay, I will test the system to find out what is happening. I suspect this is the problem, i.e. the DAW is running multiple instances of the synth on the same threads, or in series rather than in parallel.

If it is waiting for multiple synth instances to process in series (for which there is no good reason) then it would explain the stuttering.

If they are being run in parallel it would not.

There is no way to optimize the synth further. It is highly complex and maximally optimized to use the GPU, as the CPU could not manage this much processing at all. The 4-5 ms includes the complete CPU and GPU processing time per synth instance for each 1024-sample audio buffer at 44100 Hz.

What “deadline” are you referring to? Is there some objective deadline or are you speaking in general terms?

It matters to me because I need to know whether the stuttering I am experiencing is due to the DAW running multiple synths in series on a single thread rather than in parallel as I would like it to.

I must know where the bottleneck is coming from though it sounds like this is where it is coming from.

If I have, say, only 4 instances of the synth in the DAW (4 tracks) and I get stuttering, would debugging out the audio render block's std::this_thread::get_id() for each of the four synths to the synth GUIs and checking each during playback prove the point?

I.e. if they all show different IDs, does this prove they are being run in parallel?

And if they are in fact being run in parallel, why would it stutter? Each will be done in 4-5 ms regardless of whether there are 1, 2, 3, or 4 of them, and the DAW still has 17-18 ms to basically just mix the buffers to the output and set the new MIDI data for each, which is insignificant work.

Unless there is a GPU bottleneck I am not realizing I am hitting which is somehow forcing the GPU commands to operate in series. This is why it is important for me to clarify.

I fear you’re mistaking parallelism for concurrency. The host has very good reason to render multiple instances of a single synth on the same thread, or M instances on N threads where N < M.

What “deadline” are you referring to? Is there some objective deadline or are you speaking in general terms?

When the host renders audio to an output device it has buffer_size / sample_rate seconds to fill the output buffer and return, otherwise a buffer underrun occurs and you hear an audible click.

The 4-5 ms includes the complete CPU and GPU processing time per synth instance for each 1024-sample audio buffer at 44100 Hz.

Apologies, I was off by one in the calculator.

1024 samples at 44.1 kHz is ~23 ms. If your synth takes 4-5ms to render 1024 samples that means you can fit 4-5 instances with no other plugins, because you must assume the host is only using one thread (which is the worst case). Some hosts have multithreaded renderers, but the maximum parallelism depends on the topology of the audio processing graph and may wind up being single-threaded anyway.

That’s an immense amount of time and a very low instance count and I’m guessing includes I/O latency from communicating with the GPU. So you need to add latency or re-architect to do better.

You can get dozens/hundreds of instances of a single synth on one thread with no GPU involved at all, so I doubt that your synth is so special that you cannot optimize further. My guess is you’ve prematurely optimized to render things on the GPU without accounting for the inherent overhead of the communication between CPU and GPU since all audio needs to return to CPU.

Yes, I agree that you can obviously pack an unlimited number of simple subtractive, FM, or sample-based synths onto a single thread. This is not that. There is no plausible way something so simple could take 4-5 ms, nor would it require a GPU; as I said, modern CPUs are too slow for this workload.

There exist many audio synthesis tasks far beyond the real-time reach of even the most advanced consumer-grade CPUs and GPUs in existence, e.g. the complex realm of true physical modeling.

In any case, my programming approach and design of the synth are a different subject. Please just assume for the sake of discussion that my code is what it is. I am asking about why the DAW can’t keep up, as I don’t understand that issue.

It does appear that as you say: “If your synth takes 4-5ms to render 1024 samples that means you can fit 4-5 instances with no other plugins, because you must assume the host is only using one thread (which is the worst case).”

This is roughly what I’m getting, despite my hardware being able to handle way, way more. However, I still don’t understand why this would be the case or have to be the case.

My CPU has 18 cores. My GPU can handle 128 concurrent kernel requests. I am not coming close to using all the cores on the CPU or GPU before the stuttering starts. There is no hardware limitation I can see that would prevent processing, say, 4 of these synths in parallel in the DAW and completing within 5-10 ms in this circumstance, given each only takes 4-5 ms.

If you have an empty project with just 4 synths in it, and all that hardware to utilize, are the common DAWs so stupid as to not know they can simply run the synths in parallel?

I could certainly program many more of these instances to run in parallel in my own "DAW" without any issues (although I obviously don't have the time or resources to build a real DAW of my own).

The hardware is certainly up to the task of running many instances in parallel. Why not the software?

The reason the software can't keep up with the hardware's capabilities is that real-time audio rendering is soft realtime (meaning that deadline failures are errors, but not system-wide, compared to hard realtime, where a deadline failure is a system failure), while the operating systems we run the software on are not realtime operating systems. You cannot work around this fact without building your own hardware and software stack in tandem.

Hosts are not "so dumb" as to be unable to render multiple instances in parallel. They can, and some do (tracktion engine has a multithreaded implementation, for example). The struggle is synchronizing multiple threads in less time than the audio callback is allotted by the audio subsystems of the OS that they rely on.

This is a very difficult problem that many hosts just don’t do, because they’re decades old and the idea that they have 18 threads of CPU available to them is relatively new - while most plugins don’t need to be spread amongst many cores.

When it comes to using a GPU, I think you may be neglecting I/O latency. There's no free lunch there; it takes time to move data to and from the GPU, which may outpace the time available on the audio thread. The solution to that problem is to introduce latency: you buffer incoming events, render the audio, and delay its output so that it is synchronized but enough time has passed to get the work done and transfer it back, albeit delayed by some amount.

Most plugins find that the overhead of I/O transfers to the GPU outweighs the computational benefits, which is why audio processing remains CPU-bound.

It’s not as simple as the hardware being up to the job, once all the processing is done if it’s on multiple threads it still all needs to be synchronised to one thread in order to output it to the audio device in a single buffer. This thread will be very high priority and the OS may limit how many threads can have such priority. This means even if you could launch more threads, aside from the additional overhead, you’re also likely to cause priority inversion. A DAW may also already choose to run the plugin in a separate process which also adds more overhead as it is. There are exceptions to the rule of course (see Audio Workgroups) but IMO I think it’s wise to assume one thread is running by default.


Re: the GPU, the input/output latency is measurable as part of the audio block processing. As stated, the whole thing takes 4-5 ms from the start of the PluginProcessor's render block to the end, including all processing latency. The GPU commands block the audio thread during this process, as obviously the GPU must finish before the CPU can continue. So it is all very measurable just with chrono timers.

This synth is mostly for my own entertainment at this point. If I had more processing power available, I would expand it even more actually or build even more complex models. I wish computers were even faster.

In any case, I am curious about where the bottleneck is coming from, as if it was a GPU issue, I could just install more GPUs and manually spread the synths across them.

But I suspect from our conversation that the issue is instead the DAW running them in series, when I would like them to actually use my excess hardware in parallel.

Certainly I agree most average common processes don’t need a GPU, but again that is a separate issue and question. The net benefit of the GPU is dramatic in my case.

But if I can’t force the DAW to use my hardware to its full capacity (or find a DAW that is good at that), there will be no benefit to expanding the number of GPU’s. I will just have to freeze and unfreeze tracks to use it. Which is silly given again by my estimates the existing hardware should be able to handle at least 10+ instances in parallel. And if I expanded the GPU’s I should be able to handle 16+ instances in parallel. The hardware certainly could.

Thanks for your thoughts. What a depressing and unnecessary limitation. Too bad we can’t command the DAWs on how to multithread.

That is helpful to consider. I think I understand more. But again, look at the simple case.

  • 4 synths in an empty DAW project
  • 16 fast CPU cores
  • 128 concurrent kernels on the GPU capacity
  • 4-5 ms per synth to process each 1024-sample block, with 23 ms of time allowed

If all synths start simultaneously and run on separate threads/cores, they will ALL be done within 4-5 ms. Is multithreading so bad that it should take 16-17 ms just to sum these four thread outputs? This seems implausible.

Except if the common DAWs are intentionally designed not to expand thread counts much, due to the net negative performance this would pose in average use cases if they did so too aggressively. Which is probably the case, from what you describe.

Too bad. If one could right click a track in a DAW and say “use separate thread” this would I expect certainly solve the simple case I described above.

It sounds like running std::thread::id this_id = std::this_thread::get_id(); and debugging the result out to the GUI while the synths are running in the DAW will be the best way to see, so I will try that next.

Found this thread on Cubase in which one user suggests:

Only the developers know, and they probably want to keep it secret

Multi-threaded programming is hard, really hard, and there are limited tools available to make it easy. It’s a hard problem that has been studied by programmers and computer scientists since the 70s

Some things are easy. If you have several totally independent programs running at the same time, it's easy to efficiently use multiple cores. The key word here is independent. Even programs that appear to be independent share the disk, the screen, the network connection, etc.

Processor makers hit a wall years ago when they couldn’t make a single processor run any faster, so they introduced multiple cores. The marketing department advertised them as a step forward. Programmers knew that it wasn’t that simple

If I had to guess, Cubase handles multi-core/multi-thread in an imperfect way, using best effort and a lot of secret tricks. It seems easy to imagine handling one virtual instrument on one core, one effect on another, etc. Unfortunately, the real world is a bit messier, and unexpected timing constraints pop up all the time

I would love to read a paper by the Cubase chief software architect, clearly explaining how threads and cores are used. Methinks this will never happen

BTW, I’ve been programming since 1972, and have done a lot of work in multi-threaded realtime systems

Sounds like if he is right it is all just each programmer/company’s personal voodoo and heuristics for the most common average user case. But we will never really know, minus perhaps what can be gleaned via testing with the kind of thread ID checks I describe.

Also interesting of note, I have tried Reaper in the past, and I see many people saying it is far more efficient with multicore situations even going back 10+ years.

For example:

Reaper is one of the most efficient multitrack applications I’ve used over the years. It can run lots of instances of heavy‑duty plug‑ins and soft synths — probably more than many of its competitors — without stuttering to a halt. The default Reaper settings work well with eight‑core CPUs and beyond, typically offering over 95 percent utilisation of all cores.

To achieve this efficiency, Reaper mostly uses ‘Anticipatory FX processing’ that runs at irregular intervals, often out of order, and slightly ahead of time. Apparently, there are very few times when the cores need to synchronise with each other, and using this scheme, Reaper can let them all crank away using nearly all of the available CPU power.

That was from 2011: Running Multiple Plug-ins

I will test that also to compare and again use the thread ID’s to see what is happening. Sounds like this all comes down to the DAW.

I might repeat some things that were already said in a different way, but I wanted to summarize:

  • the host calls the plugin processing function, which has to fill or process the whole buffer of samples and midi events that were given to the plugin
  • this is therefore single threaded
  • It is up to the plugin what it does and how it achieves that. There are only two rules:
    • return within the time limit
    • have all samples ready

How long that processing may take: holy-city gave an upper limit of blocksize/samplerate. But even that is a very optimistic value, because there is also time for the host to swap the buffers and mix them, and then there are other plugins on the track that have to process in series, because their input depends on the output of the previous plugins. So your time budget is a fraction of that.

When you spawn threads to fulfil the task, it is the OS scheduler that decides which thread is run and what gets processed. The host has no influence there.
If you wait for those threads to finish, then your plugin's priority effectively drops to the lowest priority of all the threads. This is called priority inversion and usually means that the plugin doesn't fulfil the realtime requirements.

When you hand some parts to the GPU, it becomes even more complicated. The bottleneck is usually not the GPU cores but shuffling the data to the GPU and back. And now you are dependent on the GPU as well in terms of your realtime requirements.

Latency to and from the GPU doesn't refer to how long the processing itself takes; rather, when you send and receive data, you usually receive data that was sent in the past. How far in the past is your latency.

Last but not least, all the usual OS operations also take away resources: writing to disk when tracking, GUI display, and even completely unrelated tasks like fetching your email in the background, and God knows what else.



Absolutely fascinating experiment. I encourage anyone who is curious about this to try it themselves.

Experiment

  • Add a string like currentThreadID to your MPESynthesiser or Synthesiser class (or PluginProcessor).
  • In the file where you override or perform your custom rendering (where you will be getting the thread ID), add:

    #define WIN32_LEAN_AND_MEAN // fixes windows.h byte-definition conflicts (https://cplusplus.com/forum/general/282167/)
    #include <windows.h> // include this rather than processthreadsapi.h directly, or you will get "no target architecture" errors (https://stackoverflow.com/questions/4845198/fatal-error-no-target-architecture-in-visual-studio)

  • Get the system thread ID inside your custom audio rendering loop (Synth/PluginProcessor) with: currentThreadID = std::to_string((unsigned int)GetCurrentThreadId());
  • (For me, I did it in my MPESynthesiser custom rendering function, which is called by PluginProcessor's processBlock.)
  • This gives you an integer identifying the Windows thread. (I believe the generic C++ thread ID I mentioned before does not give you the actual system thread ID, so it will not be useful for comparing across different synth instances.)
  • Read this string in your AudioPlugInAudioProcessorEditor via the AudioPlugInAudioProcessor& processor reference it is given, and debug it out to the screen through a label.
  • Update the label on a Timer at a given rate (make AudioPlugInAudioProcessorEditor inherit from Timer if it doesn't already, then call startTimerHz in the constructor and override timerCallback).
  • Load up numerous instances of the synth in your DAW and watch them as multiple tracks play back.

Result:

  • The thread ID’s bounce around like a yoyo! They will change multiple times a second sometimes (I am running Timer at 15 Hz so far only so that is my resolution) but sometimes last up to 2-5 seconds before changing.
  • Cubase (only one I tested) does a great job of splitting up the work on multiple threads - assuming the system thread ID method above is sound - 4-5 instances of my heavy synth almost never share a thread ID - at least I have not seen the same ID’s on screen at the same time across 6 synth instances with this method.

Conclusion:

  • I think this is the correct way to definitively identify what the DAW is doing.
  • Given that DAWs otherwise have proprietary threading methods (from the sound of the commentary I posted above), the only way to know is to check.
  • DAWs actually do a very good job of multithreading right out of the gate.

I think @holy-city you nailed something for my issue that I just realized overnight - you said “I fear you’re mistaking parallelism for concurrency.”

I think you are correct, but regarding my GPU processes specifically: NVIDIA GPUs are specced for up to 128-way kernel concurrency, but concurrency is not necessarily parallelism. I am probably running into a parallelism bottleneck on the GPU, as the DAW appears, from the best initial evidence, to be multithreading the processes beautifully.

I have ordered a new PSU and will try adding another GPU and distributing synths across multiple GPU’s. If that solves my issue, then it proves my bottleneck was the GPU, not the CPU/DAW/multithreading. If not, then it is not the GPU.

As one final note, I tried Reaper and it reached the same bottleneck as Cubase despite supposedly having better multithreading which also fits with this conclusion and the observation that Cubase seems to multithread already quite fine.

Thanks for everyone's thoughts. I will see what happens. 🙂

As a point of interest, if anyone is curious, adding more GPU capacity allowed me to use more instances of the synth. HW Monitor also showed I was maxing the GPU before I added more.

However, Cubase also hit a limit on the number of instances after that, where I was no longer maxing out the multiple GPUs. This is likely what people described here in terms of threading problems: the DAW struggling to manage enough high-priority threads to cope.

Switching to Reaper opened up that bottleneck. Reaper truly is far more efficient at multithreading than Cubase; it is not even close. Reaper only stutters once the multiple GPUs are fully saturated in HW Monitor.

Crazy how good Reaper is. Crazy how poor Cubase is. So far I prefer working in Reaper as well. No “Synth Rack” nonsense either. Very interesting overall.

I’m shocked that the cheapest least fancy DAW (Reaper) is also dramatically better in a massively obvious way than any competition. Massive difference.

You should give Waveform a test too:

That video gives you some insight into how things are scheduled.

I’m not completely familiar with how modern GPUs work internally, but I decided to drop this thought here so you can check out if this could be the problem:

Usually the software is responsible for organizing the data buffers that are sent to the GPU. That means that the software itself has to design how it is possible to handle parallel processing on GPU.

Now it could be that the GPU will not process more than one such group of data at a time. I.e. if you have several plugins that send their data to the GPU, the GPU won't parallelize all of them automatically, but processes them one at a time.

Each plugin instance could organize its own data so that the GPU processes it in parallel, but this might not be the case when several different plugin instances push their data to the same GPU at the same time. I.e. the GPU processes them one by one.

Just something to look into if this is the case. If it is, then your current approach might not work as intended.

Yep, Reaper in many ways is pretty awesome. From my experience it’s also the fastest to load which makes it a good test DAW for debugging your plugin from an IDE.
