Why is my synth over 10x less efficient as a VST3 vs. standalone?

I have spent several years now building a very complex modal synth using arrays of resonant bandpass filters. It’s just for my own use, so I didn’t mind making it extremely complex, and I have maximized the efficiency of the code as best I can. In standalone mode it runs perfectly, with all the efficiency I need. In VST3 mode, however, it is horribly inefficient and cannot run properly at any reasonable latency unless I drop the number of bandpasses (ie. modes) precipitously.

In terms of performance, to run smoothly I need:

  • VST3: 8192 samples of latency (93 ms) in Reaper, with only 2/3 of the bandpasses, pushing one CPU core to around 55-75%.
  • Standalone: 882 samples of latency (10 ms), with the full number of bandpasses, pushing one CPU core to around 75-90%.

So there are two levels of inefficiency and poor performance in the VST3:

  • I need 9x as large a buffer to get it to run smoothly.
  • I cannot utilize any of my CPU cores to anywhere near their fullest extent.

I tried two DAWs, Reaper and Cubase, and Cubase was even worse. In Reaper I can at least get one instance to run smoothly with those settings; Cubase drops out even then.

I have run Latency Monitor (LatencyMon) to check for any background problems and my system is crystal clean. I get at most two green bars on any area (ie. there are no interrupts or system issues occurring).

I have Hyper-Threading disabled in the BIOS to maximize per-core capacity, and all cores are set to the same clock speed for consistency.

I cannot understand the source of this insane VST3 inefficiency. If the synth works perfectly in standalone on one core at given settings, I have 16 high powered cores all running at the same clock speed, and I open a clean empty project in Reaper, shouldn’t I be able to load at least 12 instances of the synth before I run into problems? If I just load even one instance alone in Reaper, shouldn’t it be at least close to the standalone performance?

Why can’t the VST3 operate at the same or even close to the same latency as the standalone? Why can’t the DAWs (either Reaper or Cubase) come close to even utilizing most of a full core before they start dropping out?

Most importantly, is there some way to correct for this? A different plugin format? A different sound card? Something I can change about my code that might be causing this issue?

I have an insane amount of processing power. Every single one of those cores is more than strong enough to handle an instance of the synth at low latency in standalone.

Might there be any solution to get the same or reasonably close performance out of a DAW? Any ideas on what this might represent and how to fix it?

Thanks for any help.

As a starting point you should really run both the standalone and the VST3 under a profiler to compare them and spot the parts of your code where the VST version consumes a lot more time – this will be a lot more helpful than guessing what could be going on :slight_smile:

4 Likes

Like PluginPenguin suggested, it’s best to use a profiler to find the problematic parts of the code. But since guessing games can be fun too, a couple of ideas:

  • The GUI for some reason uses a lot more CPU as a VST3 plugin than as a standalone. (In principle this shouldn’t necessarily affect audio performance, but it’s still a possibility.)
  • There might be some problem with the plugin’s parameter handling. Maybe the host is sending the plugin a lot of parameter changes you don’t get with the standalone. (Or your GUI is doing that.)
  • Are you testing a debug build instead of a release build? (The performance difference can be dramatic. A quick way to check is sketched below.)
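
For the last point, one crude way to verify which build the host really loads is to log the build configuration to a file from your processor constructor. Something like this hypothetical helper (not from your project, adjust the file location and names as you like):

#include <JuceHeader.h>  // for juce::File and the JUCE_DEBUG macro

// Hypothetical helper: append which build was loaded to a file on the desktop,
// so you can check it even when no debugger is attached to the host.
static void logBuildConfiguration()
{
    auto logFile = juce::File::getSpecialLocation (juce::File::userDesktopDirectory)
                       .getChildFile ("SynthBuildCheck.txt");

   #if JUCE_DEBUG
    logFile.appendText ("Loaded a Debug build\n");
   #else
    logFile.appendText ("Loaded a Release build\n");
   #endif
}

Call it once from the processor constructor and look at the file after loading the plugin in the host.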
1 Like

Thanks guys. A few further points:

  • It’s being built in Release mode either way, so it’s not that (Debug mode is of course far less efficient still).
  • I notice Reaper can smoothly play back recorded tracks with the synth at normal buffer sizes (eg. 512 samples), but when it’s auditioning/recording new MIDI input it requires the ridiculous 8192-sample buffer size to cope.
  • Cubase requires an 8192-sample buffer both for playback and for auditioning/recording, and is still glitchy even at that.

I’m not sure if that means anything sensible. Obviously whatever the issue is, Reaper is avoiding it in playback at least.

So from what you’re saying, I will need to learn how to use the Profiler to solve this. I’m using Visual Studio on Win 10 x64. I can run the Profiler easily enough on Standalone. I just go Debug > Performance Profiler > CPU Usage > Start.

However, I am not sure about doing this on a VST. I’ve never debugged a VST. I tried following these instructions:

He says to create the debugging VST first (ie. Debug > Start Debugging), then copy the resulting VST into your usual VST folder. Then in the solution, go to Properties, set the Command to your host’s location (ie. “C:\Program Files\Steinberg\Cubase 10\Cubase10.exe”), and change Attach to Yes. Once that’s done you open your host with a project using the VST and click Local Windows Debugger. Or I presume I could then alternatively use the CPU Profiler.

However, when I do that, Visual Studio still just says: “Unable to start program ‘(VST_name)’. ‘(VST_name)’ is not a valid Win32 application.” It’s not figuring out how to find the running VST in Cubase.

What am I doing wrong here? Thanks again. At least I now have hope that this should be somehow fixable.

A shot in the dark:

Since you are dealing with multiple threads, there is a chance of priority inversion. It might be that in a standalone the UI thread is far less busy, so the effects are much milder than in a host, where the UI thread has a lot to deal with outside your own UI.

Even the audio thread that you deliver your audio on is probably far less busy in standalone than in a host, where you have to share that time.

What I would do is create a softer multithreaded version that is allowed to skip unfinished results and count the failures instead, so you get an idea of how big the problem is and whether it is in the threading at all…
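
Very roughly, the kind of thing I mean (a hypothetical sketch with made-up names, assuming some background thread fills backgroundResult, which may not match your setup at all):

#include <JuceHeader.h>
#include <atomic>

// Skip unfinished background results and count how often that happens,
// instead of blocking the audio thread while it waits.
struct NonBlockingSection
{
    std::atomic<int> missedBlocks { 0 };        // how often the result was not ready in time
    juce::CriticalSection resultLock;           // guards backgroundResult
    juce::AudioBuffer<float> backgroundResult;  // filled by some background rendering thread

    void copyIntoOrSkip (juce::AudioBuffer<float>& buffer)
    {
        const juce::ScopedTryLock tryLock (resultLock);

        if (tryLock.isLocked() && backgroundResult.getNumSamples() >= buffer.getNumSamples())
        {
            const auto numChannels = juce::jmin (buffer.getNumChannels(),
                                                 backgroundResult.getNumChannels());

            for (int ch = 0; ch < numChannels; ++ch)
                buffer.copyFrom (ch, 0, backgroundResult, ch, 0, buffer.getNumSamples());
        }
        else
        {
            buffer.clear();      // skip the unfinished result instead of waiting for it...
            ++missedBlocks;      // ...and count the failure so you can see how big the problem is
        }
    }
};

If missedBlocks climbs quickly in the host but stays near zero in the standalone, you know the threading is at least part of the problem.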

You can still use the ‘Performance Profiler’ to start a profiling session if you like. After clicking ‘Debug’ > ‘Performance Profiler…’, you need to change the ‘Analysis Target’ to ‘Executable’. You will then be given the option to start the ‘Performance Wizard’.

Start the wizard, select ‘CPU sampling’, click ‘Next’, then select ‘An executable (.EXE file)’, click ‘Next’ again, then browse for the host executable you want to monitor. Clicking ‘Next’ again will take you to the final page, where you can check the box to launch when you click ‘Finish’.

Before you start profiling you’ll want ‘Generate Debug Info’ (in the property pages: Linker > Debugging) to be enabled for the Release configuration so you can see the function names.

Thanks for the ideas and feedback guys. I figured out how to run the Profiler on a debugging VST3 in the manner shown in the linked video - my error was that I had changed the settings on the “SharedCode” part of the solution rather than the “VST3” part, which is why it did not work.

When I run the Profiler on the Standalone version, I get a very detailed summary of the CPU usage for each synth function, and it all makes sense for what I would expect. However, when using the VST in Cubase, I am getting no useful or similar information. It runs for 300-450 ms and then just seems to stop itself, and this is what I get on the summary and detailed report:

I would like to see which function within the VST is using up all this insane amount of processing to necessitate the dramatically greater latency to run. This isn’t telling me anything at all. Is there something I’m missing again to get useful information from the Profiler when running it on the VST in a host?

I just tested a bit further and the Standalone runs perfectly smoothly even down to the lowest latency settings available on my system, with no effort, ie. with my soundcard set to 64 samples and the JUCE Standalone exe set to the lowest it allows of 265 samples.

This clearly runs perfectly fine in Standalone without much stress. My system is very fast, and while this is a complex synth, there’s no good reason it should run so easily alone and then require 8192 samples of latency just to barely run one instance in a DAW. There must be something very strange going on, and I will need to find a way to narrow it down.

On a chance from the comments above, I tried disabling the UI altogether by making the PluginEditor a blank screen (commenting out the interface elements), and it behaved the same, so that didn’t fix it. I also tested the suggestion that maybe my parameters are being changed constantly by the DAW by putting a DBG output in the parameterChanged override of my voice, and it is not being triggered constantly - only when a parameter actually changes.
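
For reference, the check was just along these lines (simplified; the real listener lives in my voice class and the exact signature in my project may differ):

// Simplified version of the test: log every parameter change the voice receives,
// assuming the usual AudioProcessorValueTreeState::Listener callback. In
// Reaper/Cubase this only fires when I actually move a control, so constant
// parameter spam from the host seems to be ruled out.
void MPESynthVoice::parameterChanged (const juce::String& parameterID, float newValue)
{
    DBG ("parameterChanged: " << parameterID << " = " << newValue);

    // ...existing handling of the new value...
}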

The only observation I can make about the difference between Standalone and VST behaviour, outside of the latency issue, is that for some reason the Standalone only seems to occupy one CPU core at a time. In Hardware Monitor, I can push one core to 70-90% utilization by maxing out my synth settings. But as a VST, no core goes above 20-75%, and it seems the DAW is trying to split the work across multiple cores, which obviously is not helping.

Could that be the issue, and if so, is there anything that can be done about it? ie. is the DAW trying to split the synth across multiple cores, unnecessarily introducing dramatic inefficiency and latency requirements?

@daniel I’m not sure how to do what you’re suggesting with altering the threading, but I would like to try it if it would help narrow this down. Can you give me any pointers on how that would specifically be done in the context of a normal MPESynthesiser-derived synth?

Thanks again for all the help guys. Any other ideas or suggestions are appreciated.

Well it definitely seems to be related to how the load is split among cores, though not how I was thinking exactly.

I mentioned Reaper can run it perfectly at 512 samples as long as it is playing back pre-recorded MIDI. But when I am auditioning/recording live it struggles as well unless it is at 8192 samples.

I checked HWMonitor to compare what’s happening in each condition. When it’s working smoothly on pre-recorded MIDI, it’s splitting the processing up among all the cores so none is going higher than 10-15%. When I activate auditioning or recording it hammers one core with 85% and then it starts glitching except at super high latency.

This is still far less efficient than the standalone. The standalone can handle much more than this at the lowest latency. But maybe that’s to be expected just due to the inefficiency of a DAW juggling things?

Ie. if there’s nothing wrong with the synth and this is all just due to differences in the way different DAWs spread the processing burden, there may not be any solution. At least Reaper can play back properly at normal latencies though…

Maybe this is something other people aren’t running into because most people aren’t writing synths that require this much CPU to run?

Have you tried building the plugin as a VST2? (It’s still possible to do if you can find the needed header files somewhere; you just can’t distribute the plugin publicly if you didn’t sign the Steinberg license agreement before October 2018.) This could help clear up whether the CPU consumption/latency issue is because of something in the Juce VST3 implementation.

Have you ensured Cubase and Reaper otherwise work OK on your system with their built-in plugins and other 3rd party plugins?

You might want to try the VST3 build in other hosts. (Like the Juce Plugin Host example or Plogue Bidule.) It could be there’s some similar issue in both Cubase and Reaper that makes your plugin not work correctly.

Is your Juce version up to date?

Thanks for the further ideas Xenakios. You’re always very helpful. I tried using the JUCE VST3 host and it works perfectly, exactly the same as the Standalone. I can run it at the lowest soundcard setting (32 samples) and host setting (265 samples) with absolutely perfect efficiency.

I also tried adding the VST2 files from here to test if it is a VST3 issue:

I was able to build a VST2 version which worked in Cubase and Reaper. It didn’t work in Ableton; I’m guessing this is because Ableton does not have full support for MPE. I have a “Legacy mode” in my synth for non-MPE MIDI devices but it still didn’t work. Either way, Cubase and Reaper continued to behave the same way with the VST2 as with the VST3. Ie. Cubase required 8192 samples to function smoothly in any respect, and Reaper could handle 512 samples on playback but needed 8192 on recording/auditioning.

As far as I can tell all my other plugins work perfectly normally in Reaper and Cubase. However, it’s not really a fair comparison because of course most average plugins aren’t using even 5% of what my synth is using.

I don’t know why the standalone and JUCE plugin host can handle it perfectly but Reaper can only handle it selectively and Cubase not really at all. If there is no solution, I will stick with Reaper then and just deal with the buggy recording/auditioning by dropping the settings (# of bandpasses) precipitously during recording and then putting them back up during playback. Perhaps that will be a usable workaround. Seems crazy that I should have to at all though. What’s the point of having multicore CPUs with so much power if you can’t even use them because the DAWs don’t know what to do with it?

I’m going to try Cakewalk, Tracktion, and Bitwig. That should cover all the Windows MPE-compatible options listed here:

https://support.roli.com/support/solutions/articles/36000037202-compatible-synths-daws-and-instruments

Hopefully one of them knows how to manage CPU from multiple cores properly, if that is seeming like the issue.

Any other thoughts? Does this make sense on any level why the JUCE standalone & VST host can manage perfectly down to almost real time latency but the DAWs struggle?

Did you notice the comment on KVR that you may be building and running in debug mode after all? Your profiler screenshot shows the debug version of the Microsoft runtime DLL. (I should have noticed that myself earlier… It could of course even be Cubase itself using it, who knows what kind of craziness they have in there, but it’s still worth checking really carefully which build of your plugin you are running.)

I ran the profiler in Debug mode because that’s the only way I could see to get the detailed info from the Profiler, at least when I tested it on the Standalone, so I presumed I ought to do the same on the VST. I am most definitely not running it in Debug to use as a VST3/VST2. :slight_smile: I couldn’t get it to play back anything at all if I were. Debug mode can only handle maybe 8-10 bandpasses per voice at most due to how inefficient it is; I’m running 100-200 per voice in Release mode, whether Standalone or VST. That would make life too easy, I think.

OK, I am almost out of guesses then. Just a couple more:

  • Could there be a denormals issue? (Are you using the Juce ScopedNoDenormals to avoid denormals?)
  • Are you using your own worker threads for the audio calculations? If yes, maybe there’s some kind of synchronization or other issue that only happens when running the plugin in Cubase, Reaper etc. (Those already multithread themselves, so plugins adding more threads into the mix does not necessarily end up with great results.)

Thanks again. I think I am using ScopedNoDenormals correctly. In PluginProcessor.cpp, my processBlock function is:

void AudioPlugInAudioProcessor::processBlock (AudioBuffer<float>& buffer, MidiBuffer& midiMessages)
{
    const ScopedLock renderLock(lock);
    ScopedNoDenormals noDenormals;
    buffer.clear();   
    mMpeSynth.renderNextBlockCustom(buffer, midiMessages, 0, buffer.getNumSamples());
}

I don’t think I’m using my own worker threads, since I don’t know how to actually do that. I haven’t changed anything at that level in the MPESynthesiser structure. That would involve creating threads and controlling their priorities like this, I presume?

Maybe learning how to do that would actually fix the problem, if the issue is Reaper/Cubase aren’t threading correctly. It’s over my head now but if it might solve something I could try to learn.

For example, when Reaper is working well (512 samples latency smoothly on playback) it is distributing the load across all my 16 CPU cores evenly. This implies it is breaking it into at least 12-16 threads, right? Ie. One per core? When the standalone/JUCE host work well, they are distributing it across 1-2 cores from what I see (sometimes one, sometimes two) which would imply at least two threads, right? But when Reaper struggles (during audition/recording) it seems to be trying to fit the whole thing on one core (all the other cores go quiet), ie. one thread? Cubase isn’t working well at all so it’s hard to judge from that one.

If I can figure out how Reaper or the JUCE VST host is managing threads so well on this and then hardcode that into the synth, might that fix the problem?

The host is not responsible for your background threads. The host will call your processBlock from its audio thread.
It might use different threads for different tracks, but usually each track on its own has one thread, since each plugin depends on the previous one, so multiple threads would be useless there.

The differences you see can have all kinds of reasons:

  • the general background load of the machine,
  • how many threads are already running,
  • what priority the audio thread is started with (relative to the priority you chose for your background threads),
  • pre-buffering? AFAIK Reaper will pre-process up to 20 seconds of recorded tracks, which is why you observed Reaper outperforming everything else with recorded material

The circumstances are so different that it is hard to tell what measures you should take to improve things. There are many things you could have done wrong, which we cannot know without seeing the code.

If your processBlock waits for background threads, that is called priority inversion, and your processBlock will effectively be as slow as the slowest of your background threads. Chances are you create more overhead than you actually gain.

Could you be receiving extra MIDI messages from those particular hosts that cause additional load compared to the standalone build?

Adding your own new threads likely wouldn’t help at all; it would probably only make things even worse.

Thanks @daniel. That’s actually what I’m wondering now, though. Have I done anything wrong at all, or are the DAWs just poor at juggling a CPU-intensive synthesiser? The fact that it works perfectly in the Standalone and the JUCE VST host at extremely low latency would suggest to me that maybe I didn’t do anything wrong at all.

You were right about Reaper though. I turned off the “anticipatory processing” and now it glitches out on playback as well, so that’s why it was handling it better.

What is different about the JUCE VST host vs. the DAWs that might explain this?

As for code, I don’t want to share my whole project for obvious personal preference reasons, but I do have the basic synth architecture here:

That was a “sandbox” project I made when I was first learning C++ and JUCE, to learn how to put together a basic nested FlexBox GUI (which you, daniel, gave me the first idea for), voices, and plugin processing. I have continued to experiment with it for testing purposes since. Fair warning: it’s extremely rudimentary, has improperly designed envelopes, is strictly sine-based (no bandpasses), and it sounds terrible. I just made a few quick cleanup changes so it might be more tolerable.

But everything I have built over the years since is based on that basic parameter/processing/GUI architecture. Although the voices are now vastly more complex and I have made many other changes to the rendering (like recently allowing voices to feed back on one another sample by sample, etc.), I have always had this DAW latency issue at any high-CPU settings as I have gone along. So I presume that if there is something “wrong” in the design that explains it, it should still be in there. I only started playing with changing the rendering methods etc. very recently, and this DAW latency issue was there before I did that.

A method for “fixed block processing” is in the PluginProcessor.cpp, as that was suggested by someone else as a possible fix for the DAW latency issue, but it didn’t help so I’m not using it. I just deactivated it in the “sandbox” as well (it made no difference either way); the general idea is sketched below.
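
For anyone unfamiliar, the fixed block processing was roughly this shape (a simplified sketch, not my actual code): render the synth in fixed-size sub-blocks regardless of the buffer size the host asks for.

void AudioPlugInAudioProcessor::processBlock (AudioBuffer<float>& buffer, MidiBuffer& midiMessages)
{
    ScopedNoDenormals noDenormals;
    buffer.clear();

    constexpr int fixedBlockSize = 64;            // arbitrary sub-block size
    const int numSamples = buffer.getNumSamples();

    // Render the synth in fixed-size chunks so the per-call workload is the same
    // no matter what block size the host chooses.
    for (int start = 0; start < numSamples; start += fixedBlockSize)
    {
        const int numThisTime = jmin (fixedBlockSize, numSamples - start);
        mMpeSynth.renderNextBlock (buffer, midiMessages, start, numThisTime);
    }

    // Any whole-block processing (eg. the delay) would still run once per host block here.
}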

Would you be willing to click through a few places like the PluginProcessor.cpp, MPESynthesiser.h, MPESynthVoice.h to see if there’s anything obvious in the way it’s set up?

@Xenakios, I can try to test that possibility by running a Debug with a DBG output for MIDI data. I’ll try a bit later. If you can also look at that basic simple project as well and tell me if there’s something fundamentally flawed in the way it’s structured I would appreciate it too.

Thanks again guys. I guess the bottom line question is whether or not I have done something wrong, and if I have done something wrong, why does it still work perfectly in the JUCE host and standalone?

I have the EXACT same issue! It drove me crazy for a week!!!
In my case the problem (and therefore the solution) was super trivial:
For some reason the method prepareToPlay is called with a much bigger samplesPerBlock number than the buffer that actually arrives in processBlock. In my case prepareToPlay receives 528 samples per block, but then in processBlock I only have to fill the first 48 samples.

Since I had some logic that calculated the whole 528 frames, only a bigger buffer allowed me to play correctly.

My hypothesis is that prepareToPlay receives the biggest number of samples that could be requested, so Cubase can use it for playback, but then processBlock can have a smaller buffer if required, for example for “realtime” performance.

I hope this helps! (Either to you or any other person looking for an answer)

Your hypothesis is backed up by the documentation of prepareToPlay():

The maximumExpectedSamplesPerBlock value is a strong hint about the maximum number of samples that will be provided in each block. You may want to use this value to resize internal buffers. You should program defensively in case a buggy host exceeds this value. The actual block sizes that the host uses may be different each time the callback happens: completely variable block sizes can be expected from some hosts.
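
As a minimal sketch of the conventional pattern (the scratchBuffer member is hypothetical, just for illustration): treat samplesPerBlock purely as an upper bound for pre-allocating, and always render exactly the number of samples the host delivers in each callback.

void AudioPlugInAudioProcessor::prepareToPlay (double sampleRate, int samplesPerBlock)
{
    // samplesPerBlock is only a hint about the *maximum* block size:
    // use it to pre-allocate any internal scratch buffers up front...
    scratchBuffer.setSize (getTotalNumOutputChannels(), samplesPerBlock);   // hypothetical member

    mMpeSynth.setCurrentPlaybackSampleRate (sampleRate);
}

void AudioPlugInAudioProcessor::processBlock (AudioBuffer<float>& buffer, MidiBuffer& midiMessages)
{
    ScopedNoDenormals noDenormals;
    buffer.clear();

    // ...but always render the number of samples actually passed in, which can be
    // smaller than samplesPerBlock and can vary from callback to callback.
    mMpeSynth.renderNextBlockCustom (buffer, midiMessages, 0, buffer.getNumSamples());
}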

2 Likes

Thanks for clarifying that there truly was a problem here and it is fixable. I’ve just come back to this to try to fix it and I’m not sure I fully understand what I need to do here.

How do I actually fix this?

This is what I have as my basic code.

PluginProcessor.cpp:

void AudioPlugInAudioProcessor::prepareToPlay (double sampleRate, int samplesPerBlock)
{
    // Use this method as the place to do any pre-playback
    // initialisation that you need..

	ignoreUnused(samplesPerBlock);

	if (lastSampleRate != sampleRate) {

		lastSampleRate = sampleRate;
		mMpeSynth.setCurrentPlaybackSampleRate(lastSampleRate);
	}
}

void AudioPlugInAudioProcessor::processBlock (AudioBuffer<float>& buffer, MidiBuffer& midiMessages)
{
	const ScopedLock renderLock(lock);

    ScopedNoDenormals noDenormals;

	buffer.clear();
	   
	mMpeSynth.renderNextBlockCustom(buffer, midiMessages, 0, buffer.getNumSamples());
}

MPESynthesiser.h:

void renderNextBlockCustom(AudioBuffer<float>& outputAudio,
		const MidiBuffer& inputMidi,
		int startSample,
		int numSamples)
	{
		MPESynthesiser::renderNextBlock(outputAudio, inputMidi, startSample, numSamples);
				
		//...custom block based output processing, eg:
		if (delayOnOff) {
			monoDelay.renderNextBlockMono(outputAudio, delayTime, delayFeedback, delayPrePostMix, delayDWMix);
		}
	}

So how and where would I fix this? Do I need to do something with samplesPerBlock in prepareToPlay (double sampleRate, int samplesPerBlock) in order to resize some buffers or something, rather than just calling ignoreUnused(samplesPerBlock)? I got this code from a tutorial when I was first starting, but I now realize I’m basically completely ignoring samplesPerBlock and I obviously need to use it somewhere.

ie. Do I need to somehow resize the buffer used in processBlock so it matches samplesPerBlock and I’m processing the right number of samples per block?

I just now tried storing it as a variable and using it to dictate the number of samples to process like this:

void AudioPlugInAudioProcessor::prepareToPlay (double sampleRate, int samplesPerBlock)
{
    // Use this method as the place to do any pre-playback
    // initialisation that you need..

	prepareToPlaySamplesPerBlock = samplesPerBlock;

	if (lastSampleRate != sampleRate) {

		lastSampleRate = sampleRate;
		mMpeSynth.setCurrentPlaybackSampleRate(lastSampleRate);
	}
}

void AudioPlugInAudioProcessor::processBlock (AudioBuffer<float>& buffer, MidiBuffer& midiMessages)
{
	const ScopedLock renderLock(lock);

    ScopedNoDenormals noDenormals;

	buffer.clear();
	
	mMpeSynth.renderNextBlockCustom(buffer, midiMessages, 0, prepareToPlaySamplesPerBlock);
}

Is that the right idea?

Or what if anything am I supposed to do with samplesPerBlock from prepareToPlay?

1 Like