Thread responsiveness differences between M1 and Intel

In a plugin, I’ve got a background thread that does some computationally heavy work on data pushed to a lock-free queue from the processing callback. Ideally, I want the computation to be performed as soon as possible, so in order to evaluate the usual duration of a computation I created a first test case that looked like this (simplified):

void run() override
{
    while (! threadShouldExit())
    {
        inputQueue.popNewData (localBuffer);
        if (localBuffer.hasEnoughData())
        {
            performanceCounter.start();
            theComputationalHeavyThing (localBuffer);
            performanceCounter.stop();
            localBuffer.clear();
        }
    }
}

So this is a super dumb busy-poll loop, trying to pop data from the queue into a local buffer and processing it as soon as there is enough data. As expected, this brings the system’s CPU meter up to 100%, and the profiler reveals that most time is spent polling the input queue when it has no new data yet. But I see theComputationalHeavyThing producing a consistent execution time as measured by the performance counter.

In order to bring down the CPU impact, I modified the code like this:

void run() override
{
    while (! threadShouldExit())
    {
        inputQueue.popNewData (localBuffer);
        if (localBuffer.hasEnoughData())
        {
            performanceCounter.start();
            theComputationalHeavyThing (localBuffer);
            performanceCounter.stop();
            localBuffer.clear();
        }
        else
        {
            sleep (1);
        }
    }
}

On Intel-equipped Macs and Windows machines, this results in a huge drop in the system’s CPU meter, since the implementation no longer spends all its time waiting for the queue to contain new data, while keeping the measured execution time of my function of interest at exactly the same consistent level as before. Great!

On ARM-equipped Macs, however, I see the same drop in CPU load but at the same time see the measured execution time increase by a factor of 5 compared to the busy-loop solution. As I had read a bit about ARM Macs tending to schedule work to efficiency cores rather than performance cores, I played around quite a bit with thread priority options, declaring the thread priority as highest or even using startRealtimeThread and passing various workDuration values as an option – without any change.

I’m not surprised in general to see a slight execution time increase; I’d assume e.g. quite a few cache misses occur when the thread gets scheduled back in. Still, I’m a bit irritated to see such a huge increase in the execution time of a single function depending on whether it’s called from within a busy loop or after a short sleep interval. Unfortunately I don’t have an ARM Mac myself to really dig deep into where the performance difference might come from. But maybe someone around here has experienced the same, knows a bit about scheduling and hardware architecture on ARM Macs, and can offer an explanation – and, at best, something that could improve the situation in this case?

IMHO, this makes sense when it comes to saving energy. The only measurable indicator of performance demand that determines on which core and at what clock speed the code is executed is the proportion of time the CPU actually spends working, and this is of course lower when the thread is sleeping most of the time. So the scheduler has a latency: it first checks whether the workload can be handled on an efficiency core, and only if the load lasts longer is it moved to a performance core.

It would be interesting to know if specialised waiting options like std::condition_variable::wait (aka WaitableEvent) behave differently, but I don’t think the scheduler makes exceptions.

I don’t have a lot to offer, except that if you have to sleep() during a high-performance workload, you’re doing something wrong.

Yes, even on M1 Arm.

But it’s hard to tell you why you’re doing it wrong without access to the inputQueue and localBuffer classes, to determine what’s going on between those calls.

a lock-free queue

This is an interesting claim. Is there a CAS (compare-and-swap) operation at the base of this class?
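
For reference, the usual building block is std::atomic’s compare-exchange. Here’s a minimal sketch (not the poster’s class, just an illustration of what a CAS at the base of a lock-free structure typically looks like):

#include <atomic>

std::atomic<int> head { 0 };

// compare_exchange_weak is the C++ spelling of CAS: it stores the new value
// only if head still holds expectedValue; on failure it reloads expectedValue.
bool tryAdvanceHead (int expectedValue)
{
    return head.compare_exchange_weak (expectedValue, expectedValue + 1);
}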

Shouldn’t you be doing this (?):

void run() override
{
    while (! threadShouldExit())
    {

        if (localBuffer.hasEnoughData())
        {
            performanceCounter.start();
            theComputationalHeavyThing (localBuffer);
            performanceCounter.stop();
            localBuffer.clear();
        }
        else
        {
            inputQueue.popNewData (localBuffer);
        }
    }
}

I think this is more or less the same as my initial approach

void run() override
{
    while (! threadShouldExit())
    {
        inputQueue.popNewData (localBuffer);
        if (localBuffer.hasEnoughData())
        {
            performanceCounter.start();
            theComputationalHeavyThing (localBuffer);
            performanceCounter.stop();
            localBuffer.clear();
        }
    }
}

So how would you handle it? The realtime thread will push a bunch of samples to the queue with every process callback. Ideally the worker thread should wait until there are enough samples, empty the queue in one run and then do its processing as soon as possible. The problem here is making the worker thread wait for a certain event (in this case enough samples having been pushed to the queue) from the realtime thread in a way that does not involve system calls invoked from the realtime thread. E.g. a WaitableEvent that the worker thread waits on and that is signalled from the realtime thread is not realtime-safe to my knowledge. So the best thing I can think of is making the worker thread poll the queue to find out if data has arrived to work on. Polling in a loop without waiting burns a lot of CPU, so adding the shortest possible sleep when there wasn’t yet enough data to execute the workload, and then trying again, seems like the best option I came up with.

As I said, this is just more or less pseudocode; the real implementation is a bit more complex. But I can tell you that the underlying queue implementation we use is a moodycamel::readerwriterqueue. We use this queue a lot in our code whenever data needs to be passed from one thread to another, but I haven’t dived deeply into its implementation myself so far. I’ve always wanted to find some time to investigate the field of lock-free queue implementations a bit more, but haven’t found the time to do so.
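
For readers who don’t know it, here’s a rough usage sketch of that queue, assuming the single-header readerwriterqueue.h and a hypothetical processSample function:

#include "readerwriterqueue.h"

moodycamel::ReaderWriterQueue<float> queue (4096); // capacity preallocated up front

// Producer (realtime thread): try_enqueue fails instead of allocating when
// the queue is full, so it is the variant to call from the audio callback.
bool pushed = queue.try_enqueue (0.5f);

// Consumer (worker thread): try_dequeue returns false when the queue is empty.
float sample;
while (queue.try_dequeue (sample))
    processSample (sample); // hypothetical consumer

Note that plain enqueue() will grow (reallocate) the queue when it’s full, which is exactly the behaviour you don’t want on the realtime thread.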

So, I’m open to proposals for different architectures to solve the general problem if there are any that I don’t have on my radar.

I think this is more or less the same as my initial approach

Well … not really, no. There is a subtle performance difference to be had between always checking whether there is new data to pop off the queue (a checking operation that seems expensive and therefore should only be done when absolutely necessary), and only doing it when there isn’t enough work to be done … since it’s a ‘slow’ operation, only checking for new queue contents when you really need to (i.e. when you’re not doing the heavy processing) is the more effective way of doing things.

Take a closer look - you’re incurring a ‘useless’ operation, and potentially introducing a bug (what if there is too much data after your initial popNewData() call?), by always checking for queue contents. Only checking for queue contents when you’re not otherwise ready to do the heavy processing is simply more efficient - and your compiler will have an easier time optimizing this, too.

So how would you handle it? The realtime thread will push a bunch of samples to the queue with every process callback. Ideally the worker thread should wait until there are enough samples, empty the queue in one run and then do its processing as soon as possible.

That’s what my code does, but it’s not what your code does. Your code checks the queue whether or not it has work to do, and that is inefficient, since the real work only happens when there is enough data … better to do the work if there is enough data, or check the queue and add data if there isn’t - not both operations, no matter what …

The problem here is making the worker thread wait for a certain event (in this case enough samples having been pushed to the queue) from the realtime thread in a way that does not involve system calls invoked from the realtime thread.

That’s what the logic (“if localBuffer is ready, do the work; otherwise, check the queue for more data”) does. If you’re always checking the queue, whatever semantics are protecting the queue from the other thread are going to add useless time to your outer while() loop.

So the best thing I can think of is making the worker thread poll the queue to find out if data has arrived to work on.

Only poll when you don’t have any work to do. Don’t poll when you (potentially) already have the localBuffer ready to process. Unnecessary poll()'ing is the bollocks here.

Polling in a loop without waiting burns a lot of CPU, so adding the shortest possible sleep when there wasn’t yet enough data to execute the workload, and then trying again, seems like the best option I came up with.

In realtime code, it’s a huge smell to encounter sleep()'s intended to keep two threads joined by a queue from interfering with each other. This indicates you haven’t quite got the semantics of your thread interaction sorted.

EDIT: I checked the readerwriterqueue.h code, and indeed you are potentially triggering a queue resize event by forcefully popNewData()'ing during every iteration of the while() loop. If your data is ready when the queue size is 100, and you’ve told readerwriterqueue that you expect that queue to be 100 in size, then when you forcefully popNewData() onto a ‘ready’ buffer, it is 101 in size - and readerwriterqueue will re-allocate, costing you performance.

This problem goes away if you only popNewData() when there isn’t enough work to be done. You won’t hit the queue size limit, since the queue will be emptied as soon as it’s full enough to a) proceed with the heavy processing and b) that processing completes …

I really appreciate your willingness to give an in-depth answer here, but I think it’s missing the point, sorry :confused:

As already mentioned, what you see above is not the real implementation. That popNewData function is pseudocode; the real code does a lot of checking whether any data is available and will only pop the number of elements needed, appending them to the local buffer until it holds the desired amount of elements. And even that local buffer is more of a ring buffer, and the actual computationally heavy processing is done on overlapping blocks, with all kinds of checks in place that will never lead to the unintended resizing and potential data loss that you spotted here.

And I also know for certain that there will be new data in the queue after a successful run of my workload, simply because that operation takes long enough. So checking the queue for new data right after the computation makes sense here, because I’m nearly 100% sure that the audio thread will have pushed new data to the queue in the meantime.

In your code, after the execution of the if branch, the while loop would continue, skip the if branch that time and execute the else branch. This will happen a few hundred times, where probably only a few calls to the else branch will lead to data being popped from the queue and all the others will just do nothing because the queue is still empty. Finally there is the point where enough data has been appended to the local buffer and the if branch is taken again. There is no chance that the if branch is taken unless the else branch has been taken a couple of times, and due to the timing situation there is no chance that a pop attempt is successful more often than it is unsuccessful, leading to a lot of CPU being spent on finding out that there still is no new data in the queue to pop.

So I come to the conclusion that your solution is basically nothing other than always doing that pop operation.

I probably should have named that function tryPop in that example code to make that point clearer.

I think I disagree. If I know that the worker thread will spend a great amount of its time unsuccessfully attempting to read data that has not been produced yet, it had better wait a bit until that has happened. Given that there is no realtime-safe way to wake up the consumer worker thread from the realtime producer thread once the producer knows that the condition is met – at least none to my knowledge – making the worker thread sleep for a short amount of time after an unsuccessful pop attempt and then trying again, in the hope that new data has been produced in the meantime, really seems like the best approach possible in that situation.

With the use of a sleep() call, you are literally throwing away processing power. Either you’re writing a high-power processing method, or you’re writing a sleep()'er. :wink:

So I come to the conclusion that your solution is basically nothing other than always doing that pop operation.

Sorry, but no. In my code the pop operation to create localBuffer only happens when there isn’t actually any work to be done. And the work operation only happens once sufficient pop operations have been done.

It is your code that always does the pop operation, no matter what/when … and this is a waste in the case where localBuffer is ready - and potentially even more catastrophic in the case where localBuffer is ready but you’re going to put new stuff into it, anyway - your queue class will resize!

Please, I urge you to take a closer look. You are committing one of the cardinal sins of realtime programming by using sleep()'s to slow down your worker threads. Those things should be running like a bat out of hell … which they are, with my slight tweak - and are not, with your original code.

There are three states to consider: 1) not enough data in localBuffer, so pop data into it if available; 2) not enough data in localBuffer and none available through the pop; 3) enough data in localBuffer, so do the work.

In other words, 1) is “prepare”, 2) is “idle”, and 3) is “process”. Ideally you want to switch between 1)->3) as frequently and efficiently as possible. If you think you hit 2) too often, then congratulations - you’ve written a highly efficient processing thread. The logjam is further upstream.
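
To make those three states concrete, here’s a small sketch (reusing the hypothetical localBuffer/inputQueue names from the snippets above, plus an assumed hasDataAvailable() query):

enum class State { prepare, idle, process };

State nextState()
{
    if (localBuffer.hasEnoughData())
        return State::process;              // 3) enough data: do the work

    return inputQueue.hasDataAvailable()    // assumed query, not shown above
             ? State::prepare               // 1) pop into localBuffer
             : State::idle;                 // 2) nothing to do yet
}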

I still disagree to some extent. First of all, to avoid confusion: I’m comparing your code against my version without the sleep.

Just to make that clear once more, behind the popNewData call there is something like this:

void popNewData (BufferType& buf)
{
    // never pop more than the buffer still needs, and never more than the queue holds
    auto numElementsMissingInBuffer = requiredNumElements - buf.size();
    auto numElementsToPop = std::min (numElementsMissingInBuffer, queue.getNumElementsReady());

    for (auto i = 0; i < numElementsToPop; ++i)
        buf.append (queue.dequeue());
}

Furthermore, expect the buffer type to be some preallocated data structure where appending and clearing won’t lead to any memory allocations. Neither will my code append data to the buffer that won’t be processed, nor will it attempt to pop more data from the queue than is available.


In both code examples the if branch can only be taken after one or, more likely, multiple successful pop operations. There is no scenario where the work is executed twice in succession without at least one successful pop operation, because the buffered data is discarded after having been worked on in every case, so we need at least one pop in between. Do you agree with me so far?

Now let’s make up a few theoretical scenarios. For the sake of the example we assume that the local buffer must hold 10 elements in order to be ready for processing. In the tables below the horizontal separators mark loop iterations.

Scenario 1: The always filled source queue

Let’s first assume that there are always a lot more than 10 elements in the source queue. Here is what your solution would do

  | n queue | n buf   | code being executed 
---------------------------------------------------------------------------------
1 | inf     | 0       | if (localBuffer.hasEnoughData()) -> branch not taken
2 | inf     | 0 -> 10 | else {  inputQueue.popNewData (localBuffer); } -> 10 elements popped to buf
---------------------------------------------------------------------------------
3 | inf     | 10      | if (localBuffer.hasEnoughData()) -> branch taken
4 | inf     | 10      | theComputationalHeavyThing (localBuffer);
5 | inf     | 10 -> 0 | localBuffer.clear();
---------------------------------------------------------------------------------
6 | inf     | 0       | if (localBuffer.hasEnoughData()) -> branch not taken
7 | inf     | 0 -> 10 | else {  inputQueue.popNewData (localBuffer); } -> 10 elements popped to buf
---------------------------------------------------------------------------------
8 | inf     | 10      | if (localBuffer.hasEnoughData()) -> branch taken
9 | inf     | 10      | theComputationalHeavyThing (localBuffer);
10| inf     | 10 -> 0 | localBuffer.clear();

Here is what my code would do

  | n queue | n buf   | code being executed 
---------------------------------------------------------------------------------
1 | inf     | 0 -> 10 | inputQueue.popNewData (localBuffer);  -> 10 elements popped to buf
2 | inf     | 10      | if (localBuffer.hasEnoughData()) -> branch taken
3 | inf     | 10      | theComputationalHeavyThing (localBuffer);
4 | inf     | 10 -> 0 | localBuffer.clear();
---------------------------------------------------------------------------------
5 | inf     | 0 -> 10 | inputQueue.popNewData (localBuffer);  -> 10 elements popped to buf
6 | inf     | 10      | if (localBuffer.hasEnoughData()) -> branch taken
7 | inf     | 10      | theComputationalHeavyThing (localBuffer);
8 | inf     | 10 -> 0 | localBuffer.clear();

Scenario 2: The mostly filled source queue

Let’s assume that elements are constantly added to the queue so that no pop operation is completely unsuccessful. Below I just made up random growth in the number of elements in the queue, such that the thread constantly has some work to do. Your code:

  | n queue | n buf   | code being executed 
---------------------------------------------------------------------------------
1 | 5       | 0       | if (localBuffer.hasEnoughData()) -> branch not taken
2 | 5 -> 0  | 0 -> 5  | else {  inputQueue.popNewData (localBuffer); } -> 5 elements popped to buf
---------------------------------------------------------------------------------
3 | 3       | 5       | if (localBuffer.hasEnoughData()) -> branch not taken
4 | 5 -> 0  | 5 -> 10 | else {  inputQueue.popNewData (localBuffer); } -> 5 elements popped to buf
---------------------------------------------------------------------------------
5 | 2       | 10      | if (localBuffer.hasEnoughData()) -> branch taken
6 | 6       | 10      | theComputationalHeavyThing (localBuffer);
7 | 8       | 10 -> 0 | localBuffer.clear();
---------------------------------------------------------------------------------
8 | 11      | 0       | if (localBuffer.hasEnoughData()) -> branch not taken
9 | 12 -> 2 | 0 -> 10 | else {  inputQueue.popNewData (localBuffer); } -> 10 elements popped to buf
---------------------------------------------------------------------------------
10| 3       | 10      | if (localBuffer.hasEnoughData()) -> branch taken
11| 6       | 10      | theComputationalHeavyThing (localBuffer);
12| 7       | 10 -> 0 | localBuffer.clear();
---------------------------------------------------------------------------------
13| 8       | 0       | if (localBuffer.hasEnoughData()) -> branch not taken
14| 8 -> 0  | 0 -> 8  | else {  inputQueue.popNewData (localBuffer); } -> 8 elements popped to buf
---------------------------------------------------------------------------------

My code:

  | n queue | n buf   | code being executed 
---------------------------------------------------------------------------------
1 | 5 -> 0  | 0 -> 5  | inputQueue.popNewData (localBuffer); -> 5 elements popped to buf
2 | 3       | 5       | if (localBuffer.hasEnoughData()) -> branch not taken
---------------------------------------------------------------------------------
3 | 5 -> 0  | 5 -> 10 | inputQueue.popNewData (localBuffer); -> 5 elements popped to buf
4 | 2       | 10      | if (localBuffer.hasEnoughData()) -> branch taken
5 | 6       | 10      | theComputationalHeavyThing (localBuffer);
6 | 8       | 10 -> 0 | localBuffer.clear();
---------------------------------------------------------------------------------
7 | 11 -> 1 | 0 -> 10 | inputQueue.popNewData (localBuffer); -> 10 elements popped to buf
8 | 2       | 10      | if (localBuffer.hasEnoughData()) -> branch taken
9 | 6       | 10      | theComputationalHeavyThing (localBuffer);
10| 7       | 10 -> 0 | localBuffer.clear();
---------------------------------------------------------------------------------
11| 8 -> 0  | 0 -> 8  | inputQueue.popNewData (localBuffer); -> 8 elements popped to buf
12| 0       | 8       | if (localBuffer.hasEnoughData()) -> branch not taken
---------------------------------------------------------------------------------

Scenario 3: The not always filled source queue

This is the scenario that is closest to the real world conditions I’m experiencing. There are times where the queue is not being filled up again and the worker thread has to wait for new data.

Your code:

  | n queue | n buf   | code being executed 
---------------------------------------------------------------------------------
1 | 0       | 0       | if (localBuffer.hasEnoughData()) -> branch not taken
2 | 0       | 0       | else {  inputQueue.popNewData (localBuffer); } -> idle, no elements popped to buf
---------------------------------------------------------------------------------
3 | 0       | 0       | if (localBuffer.hasEnoughData()) -> branch not taken
4 | 0       | 0       | else {  inputQueue.popNewData (localBuffer); } -> idle, no elements popped to buf
---------------------------------------------------------------------------------
5 | 2       | 0       | if (localBuffer.hasEnoughData()) -> branch not taken
6 | 3 -> 0  | 0 -> 3  | else {  inputQueue.popNewData (localBuffer); } -> 3 elements popped to buf
---------------------------------------------------------------------------------
7 | 0       | 3       | if (localBuffer.hasEnoughData()) -> branch not taken
8 | 0       | 3       | else {  inputQueue.popNewData (localBuffer); } -> idle, no elements popped to buf
---------------------------------------------------------------------------------
9 | 0       | 3       | if (localBuffer.hasEnoughData()) -> branch not taken
10| 0       | 3       | else {  inputQueue.popNewData (localBuffer); } -> idle, no elements popped to buf
---------------------------------------------------------------------------------
11| 5       | 3       | if (localBuffer.hasEnoughData()) -> branch not taken
12| 8 -> 1  | 3 -> 10 | else {  inputQueue.popNewData (localBuffer); } -> 7 elements popped to buf
---------------------------------------------------------------------------------
13| 1       | 10      | if (localBuffer.hasEnoughData()) -> branch taken
14| 7       | 10      | theComputationalHeavyThing (localBuffer);
15| 7       | 10 -> 0 | localBuffer.clear();
---------------------------------------------------------------------------------
16| 7       | 0       | if (localBuffer.hasEnoughData()) -> branch not taken
17| 7 -> 0  | 0 -> 7  | else {  inputQueue.popNewData (localBuffer); } -> 7 elements popped to buf
---------------------------------------------------------------------------------
18| 0       | 0       | if (localBuffer.hasEnoughData()) -> branch not taken
19| 0       | 0       | else {  inputQueue.popNewData (localBuffer); } -> idle, no elements popped to buf
---------------------------------------------------------------------------------

My code:

  | n queue | n buf   | code being executed 
---------------------------------------------------------------------------------
1 | 0       | 0       | inputQueue.popNewData (localBuffer); -> idle, no elements popped to buf
2 | 0       | 0       | if (localBuffer.hasEnoughData()) -> branch not taken
---------------------------------------------------------------------------------
3 | 0       | 0       | inputQueue.popNewData (localBuffer); -> idle, no elements popped to buf
4 | 0       | 0       | if (localBuffer.hasEnoughData()) -> branch not taken
---------------------------------------------------------------------------------
5 | 2       | 0 -> 2  | inputQueue.popNewData (localBuffer); -> 2 elements popped to buf
6 | 1       | 2       | if (localBuffer.hasEnoughData()) -> branch not taken
---------------------------------------------------------------------------------
7 | 1 -> 0  | 2 -> 3  | inputQueue.popNewData (localBuffer); -> 1 element popped to buf
8 | 0       | 3       | if (localBuffer.hasEnoughData()) -> branch not taken
---------------------------------------------------------------------------------
9 | 0       | 3       | inputQueue.popNewData (localBuffer); -> idle, no elements popped to buf
10| 0       | 3       | if (localBuffer.hasEnoughData()) -> branch not taken
---------------------------------------------------------------------------------
11| 8 -> 1  | 3 -> 10 | inputQueue.popNewData (localBuffer); -> 7 elements popped to buf
12| 1       | 10      | if (localBuffer.hasEnoughData()) -> branch taken
13| 7       | 10      | theComputationalHeavyThing (localBuffer);
14| 7       | 10 -> 0 | localBuffer.clear();
---------------------------------------------------------------------------------
15| 7 -> 0  | 0 -> 7  | inputQueue.popNewData (localBuffer); -> 7 elements popped to buf
16| 0       | 7       | if (localBuffer.hasEnoughData()) -> branch not taken
---------------------------------------------------------------------------------
17| 0       | 7       | inputQueue.popNewData (localBuffer); -> idle, no elements popped to buf
18| 0       | 7       | if (localBuffer.hasEnoughData()) -> branch not taken
---------------------------------------------------------------------------------

My conclusion

I cannot really see how my approach is so much worse than yours when comparing them in the example cases above. Given the knowledge that after every successful execution of the workload we need to pop some data before another successful run is possible, I see no resources being wasted by always polling the queue before checking again whether there are now enough samples.


Anyway, I think the interesting point is actually the last thing that you mentioned

As written in my initial post, indeed a lot of time is spent in the idle state, where my worker thread checks again and again whether there is new data in the queue. Since the source of my data is audio samples arriving in the realtime processing callback, I cannot change much about the interval and speed at which data is generated. Executing the work directly on the realtime thread is out of the question, since it does some non-realtime-safe stuff, and even if we assumed that could be changed, with small block sizes the time box for that work would surely exceed the time available to process a single block of audio. So working on the data on a separate thread as soon as there is enough of it is the only option here. There will always be a period where new data is collected from the audio thread, and then a point in time where the collected data has to be processed as soon as possible. So again I wonder: how else should my worker thread spend its time waiting for new data to work on, in an energy-efficient way, if not by sleeping for a certain amount of time?

An alternative to using sleep() is to use select() with empty fd_sets. This won’t burn through energy as much as sleep() does, and has the added bonus that you can also use those fd_sets productively, if you want to do something useful during the idle phase.
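
For the record, a rough sketch of that select()-as-timer idea on POSIX, with no file descriptors actually watched and an assumed 1 ms timeout:

#include <sys/select.h>

void shortWait()
{
    timeval tv { 0, 1000 }; // 0 s, 1000 us

    // with nfds == 0 and no fd sets, select() simply blocks the calling
    // thread until the timeout expires
    select (0, nullptr, nullptr, nullptr, &tv);
}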

But … why idle in the first place? Do some additional work in that idle state. Maybe your thread can do more than just busy-wait while it waits for its queues to fill?

As far as your analysis is concerned, you know your code and your runtime behaviour better than anyone. But I’m not sure you’re really factoring in the performance hit you’re taking by always popping data and then processing it. Real profiling is the only way for you to determine the extent to which you are ‘wasting’ time, or spending too much time in code paths that don’t do anything … Please remember, the code that you’re looking at isn’t the code you’re generating. Spending a lot of time in an expensive operation can look like you’re handling things as quickly as possible, but if you’re doing inefficient work during that period (queue handling), it’s possible your thread is performing poorly while you misinterpret it as ‘working as fast as possible’.

EDIT: From Apple’s docs - maybe you should look into this note about pthread_yield_np():

Don’t Keep Threads Active And Idle

Keeping a thread active while it tries to acquire a resource might minimize the overhead of switching thread contexts, but at great cost. When you keep a thread active but doing nothing, you prevent a CPU core from doing other work. On Apple silicon, this behavior exacerbates performance issues in producer-consumer algorithms when the consumer thread runs on a p-core and the producer runs on an e-core. Instead, eliminate spin locks and other spin-wait code that causes your thread to hold on to a core. Replace them with an os_unfair_lock, a condition variable, or a standard mutex that lets your thread block.

In addition to avoiding spin locks, avoid pthread_yield_np and equivalent functions that yield the thread’s time to higher-priority threads instead of blocking outright. Yield-related APIs allow the current thread to continue running when the waiting threads have lower priorities. This behavior prevents the system from scheduling some lower-priority threads and doing productive work.

For more information about synchronization primitives for threads, see the os framework or the pthreads API.


I recommend using a FIFO (juce::AbstractFifo) and std::condition_variable to do some efficient work submission.
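
For illustration, a minimal sketch of the writer side of such a FIFO, assuming a preallocated std::vector<float> as backing storage (JUCE headers assumed available):

#include <algorithm>
#include <vector>

juce::AbstractFifo fifo { 4096 };
std::vector<float> storage (4096);

void push (const float* samples, int numSamples)
{
    int start1, size1, start2, size2;
    fifo.prepareToWrite (numSamples, start1, size1, start2, size2);

    // the FIFO may wrap around, so there are up to two contiguous regions
    std::copy_n (samples, size1, storage.data() + start1);
    std::copy_n (samples + size1, size2, storage.data() + start2);

    fifo.finishedWrite (size1 + size2);
}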

I think everything else has already been mentioned.

The Apple M-series scheduler is still a relatively unknown beast, and a lot of what we do know comes from the iOS/iPad platforms. IIRC, a thread that is put to sleep often will wake up in a ‘low-performance’ state (E core or low clocks) and be promoted up through the ranks, as it were, onto a P core if it runs for long enough / is deemed suitable for P cores.

I recommend this ADC talk on the subject.

We’re currently putting together AudioUnit workgroup support, which might help improve performance (although you would likely need to restructure your work model).

You could pre-warm the thread (spin instead of sleep / std::condition_variable::wait) when your buffer is almost full, removing the delay once your function is ready to go.
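
A rough sketch of that pre-warming idea applied to the original loop, assuming a hypothetical isNearlyFull() threshold check on the local buffer:

while (! threadShouldExit())
{
    inputQueue.popNewData (localBuffer);

    if (localBuffer.hasEnoughData())
    {
        theComputationalHeavyThing (localBuffer);
        localBuffer.clear();
    }
    else if (localBuffer.isNearlyFull()) // hypothetical threshold check
    {
        // busy-spin: stay runnable so the thread is already "hot" (and more
        // likely on a P core) at the moment the buffer becomes ready
    }
    else
    {
        sleep (1); // far from ready: sleeping is fine here
    }
}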


Thank you both for your replies. I’ll be working on the topic again next week and will have an in-depth look at all your suggestions. I’m aware of the audio workgroups; I attended the talk on that topic at the last ADC, but if I remember everything right, this will only be available for AU plugins at the moment, as VST3 and AAX don’t integrate any interface to access the host workgroup – am I correct here?

Do I get you right that you’d use a condition variable to schedule work between realtime audio threads and worker threads? I mean, this would obviously involve calling std::condition_variable::notify_one from the audio thread, which seems like one of the classic things that should always be avoided in realtime-safe programming, so this proposal is pretty surprising to me.

I believe the general consensus is ‘locks are OK as long as they’re not contended’. That being said, I don’t think explicit locking on the AudioThread is required in this example:

// AudioThread
fifo.write (samples);

if (isBufferFull())
    cv.notify_one();

// WorkThread
std::unique_lock sl{lock};
if (cv.wait_for (sl, std::chrono::milliseconds (1), [this] { return isBufferFull(); }))
{
    doWork();
}

// Or if you want to minimise contention time
const bool ready = [this]
{
    std::unique_lock sl{lock};
    return cv.wait_for (sl, std::chrono::milliseconds (1), [this] { return isBufferFull(); });   
}();

if (ready)
    doWork();

You will need to make sure your FIFO is atomic/lock-free, which should ensure write ops are ordered correctly. Not locking when calling notify will potentially (likely) mean you occasionally miss a wake-up on the WorkThread, but this is mitigated somewhat by the short wait time and the additional buffer-state check.

Of course, take this with a pinch of salt. Nobody can agree on the proper use of condition_variable::notify and locking; even the C++ standard says no but also yes.


After having read a bit about condition variables, there is still one open question for me. Given that my example now scales to a real-world scenario with multiple plugin instances, I guess it would be best not to create a worker thread per plugin instance but rather a pool of worker threads matching the number of performance cores on the machine – at least this is one of my takeaways from the brilliant ADC talk that you linked. Given that the plugins run in a host application that will most likely render individual tracks multithreaded, we have a multiple-writer, multiple-reader situation.

My approach would be one fifo per plugin instance. Instead of enqueuing individual samples to the fifo, I’d rather collect entire blocks of samples on the audio thread, push a whole block to the fifo once there are enough samples, and notify the thread pool after that.
Now all worker threads should wait on any of the audio threads signalling that a sample block is ready, check each fifo and dequeue from whichever fifo has data in it, run the processing, then do another run until all fifos are empty again. Of course this would require some synchronisation on the reader side of the fifos, since we need to ensure that only one thread at a time reads from a specific fifo.

In the case of a single producer thread and a pool of worker threads, a solution would be calling notify_one on a condition variable, after which one of the workers would wake up to do the work. I find a lot of examples of this pattern online. What I didn’t find, though, were examples for the multi-writer case. Is it safe to call notify_one on the same condition variable from multiple threads simultaneously? Or are there notification mechanisms better suited to a multiple-writer/multiple-reader scenario? Or is a high-priority worker thread per plugin instance not as bad as I think in the end?

After some more research, std::counting_semaphore seems like a good candidate. The semaphore would represent the number of work items waiting to be processed. Every time a process callback has enqueued some data to work on, it could call release. All worker threads would block in an acquire call and start immediately when there is work to do.
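
A minimal sketch of that idea (C++20), assuming one release per completed block and a hypothetical drainFifosAndProcess() on the worker side:

#include <semaphore>

std::counting_semaphore<> workAvailable { 0 };

// Realtime thread, after pushing a full block to its fifo:
// (release is essentially an atomic increment plus a possible wake-up
// syscall; whether that is acceptable on the audio thread is the same
// judgment call as the notify_one discussion above)
workAvailable.release();

// Worker thread:
while (! threadShouldExit())
{
    workAvailable.acquire();   // blocks until at least one block is ready
    drainFifosAndProcess();    // hypothetical: find and process ready fifos
}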

Any opinion on that?