FR: Thread-Priority vs Efficiency/Performance Cores

I can see the benefit of the API, but according to the Apple docs:

Important
Your app or plug-in requires no additional work if the only real-time thread it uses is the one that the audio frameworks provide. The audio system automatically joins its real-time threads to an audio device workgroup.

This is the 99% use case for JUCE, and we try not to add any platform-exclusive
features.

I do plan to revisit this after Unicode support is completed.

I’m in the 1%, but maybe this will become more of a common requirement, as in future the number of cores on CPUs is likely to keep going up - so I think it’s important to not leave all that CPU power on the table (64 core processors are a thing).
Maybe that’s the job of plugin formats like CLAP which I heard would allow a DAW to split plugin processing across cores (maybe optionally for each synth voice as I do in my plugin code), however many of us also deliver standalone apps so plugin formats that handle it won’t help.

I totally agree that JUCE should avoid adding platform-exclusive features to the API. That’s one reason we use JUCE: it wraps all the platform-dependent stuff and hides it behind a common API (e.g. the FloatVectorOperations functions). In an ideal world, any real-time/audio priority threads I request in my plugin would automatically be assigned to the appropriate audio workgroup without me having to think about it (assuming a Mac build). Something similar is probably required for the latest Intel chips too. I hope the JUCE API will just transparently do the most appropriate thing for each target system.

DAWs that use the JUCE API, like Tracktion Waveform, will probably need this audio workgroups feature too, for more reliable processing of tracks across multiple cores.

For the moment my plugin’s multithreaded processing works pretty well on a MacBook Air M1, but performs much worse on some users’ M1 MacBook Pros (with only 2 efficiency cores vs 4 on the Air). I’m pretty sure the reason is that high-priority threads are getting dropped to E-cores (CPU load graphs from a customer appear to confirm it), so audio workgroup association would most likely resolve that.

(There are no performance issues with the same multi-threaded processing code on Intel CPUs using the same JUCE Thread API, although I don’t know about the very latest generation of Intel big/little chips.)

Anyway, keeping my fingers crossed for this future update.

I agree that adding a bunch of platform-specific stuff counteracts the main idea of JUCE. But I’d consider making an exception here. As far as I can see, there are gonna be two possible scenarios in the near future. With the Mac Studio, Apple suddenly released a real powerhouse of a computer, and having a lot of processing power is suddenly very cheap. It might take some time for every enterprise to figure that out, but I’ve talked to multiple startups with different ideas that are gonna utilise this new ability. Being able to unleash that power efficiently with JUCE would really promote the overall framework.

And with the new Mac Studio on the market, the other companies will either shape up and release a competitive product (including software to make use of that power), which will put JUCE one step ahead by already having the API for that, or they won’t, which leaves JUCE with either having to implement it just for macOS or losing users because they can’t access the power of the system they are building for.

Hi,

I did some tests with the new os_workgroup API, and it really improves the scheduling of multiple RT threads across P-cores and E-cores, for both worst-case peak and average CPU load on the main audio thread.

It is enough to patch juce_AudioProcessor.h with the following extension

  /**
   macOS 11 specific.
   
   see https://developer.apple.com/documentation/audiotoolbox/workgroup_management/understanding_audio_workgroups
   
   The system calls this on the render thread,
   immediately before any render request.

   The new workgroup may be null in the case of a non-real-time
   render context, or a real-time thread that is not part of any
   workgroup.

   workgroup should be cast to an os_workgroup_t in Apple-specific client code.
   If the pointer is null, the threads should detach from their current workgroup.
   */
  virtual void macOSworkgroupDidChange (void* workgroup = nullptr) {}

and for AUv2 I just had to patch juce_AU_Wrapper.mm with

    ComponentResult GetProperty (AudioUnitPropertyID inID,
                                 AudioUnitScope inScope,
                                 AudioUnitElement inElement,
                                 void* outData) override
    {
        if (inScope == kAudioUnitScope_Global)
        {
            switch (inID)
            {
              case kAudioUnitProperty_RenderContextObserver:
              {
                if(auto *auRenderContextObserver = (AURenderContextObserver*)outData)
                {
                  // Create block safe pointer to processor.
                  __block JuceAU *self = this;

                  *auRenderContextObserver = ^(const AudioUnitRenderContext *context){
                    if (auto renderContext = context) {
                      self->juceFilter->macOSworkgroupDidChange((void*)renderContext->workgroup);
                    }
                    else {
                      /**
                       The new workgroup may be null in the case of a nonreal-time
                       render context, or a real-time thread that is not part of any
                       workgroup.
                       */
                      self->juceFilter->macOSworkgroupDidChange(nullptr);
                    }
                  };
                  return noErr;
                }
                break;
              }

and

    //==============================================================================
    ComponentResult GetPropertyInfo (AudioUnitPropertyID inID,
                                     AudioUnitScope inScope,
                                     AudioUnitElement inElement,
                                     UInt32& outDataSize,
                                     bool& outWritable) override
    {
        if (inScope == kAudioUnitScope_Global)
        {
            switch (inID)
            {
              case kAudioUnitProperty_RenderContextObserver:
                outWritable = false;
                outDataSize = sizeof(AURenderContextObserver);
                return noErr;
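
On the plugin side, the macOSworkgroupDidChange hook could be picked up roughly like this. This is only a sketch under assumptions, not JUCE code: ProcessorBase stands in for the patched juce::AudioProcessor, and VoicePoolProcessor, currentWorkgroup() and changeCount() are hypothetical names. The pointer is stored atomically because the callback arrives on the render thread while worker threads read it.

```cpp
#include <atomic>
#include <cassert>

// Stand-in for the patched juce::AudioProcessor (assumption: only the hook matters here).
struct ProcessorBase
{
    virtual ~ProcessorBase() = default;
    virtual void macOSworkgroupDidChange (void* /*workgroup*/) {}
};

// Hypothetical plugin class that remembers the most recent workgroup.
class VoicePoolProcessor : public ProcessorBase
{
public:
    // Called on the render thread; nullptr means "no workgroup, detach".
    void macOSworkgroupDidChange (void* workgroup) override
    {
        workgroupPtr.store (workgroup, std::memory_order_release);
        generation.fetch_add (1, std::memory_order_release);
    }

    // Worker threads read this; on macOS they would cast it to os_workgroup_t.
    void* currentWorkgroup() const noexcept
    {
        return workgroupPtr.load (std::memory_order_acquire);
    }

    // Incremented on every change so workers can tell when to leave and re-join.
    unsigned changeCount() const noexcept
    {
        return generation.load (std::memory_order_acquire);
    }

private:
    std::atomic<void*> workgroupPtr { nullptr };
    std::atomic<unsigned> generation { 0 };
};
```

Note that the sketch keeps a raw pointer and does not retain the workgroup object; whether a retain is needed is something I have not verified.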

I haven’t looked at the Standalone wrapper yet, but according to the Apple docs the workgroup can be retrieved from the HAL (see the Apple Developer Documentation).

And hopefully VST3 and AAX will get the same kind of API extensions.

see How to get "OSWorkGroup" on Macs with Apple Silicon - VST 3 SDK - Steinberg Forums

Let me know if you prefer a pull request.

It would be great to have an example for the Standalone Wrapper. This tool might be much more powerful in a standalone application, since for plugins the host is often already doing a good job of multithreading everything.

Small update: I had crashes related to block reference counting in Objective-C (different according to macOS versions…)

So I modified my patch to use the juce helper CreateObjCBlock. Unfortunately, in this context, autorelease does not seem to be the right ref counting policy to adopt either.

So I finally patched juce as follows:

In juce_AU_Wrapper.mm

    void renderContextObserverCallback(const AudioUnitRenderContext *context)
    {
      if (context)
      {
        juceFilter->macOSworkgroupDidChange((void*)context->workgroup);
      }
      else
      {
        /**
         The new workgroup may be null in the case of a nonreal-time
         render context, or a real-time thread that is not part of any
         workgroup.
         */
        juceFilter->macOSworkgroupDidChange(nullptr);
      }
    }
  
    ComponentResult GetProperty (AudioUnitPropertyID inID,
                                 AudioUnitScope inScope,
                                 AudioUnitElement inElement,
                                 void* outData) override
    {
        if (inScope == kAudioUnitScope_Global)
        {
            switch (inID)
            {
              case kAudioUnitProperty_RenderContextObserver:
              {
                if(auto *auRenderContextObserver = (AURenderContextObserver*)outData)
                {
                  *auRenderContextObserver = CreateObjCBlockCopy(this, &JuceAU::renderContextObserverCallback);
                  return noErr;
                }

and in juce_mac_ObjCHelpers.h

  template <typename Class, typename Fn, typename Result, typename... Params>
  auto createObjCBlockImplCopy (Class* object, Fn func, Signature<Result (Params...)>)
  {
    __block auto _this = object;
    __block auto _func = func;
    
    return [^Result (Params... params) { return (_this->*_func) (params...); } copy];
  }

...

  template <typename Class, typename MemberFunc>
  auto CreateObjCBlockCopy (Class* object, MemberFunc fn)
  {
    return detail::createObjCBlockImplCopy (object, fn, detail::getSignature (fn));
  }

This patch works fine in both debug and release, so I’m stopping there. But even after reading the Apple documentation, I’m still unsure about the proper ref counting semantics for the blocks that are expected here.

HTH

P.S.: I also discovered by trial and error that, depending on the OS version you are using, the scheduler may or may not require threads to have real-time priority before they are accepted into a workgroup. (It seems that real-time promotion is automatic but not guaranteed on recent macOS versions.)

For those who still haven’t figured this out, here is a gist with all my changes based on the help from @rmuller in the thread above, and other forum posts on the subject - I’ve tested this working with AU plugins on an M1-based MacBook Air:

There are 3 files in the gist, juce_AudioProcessor.h, juce_AU_Wrapper.mm, and juce_mac_ObjCHelpers.h (it’s uploaded as gistfile1.txt) - I suggest you just diff those files to see the changes, but also see my notes at the top of each file.
I don’t fully understand the changes in juce_mac_ObjCHelpers.h - just taken from rmuller’s example code.

I was stuck for a long time with unexplained crashes until I realised I was missing the ‘thread_local’ keyword where the os_workgroup_join_token_s structure is created.
My implementation in AudioProcessorHolder class appears to work reliably - and I create and destroy audio thread pools regularly (every time a patch is loaded or my voice count settings change), but I may have missed something… for example I’m not sure if I should do anything with the ‘nullWorkGroupEvent()’ function.

See also the post https://forum.juce.com/t/os-workgroup-join-consistently-returns-einval-cant-join-audio-workgroup/54240

Whilst I added this for AU plugins, I still don’t have a solution for VST3 or standalone.
Anyway, I hope this kind of implementation can be added to JUCE officially because I really don’t want to have to hack in such OS specific code and maintain even more custom bits of the JUCE codebase.

Additional note - threads appear to join the audio workgroup correctly, and I noticed better behaviour in terms of not getting pops/clicks. However, it does not appear that all thread processing is automatically moved to P-cores. Rather, the OS makes sure to schedule those threads together so that they don’t hold each other up significantly. This also requires that your thread tasks take a very similar amount of processing time (in my case I’m mostly processing identical audio chains, one for each voice).
Another note: I saw that ‘startRealtimeThread’ did not use the highest priority internally by default. When I use that with my standalone build, the process load goes up significantly if the window is no longer the active application window. Changing the JUCE code to set the priority to ‘Highest’ made that issue go away - I’m not sure if that classifies as a JUCE bug?

Edit: small update on this - to prevent a crash on Intel Mac builds, where the workgroup join function will fail (even though there is a render context observer event), I had to tweak my code so that I only store thread/token assignments if os_workgroup_join succeeds:

    int joinCurrentAuWorkGroup(void* threadId) //  call from thread at run, before while loop
    {
        // Join this thread to the workgroup.
        if (@available(macOS 11.0, *))
        {
            thread_local os_workgroup_join_token_s joinToken{};
            
            const int result = os_workgroup_join(currentWorkgroup, &joinToken);
            
            if (result == 0)
            {
                AudioProcessorHolder::WorkgroupAndToken toStore = {currentWorkgroup , &joinToken};
                threadTokenList.emplace_back(threadId, toStore);
                return 1;// Success.
            }
            else if (result == EALREADY)
            {
                // The thread is already part of a workgroup that can't be
                // nested in the specified workgroup.
                return -1;
            }
            else if (result == EINVAL)
            {
                // The workgroup has been canceled.
                return -1;
            }
        }
        return -1;
    }

For the moment, I am assuming that the Workgroups API will never be used on Intel-based Macs, unless one day Apple decides to go back to Intel chips with the new big/little architecture… so it’s probably safe to just #ifdef out all the workgroups-related code with #ifdef JUCE_ARM.

FYI - see MacOS Audio Thread Workgroups - #13 by wavesequencer also for my hacky solution for standalone.

I still don’t know how to retrieve the workgroup or current output device name (from which I could search for the workgroup) for the case of Mac VST3 - as mentioned in that post… I’d appreciate a pointer in the right direction.

Following the release of the M2 MBP, can the JUCE team add this to at least the AU version?

Thanks !

Yes - it would be great if both standalone and AU at least provided functions for joining and leaving the current workgroup, as optionally implementable virtual functions which register the joinCurrentWorkgroup and leaveCurrentWorkgroup callback function pointers.
I would not recommend changing the JUCE Thread class any further, since joining and leaving the workgroup is the responsibility of the user on each thread’s run/exit. Simply provide these optional functions, plus built-in automatic registration of the current audio workgroup within the AU wrapper and standalone filter window classes, along with the extra thread-workgroup tracking (threadTokenList) needed for leave requests to work correctly.
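
To make the suggestion concrete, one possible shape for such an API is sketched below. None of these names exist in JUCE: WorkgroupCallbacks and WorkgroupRegistry are hypothetical, and the idea is simply that a wrapper (AU or standalone) fills in the callbacks while user threads call join() on entry and leave() on exit.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical: callbacks a wrapper would register.
// Both stay empty when the platform or plugin format has no workgroups.
struct WorkgroupCallbacks
{
    std::function<bool()> joinCurrentWorkgroup;   // true if the calling thread joined
    std::function<void()> leaveCurrentWorkgroup;  // undoes a successful join
};

// Hypothetical holder that would live next to the processor.
class WorkgroupRegistry
{
public:
    // Not thread-safe: assumed to be called before any worker thread starts.
    void registerCallbacks (WorkgroupCallbacks cb) { callbacks = std::move (cb); }

    // Safe to call even when nothing has been registered.
    bool join() const
    {
        return callbacks.joinCurrentWorkgroup ? callbacks.joinCurrentWorkgroup() : false;
    }

    void leave() const
    {
        if (callbacks.leaveCurrentWorkgroup)
            callbacks.leaveCurrentWorkgroup();
    }

private:
    WorkgroupCallbacks callbacks;
};
```

With this shape, plugin code never touches OS-specific types, and a build without workgroup support just sees join() return false.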

BTW - I had a crash when rendering audio with AU plugins because, at least in Ableton Live, the current workgroup becomes NULL during a render. I assume this is because audio renders are not real-time and so are not associated with the hardware IO thread’s workgroup… so there needs to be an extra check for that condition:

    int joinCurrentAuWorkGroup(void* threadId) //  call from thread at run, before while loop
    {
        // Join this thread to the workgroup.
        if (@available(macOS 11.0, *))
        {
            if(currentWorkgroup != nullptr) // EXTRA CHECK!
            {
                thread_local os_workgroup_join_token_s joinToken{};
                const int result = os_workgroup_join(currentWorkgroup, &joinToken);
                

‘os_workgroup_join(currentWorkgroup, &joinToken);’ will crash if the ‘currentWorkgroup’ is NULL as unfortunately the Apple API doesn’t check for that case.
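
Putting the lessons from this thread together (a null check, a per-thread token, and only recording a membership when the join succeeds), a scoped guard could look like the sketch below. It is written against injected join/leave callables so that it compiles on any platform; on macOS 11+ these would presumably wrap os_workgroup_join and os_workgroup_leave. ScopedWorkgroupJoin, JoinToken and the callable signatures are hypothetical, and the sketch assumes at most one guard per thread at a time.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Stand-in for os_workgroup_join_token_s (assumption: opaque per-thread storage).
struct JoinToken { int placeholder = 0; };

using JoinFn  = std::function<int (void* workgroup, JoinToken* token)>;   // 0 on success
using LeaveFn = std::function<void (void* workgroup, JoinToken* token)>;

// Joins in the constructor and leaves in the destructor, on the same thread.
class ScopedWorkgroupJoin
{
public:
    ScopedWorkgroupJoin (void* workgroupToJoin, const JoinFn& join, LeaveFn leaveFn)
        : workgroup (workgroupToJoin), leave (std::move (leaveFn))
    {
        // Offline renders report a null workgroup; never pass that to the join call.
        if (workgroup != nullptr && join)
            joined = (join (workgroup, &token()) == 0);
    }

    ~ScopedWorkgroupJoin()
    {
        // Only leave what was actually joined (the join can fail, e.g. on Intel Macs).
        if (joined && leave)
            leave (workgroup, &token());
    }

    ScopedWorkgroupJoin (const ScopedWorkgroupJoin&) = delete;
    ScopedWorkgroupJoin& operator= (const ScopedWorkgroupJoin&) = delete;

    bool isJoined() const noexcept { return joined; }

private:
    // One token per thread, mirroring the thread_local token used above.
    static JoinToken& token()
    {
        thread_local JoinToken t;
        return t;
    }

    void* workgroup = nullptr;
    LeaveFn leave;
    bool joined = false;
};
```

A worker would create the guard at the top of its run function, before the processing loop, so the leave happens automatically when the thread exits.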