os_workgroup_join consistently returns EINVAL - Can't join Audio Workgroup

Hi all!

I am trying to get OS audio workgroups working for my application, but I keep running into the same problem: the workgroup always appears to be “Cancelled” when I try to join it.

I can replicate the issue with the code example in the main documentation: Apple Developer Documentation

The code I use can be found here: Error in os_workgroup_join · GitHub. Every time it gets to os_workgroup_join, it returns EINVAL (and hits the assert).

Has anyone gotten this to work? Am I missing something dumb?

Thanks for the help!!

Closing my own ticket!

https://developer.apple.com/forums/thread/697874

"
Figured this out with the help of the technical Apple support.

The worker threads need to have a real-time constraint set before you can join them to the workgroup. This can be done via thread_policy_set, specifying THREAD_TIME_CONSTRAINT_POLICY.

This isn’t documented, but it actually makes sense, as joining a workgroup quite likely changes the thread’s time constraints.
"
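A minimal sketch of that step, meant to be called by the worker thread on itself before it calls os_workgroup_join. The helper name promoteCurrentThreadToRealtime and the choice of computation = half the period are assumptions made for illustration, not values recommended by Apple. On non-Apple platforms the sketch simply returns false.

```cpp
#include <cstdint>

#if defined(__APPLE__)
  #include <mach/mach.h>
  #include <mach/mach_time.h>
  #include <mach/thread_policy.h>
  #include <pthread.h>
#endif

// Hypothetical helper: gives the calling thread a time-constraint (real-time)
// policy. periodMs is the expected interval between wake-ups in milliseconds.
// Returns true on success; always returns false on non-Apple platforms.
bool promoteCurrentThreadToRealtime(double periodMs)
{
    if (periodMs <= 0.0)
        return false;

#if defined(__APPLE__)
    mach_timebase_info_data_t timebase{};
    if (mach_timebase_info(&timebase) != KERN_SUCCESS || timebase.numer == 0)
        return false;

    // Mach absolute-time ticks per millisecond.
    const double ticksPerMs = (double) timebase.denom * 1000000.0 / (double) timebase.numer;
    const double periodTicks = periodMs * ticksPerMs;
    if (periodTicks > (double) UINT32_MAX)
        return false;

    thread_time_constraint_policy_data_t policy{};
    policy.period      = (uint32_t) periodTicks;
    policy.computation = policy.period / 2; // assumption: half the period
    policy.constraint  = policy.period;
    policy.preemptible = 1;

    // The kernel may reject out-of-range values; in that case we return false.
    const kern_return_t kr = thread_policy_set(pthread_mach_thread_np(pthread_self()),
                                               THREAD_TIME_CONSTRAINT_POLICY,
                                               (thread_policy_t) &policy,
                                               THREAD_TIME_CONSTRAINT_POLICY_COUNT);
    return kr == KERN_SUCCESS;
#else
    return false;
#endif
}
```

A worker would call this first and only attempt os_workgroup_join if it returned true.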

There was one additional problem with my example, namely the join token:

os_workgroup_join_token_s joinToken{};

// Return true if the method joined the thread to the workgroup.
bool joinThisThreadToWorkgroup(os_workgroup_t aWorkgroup) {
    // Join this thread to the workgroup.
    const int result = os_workgroup_join(aWorkgroup, &joinToken);

    // os_workgroup_join returns 0 on success.
    return result == 0;
}

The join token MUST be of type os_workgroup_join_token_s, and you pass its address in. I found this by searching for “Adding Parallel Real-Time threads to Audio workgroups” on the Apple developer forums.
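When debugging this, it can help to log what the return value of os_workgroup_join means. The sketch below is a hypothetical helper (describeWorkgroupJoinResult is not an Apple API). The EINVAL hint combines the "cancelled" case with the missing real-time constraint reported earlier in this thread, and the EALREADY case is my reading of Apple's sample, so treat the list as possibly incomplete.

```cpp
#include <cerrno>
#include <string>

// Hypothetical helper: turns the int returned by os_workgroup_join into a
// readable hint. The causes listed here are assumptions gathered from this
// thread and Apple's sample code; other causes may exist.
std::string describeWorkgroupJoinResult(int result)
{
    switch (result)
    {
        case 0:
            return "joined";
        case EINVAL:
            return "EINVAL: workgroup cancelled, or calling thread has no real-time constraint";
        case EALREADY:
            return "EALREADY: thread is already a member of a workgroup";
        default:
            return "unexpected error " + std::to_string(result);
    }
}
```

For example, logging describeWorkgroupJoinResult(result) right before the assert makes it clearer which case was hit.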

I documented the fully working code in a gist, to hopefully save a future user some pain and suffering.

gist: /cjappl/20fed4c5631099989af9ca900db68bfa

There is also a VERY helpful article here that describes some of the things in more detail:

https://justinfrankel.org/?article=854

Thanks for your notes/links here.
FYI - I found that so long as I create threads with the juce::Thread class, the real-time constraint stuff is already taken care of when using ‘startRealtimeThread’ (as of JUCE 7.0.3) - so that may be useful to know for some folks.
Also, I used os_workgroup_join_token_s joinToken{} to create a token when registering threads, but it kept crashing… until I added ‘thread_local’ in front.

e.g.
thread_local os_workgroup_join_token_s joinToken{};
const int result = os_workgroup_join(currentWorkgroup, &joinToken);

Nice, that’s good to hear you got it working!! I find the docs a little opaque, so it’s good to have some examples.

Have you played around much with the OS audio workgroups? Unfortunately I’m finding in some testing that they aren’t performing as well as I’d hoped. Comparing:

  • N threads promoted to realtime
  • N threads promoted to realtime, in an OS Audio workgroup

The first one has substantially fewer dropouts and audio problems when pushed to its limits. I’m not sure if I’m missing something, or if there is some way to debug or better understand what’s going on.

My experience so far with multiple threads added to the audio workgroup for AU plugins was that it appeared to provide substantially more reliable thread execution times (vs when I ran the same build before adding the workgroup thread registration), and removed some pops and clicks I was seeing before (on Apple silicon only) probably due to some threads holding up the completion of the tasks in my thread pool (because the scheduler demoted their priority at random).

As mentioned in the other JUCE discussion thread, from observation, I don’t think there is automatic promotion to always running the threads on P-Cores - it seems to depend on the overall system load, but I am assuming the Mac OS scheduler is at least keeping all same workgroup threads at the same priority vs dropping some threads to lower priority at random.
So depending on the overall CPU load, the amount of time taken to complete processing in the audio callback can appear to depend on if the thread pool threads were running on e-cores or p-cores. When the load gets higher on the CPU, the actual audio process time may remain the same if the threads then get bumped to P-cores.
I’m only guessing here in terms of how I see the load changing on the cores in the Mac CPU monitor depending on the complexity of the sound patch that I load (where the total number of threads changes depending on the patch).

I have not measured this scientifically yet though, and still need to get some testing time on a M1 Max based CPU system (hoping for some help from beta-testers there).

One thing to be careful of is the realtime thread policy settings: if the numbers are not set correctly and your threads don’t complete in time (either because of a wrong setting or some other issue in your processing threads, such as a lock), the threads may get demoted to a lower priority anyway (see the comments in the JUCE code below).

    //==============================================================================
    /** A selection of options available when creating realtime threads.

        @see startRealtimeThread
    */
    struct RealtimeOptions
    {
        /** Linux only: A value with a range of 0-10, where 10 is the highest priority. */
        int priority = 5;

        /* iOS/macOS only: A millisecond value representing the estimated time between each
                           'Thread::run' call. Your thread may be penalised if you frequently
                           overrun this.
        */
        uint32_t workDurationMs = 0;
    };

Interesting, that may indicate that I’m doing something incorrectly!! The few factors that I haven’t quite sorted out yet are:

  • Number of threads in my workgroup (and how to balance that with other threads in my app)
  • The period, computation, and constraint of the realtime params.

For the second, at least, I may try to duplicate what’s happening in JUCE in juce_mac_Threads.mm:

        mach_timebase_info_data_t timebase;
        mach_timebase_info (&timebase);

        const auto periodMs = realtimeOptions->workDurationMs;
        const auto ticksPerMs = ((double) timebase.denom * 1000000.0) / (double) timebase.numer;
        const auto periodTicks = (uint32_t) jmin ((double) std::numeric_limits<uint32_t>::max(), periodMs * ticksPerMs);

        policy.period = periodTicks;
        policy.computation = jmin ((uint32_t) 50000, policy.period);
        policy.constraint = policy.period;
        policy.preemptible = true;

That would at least give me a stable comparison point, instead of just guessing.
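To make the numbers concrete, here is the same arithmetic as a platform-independent sketch. The struct and function names are hypothetical, and the timebase is passed in as plain numer/denom so the function can be tried with example values. The 125/3 and 1/1 timebases used below are assumptions about what Apple silicon and Intel Macs commonly report; real code should take them from mach_timebase_info.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>

// Hypothetical container for the three tick values used by the policy.
struct ConstraintTicks
{
    uint32_t period;
    uint32_t computation;
    uint32_t constraint;
};

// Mirrors the quoted JUCE arithmetic: converts a period in milliseconds to
// Mach ticks and caps the computation value at 50000 ticks.
ConstraintTicks makeConstraintTicks(uint32_t periodMs, uint32_t numer, uint32_t denom)
{
    if (numer == 0)
        return { 0, 0, 0 }; // invalid timebase

    const double ticksPerMs = ((double) denom * 1000000.0) / (double) numer;
    const double maxTicks = (double) std::numeric_limits<uint32_t>::max();
    const auto period = (uint32_t) std::min(maxTicks, (double) periodMs * ticksPerMs);

    return { period, std::min<uint32_t>(50000u, period), period };
}
```

With an assumed 125/3 timebase, a 10 ms period is 240000 ticks and the 50000-tick cap works out to roughly 2.08 ms. With a 1/1 timebase the same cap would be 0.05 ms, so the cap seems to mean different durations on different hardware.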

The other suggestion that was given to me by Apple devs at ADC this year was to match the period, computation and constraint of the macOS Core Audio realtime thread, which I have yet to measure. The Justin Frankel article above goes into what one person found these values to be, but of course it’d be good to check myself.

If I end up doing some measurements I’ll try to post them to this thread.

Maybe I’m lazy, but I see no reason not to use the juce::Thread class with ‘startRealtimeThread’ - since it creates the necessary parameters based on a single milliseconds value.

I just set the expected process time to the value in ms based on the samplerate & blocksize (experts chime in please and tell me if this is bad - seems to work ok).
e.g.

int processTimeMaxMs = std::round((1000.0f/sampleRate) * bufferSize);

bool success = startRealtimeThread(RealtimeOptions{10, static_cast<uint32_t>(processTimeMaxMs)});

(Maybe would be better if the JUCE API uses a float value for the duration input.)
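A small sketch of that calculation, to show what the rounding does at small buffer sizes. The function names are made up for this example; only the formula (1000 / sampleRate * bufferSize) comes from the post above.

```cpp
#include <cmath>
#include <cstdint>

// Exact duration of one audio callback in milliseconds.
double callbackPeriodMs(double sampleRate, int bufferSize)
{
    if (sampleRate <= 0.0 || bufferSize <= 0)
        return 0.0;

    return 1000.0 * (double) bufferSize / sampleRate;
}

// The same value rounded to whole milliseconds, as an integer-millisecond
// API would need it.
uint32_t roundedCallbackPeriodMs(double sampleRate, int bufferSize)
{
    return (uint32_t) std::lround(callbackPeriodMs(sampleRate, bufferSize));
}
```

128 samples at 48 kHz is about 2.67 ms and rounds to 3, but 32 samples at 96 kHz is about 0.33 ms and rounds to 0, which is one reason a floating-point duration could be nicer. Clamping the result to at least 1 might be a reasonable workaround.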

If your thread pool is for real-time processing, then no need to measure anything IMO - if your thread pool functions take longer than the available time in the callback then there is going to be a problem anyway (could probably even set this max time to half the callback time). I believe the main point is to give the OS the appropriate cue as to how to balance priorities vs all other system threads based on the expected time (and keep all those workgroup threads at the same priority) - the main thing is to be sure to specify a time that is longer than your threads take to process (otherwise the OS will demote them possibly), but small enough to keep the priority elevated vs other system threads.

You just have to remember, that any time the samplerate or blocksize changes, you need to delete and re-create all your pool threads (to register the new process max time) - it should be OK - since you can do that in ‘prepare to play’ (better not do in the audio callback itself). (Do the same if the HW device selection changes too - since workgroup will likely change. Here’s where I still have a problem with VST3 - maybe it will be possible to retrieve the workgroup from the main audio callback thread - something I need to investigate.)

If for some reason you need to change number of threads during normal audio operation, it’s best to disable real-time multithreaded calculations and maybe mute audio and then launch a thread pool re-allocation as an async function (un-registering threads from their assigned workgroup when they are stopped). This is how I’m doing it. I would advise against creating the threadpool threads anywhere other than in prepare to play/device change or an async function outside of the main audio callback.

One thing to do maybe is to add some instrumentation code (use a build option to remove it from the final release) that can record timestamps of the main audio-callback start/end and timestamps of the threadpool threads’ start/end (or deeper yet, the individual audio process functions)… I used Perfetto (https://ui.perfetto.dev/) to draw some traces for each real-time worker thread from a log created this way. It was extremely useful to validate that my thread pool was working as expected. Maybe the built-in Xcode debugger can also provide this visibility, but using my own log I can keep traces simple and only record information for the specific threads/functions I want to track.
(see Plugin editor window slow to open (since JUCE 6.1 possibly?) - #8 by wavesequencer for some example log screenshots)
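As a rough idea of what such a log could look like, here is a minimal sketch. It is not the code used for the screenshots; SpanLog and its members are hypothetical names. It writes the Chrome "Trace Event" JSON format, which, as far as I know, the Perfetto UI can open. Timestamps are passed in as microseconds and the caller decides which clock to use. Names are assumed to be string literals that need no JSON escaping.

```cpp
#include <algorithm>
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical fixed-capacity span log. record() takes no locks and does not
// allocate, so it is intended for real-time threads. toJson() allocates and
// should only run elsewhere, after the threads have stopped recording.
class SpanLog
{
public:
    struct Span { const char* name; int threadId; int64_t startUs; int64_t durationUs; };

    bool record(const char* name, int threadId, int64_t startUs, int64_t durationUs)
    {
        const size_t i = count.fetch_add(1, std::memory_order_relaxed);
        if (i >= spans.size())
            return false; // log full: drop the span rather than block

        spans[i] = { name, threadId, startUs, durationUs };
        return true;
    }

    // Emits one complete ("ph":"X") event per span.
    std::string toJson() const
    {
        const size_t n = std::min(count.load(), spans.size());
        std::ostringstream out;
        out << "{\"traceEvents\":[";
        for (size_t i = 0; i < n; ++i)
        {
            const Span& s = spans[i];
            out << (i ? "," : "")
                << "{\"name\":\"" << s.name << "\",\"ph\":\"X\",\"pid\":1"
                << ",\"tid\":" << s.threadId
                << ",\"ts\":" << s.startUs
                << ",\"dur\":" << s.durationUs << "}";
        }
        out << "]}";
        return out.str();
    }

private:
    std::array<Span, 4096> spans{};
    std::atomic<size_t> count{0};
};
```

Each worker would call record() once per job with its own thread id, and the result of toJson() would be written to a file from a non-real-time thread.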

When creating a threadpool, it may be best to try to limit to the max logical cores (there’s a JUCE function to get that value)… if you have more tasks than threads, your thread processing handler just needs to pull jobs from a queue and do as many jobs as it can each loop (I’m also pulling jobs in the main callback rather than just have it sit around waiting for the threadpool handler to complete jobs).
I don’t always follow that rule though - but seems to be OK to create more threads than logical cores - either there’s going to be some time slicing of threads by the OS, or you have to loop your thread pool process as many times as necessary to clear jobs - overall process time should be similar if your threads do similar jobs… the exact strategy here also depends on if you can predict the number of threads you need vs the number of logical cores available…
It might actually be better to create as many threads as required for parallel jobs and let the OS time-slice/queue their processing as it sees fit - however I don’t know if Mac or Windows would penalize an application that asks for too many threads (and what ‘too many’ actually is) - and if there is a limit there, there is also the idea to run parallel real-time audio tasks as separate processes… but I really don’t want to go there.
There is for sure some system overhead to synchronising threads, and that may become too high as a percentage of available time with really short buffer sizes… I find I can get down to a buffer size of 128 without issues though.
More profiling required to validate all this theorizing.
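The "pull jobs from a queue" idea described above can be sketched roughly like this. BlockJobs is a hypothetical class, not the actual implementation from my synth: every participating thread (pool workers and the audio callback itself) claims the next job index with an atomic counter until none are left. std::function is used for brevity only; a real-time implementation would probably prefer something that is guaranteed not to allocate.

```cpp
#include <atomic>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical per-block job list. Jobs are prepared outside the audio
// callback; during a block every participating thread calls runAvailableJobs().
class BlockJobs
{
public:
    explicit BlockJobs(std::vector<std::function<void()>> jobsToRun)
        : jobs(std::move(jobsToRun)) {}

    // Rewind for the next block. Call only when no thread is running jobs.
    void reset()
    {
        next.store(0);
        finished.store(0);
    }

    // Claims and runs jobs until none are left; returns how many this thread ran.
    size_t runAvailableJobs()
    {
        size_t ran = 0;
        for (;;)
        {
            const size_t i = next.fetch_add(1);
            if (i >= jobs.size())
                break;

            jobs[i]();
            finished.fetch_add(1);
            ++ran;
        }
        return ran;
    }

    // True once every job of the current block has completed.
    bool allDone() const { return finished.load() == jobs.size(); }

private:
    std::vector<std::function<void()>> jobs;
    std::atomic<size_t> next{0};
    std::atomic<size_t> finished{0};
};
```

The audio callback would wake the workers, call runAvailableJobs() itself, and then wait until allDone() returns true before mixing the results. With this shape the number of threads does not need to match the number of jobs.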

Edit - here’s a better screenshot of Perfetto ui for visualising a realtime audio thread pool threads log in my synth - just for example - this was probably running in debug mode at the time:

Edit 2 - I saw some mention on Stack Overflow that the max threads a process can request on macOS could be either 4096 or 8192 (depending on OS version/silicon)… maybe there’s something specific in Apple documentation about it - I could not find it in a quick search - I wouldn’t be surprised if they don’t publish it in an obvious place.
Max threads my synth engine will ever request is logical cores * 16 + logical cores… so on Mac M1 Air that would be 136… that is in the worst case scenario… typically it’s running between 32 to 64 threads depending on the patch/layers - reallocating any time the voice counts/layers change.

Are these max numbers per plugin instance, or per callback thread from the host? Or per host = process = 1?

Following this with great interest.
Thanks very much for your investigations!

Cool! Thanks for the tip on Perfetto, I had never heard of that before. I’m going to have to try it when figuring this out. That looks super useful.

Maybe I’m lazy, but I see no reason not to use the juce::Thread class with ‘startRealtimeThread’ - since it creates the necessary parameters based on a single milliseconds value.

Unfortunately I’m not using JUCE for this project. If I could I would :cry:

If your thread pool is for real-time processing, then no need to measure anything IMO - if your thread pool functions take longer than the available time in the callback then there is going to be a problem anyway (could probably even set this max time to half the callback time).

Good point on this one, I think I’ll have to play around with making the callback time and the thread time different numbers. Right now I have callBackTime == RealtimeConstraint

When creating a threadpool, it may be best to try to limit to the max logical cores (there’s a JUCE function to get that value)

One other thing to note is that in my experimentation with the OS Audio Workgroups I found this function:
os_workgroup_max_parallel_threads, which recommends a value for the thread count. On my M1 MBP that’s 8.

Thanks for the extremely detailed responses on this. I’ve been banging my head against the wall with these problems and it’s nice to get some external feedback and conversation about it.

Per process, so for a DAW/host it would mean all available threads in the host app process are shared between plugins… unless the host can launch plugins in separate processes as Reaper now offers.
For a standalone app of course it is a single process not affected by other software in terms of max thread allocation (the actual exact numbers for Mac OS and Windows are unknown to me at this point and it might vary by processor type too).
Regardless of the theoretical max threads, it’s probably best to limit in a plugin instance to a ‘reasonable’ number… the host has no direct control over that of course… up to plugin developer to not hog the entire system with one plugin instance.
CLAP plugin format sounds like it will offer a solution to let hosts decide how to process parallel realtime threads from plugins (providing the plugin is architected that way of course)… so that may be ultimately the best solution once CLAP support becomes more common place.

Unfortunately I’m not using JUCE for this project. If I could I would :cry:

Ah… I see… sorry for the assumption.

If your thread pool is for real-time processing, then no need to measure anything IMO - if your thread pool functions take longer than the available time in the callback then there is going to be a problem anyway (could probably even set this max time to half the callback time).

Good point on this one, I think I’ll have to play around with making the callback time and the thread time different numbers. Right now I have callBackTime == RealtimeConstraint

Not sure what you mean by setting callBackTime (presumably you mean for policy.period?) - I mean - for audio it’s fixed by the sample rate & blocksize right? Otherwise I suppose we might think to set policy.computation time to 50% of that period - or however long you think your plugin code is likely to take at the very most vs the callback period - a simple single audio effect plugin might never take more than 1% of the callback period.

When creating a threadpool, it may be best to try to limit to the max logical cores (there’s a JUCE function to get that value)

One other thing to note is that in my experimentation with the OS Audio Workgroups I found this function:
os_workgroup_max_parallel_threads, which recommends a value for the thread count. On my M1 MBP that’s 8.

Yes - I’m not surprised that would be the value - and it’s probably just set to always report based on the number of logical cores (assuming a 2020 M1 Mac pro)… and they hope developers don’t make apps that are too thread-heavy (to avoid performance degradation of other apps/background OS jobs), but I wonder, and would like to do some experiments as to whether for real-time audio specifically - it’s better to have a few threads execute multiple times within an audio callback pulling jobs off a task list until all completed, or just create (outside of the callback) as many threads as there are jobs to do (even if it means double or more than logical cores)… kick them all off at the same time, and let the OS take care of time slicing/scheduling of those threads based on the thread policy/workgroup settings.
(For non-realtime background jobs I can see why it would be fine to just limit a thread pool to the number of logical cores - there is less concern about variation in each job’s processing time.)
I may do some tests/logging to see which is best at a later date, but for now I need to get on with some other product development.

I AM posting on the JUCE forum, after all. Fair assumption to make. I just know this is where the cool audio folks hang out. :sunglasses:

Not sure what you mean by setting callBackTime (presumably you mean for policy.period?) - I mean - for audio it’s fixed by the sample rate & blocksize right?

Yep, you’re totally correct on that count. We do a kind of strange rendering model where it renders in parallel to the Core Audio callback, so we have a bit more flexibility to play with these numbers. But you bring up a good point that it may be worth “going back to basics” and matching our callback time to the one that naturally follows from the sample rate and block size. Good sanity check.

but I wonder, and would like to do some experiments as to whether for real-time audio specifically - it’s better to have a few threads execute multiple times within an audio callback pulling jobs off a task list until all completed, or just create (outside of the callback) as many threads as there are jobs to do (even if it means double or more than logical cores)… kick them all off at the same time, and let the OS take care of time slicing/scheduling of those threads based on the thread policy/workgroup settings.

Definitely an interesting idea! I’ll have to give it a go. I did have one case in which it appears that my realtime threads got permanently demoted. (see the peak render time jump, and the cpu percentage dip at the end of these charts)

Demoted

I figured it was either:

  1. I had been overrunning the constraint I reported for the realtime threads.
  2. I was simply running too many realtime threads.

I’ll have to tweak the params and see what I can find. I’ll report back to this thread when I find anything out :slight_smile:
