A fix for those Threads and Timers with static storage duration that cause hangs on quit

quick recap: one example of this problem has been faced when TimerThread has been changed from being DeletedAtShutdown to being managed by a SharedResourcePointer, see here.

The core of the problem is triggered by a sequence of events like this:

  1. A VST3 plug-in is loaded in a DAW (e.g. REAPER, but in the other topic it was the VST3 helper)
  2. The plug-in instance creates a static Singleton object that has a Thread object.
  3. The plug-in starts that Thread object, which in turns starts an underlying native thread.
  4. At that point, while the thread is still running, the DAW is instructed to quit.
  5. As per exit sequence, all native threads are halted by Windows (but the Thread objects that they were underlying to, are still valid)
  6. In the exit sequence, only after all native threads are halted, the static objects are cleaned up. It’s at this point that the destructor for the Singleton static object is called.
  7. The destructor of Singleton calls stopThread(-1) on its member Thread to stop it, before it gets destroyed.
  8. stopThread(-1) sees that its threadHandle value is still not null, thus it thinks the underlying thread is still running, but…
  9. As the thread was halted “natively” by Windows at the very beginning of the exit sequence, the JUCE code hasn’t had a chance to set the threadHandle to nullptr to reflect that it the thread has stopped.
  10. As a result, the destructor of Singleton waits forever inside stopThread(-1)

It turns out that, inside that Thread::stopThread(), it is actually possible to determine if the underlying native thread is still running or not, via the GetExitCodeThread() function: if it reports that the underlying native thread is STILL_ACTIVE, then it is still running as normal and the usual signaling to the Thread object that it has to exit should proceed, as usual.

But if GetExitCodeThread (threadHandle) reports that an actual exit code is available (it reports a value that differs from STILL_ALIVE), then it means that we are precisely in the condition described by the sequence above: we still have a valid threadHandle to the underlying native thread, but that thread has been halted by Windows.

Note that the threadHandle that our Thread has is still valid, even if Windows has halted the native thread: I have read in the docs (I’ll search for the exact URL, I cannot find it at the moment) that the underlying native object is only deleted when all the handles to it are closed, and our Thread has not yet closed its threadHandle.

With all the above being said, I have implemented a proof of concept fix directly in Thread::stopThread():

bool Thread::stopThread (const int timeOutMilliseconds)
{
    // agh! You can't stop the thread that's calling this method! How on earth
    // would that work??
    jassert (getCurrentThreadId() != getThreadId());

    const ScopedLock sl (startStopLock);

    if (isThreadRunning())
    {
        // --------------- START PATCH - YFEDE ----------------------- 
#if JUCE_WINDOWS

        /* Is the underlying native thread REALLY still running, or has it been halted by Windows,
           for example because the DAW was terminated? */

        DWORD threadExitCode = 1234;    // some value just to see if the call below changes it
        const auto [[maybe_unused]] success = GetExitCodeThread (threadHandle, &threadExitCode);
        jassert (success != 0);

        if (threadExitCode != STILL_ACTIVE)
        {
            DBG ("Windows has already halted the native thread for juce::Thread with name " + getThreadName ().quoted ());
            closeThreadHandle ();  // the thread is actually already terminated, so we clean up to reflect that state
            return true;
        }
#endif 
        // --------------- END PATCH - YFEDE ----------------------- 

        signalThreadShouldExit();
        notify();

        if (timeOutMilliseconds != 0)
            waitForThreadToExit (timeOutMilliseconds);

        if (isThreadRunning())
        {
            // very bad karma if this point is reached, as there are bound to be
            // locks and events left in silly states when a thread is killed by force..
            jassertfalse;
            Logger::writeToLog ("!! killing thread by force !!");

            killThread();

            threadHandle = nullptr;
            threadId = {};
            return false;
        }
    }

    return true;
}

I also have a proof of concept project in place where it shows that this makes a difference and solves the problem, it works both with JUCE 7 and with the most recent JUCE 8 on develop.
If you need to have a look at it, I need to tidy it up a little.

With this patch in place, I think Threads are allowed to be owned by objects with static duration. And as a corollary, maybe that regains us the possibility to also have Timers in objects with static duration as well, which is a feature that has been lost when TimerThread was switched from DeletedAtShutdown to SharedResourcePointer.

2 Likes

Thanks for sharing @yfede I’ll try and take a look at this at some point. To be clear though having a Timer with static thread duration should work now, no? There is an internal object that inherits from DeletedAtShutdown that triggers the thread to shutdown properly to avoid this issue. Timers can’t run without the MessageManager anyway so I think even after this is applied the behaviour should be the same for any Timers. It would however be a more generic solution for a Thread in general.

Taking a closer look maybe you needed a solution on JUCE 7?

Yes, and the issue I’m facing is actually twofold. I’m porting a project from JUCE 7.0.5 to the lastest commit in the JUCE 7 branch. This update of JUCE version has been paired with some refactoring of the project codebase, to improve performance.

That refactoring is very elegant and effective, but it has certainly introduced a case of “Thread used by a static singleton”. The fact that such pattern causes hangs on Windows has been discovered only after the change was completed, because development happens mainly on macOS.
My findings above, about the possibility of cleanly exiting from such Threads, is mainly aimed at salvaging this refactoring.

Regarding TimerThread specifically, my remarks above are based on what I have read in the thread I have linked. But I may have mistakenly understood from there that having Timers in static singletons is to avoid, because now (with SharedResourcePointer) it can cause the same hangs on exit that happen for Threads.
On this subject of Timers, I will check very soon if that happens, because there are a bunch of those “Timers in singletons” as well in the same project.

For now, can you confirm that the last JUCE 7 commit (16dac4ae) is still at the “state of the art” with regard to avoiding hangs due to TimerThread? The only differences I could find in this area between that commit and the current JUCE 8 develop are:

  1. usage of ThreadSafeListenerList in ShutdownDetector
  2. an assert in ~TimerThread()

and to me none of the above should constitute a difference in behaviour between latest JUCE 7 and current JUCE 8 with regards to this subject, correct?

Update: I have been able to reproduce the same exact hang issue on exit also with the sole use of Timers, not only with Threads.

It happens the same way both in the last JUCE 7 commit and in the lastes JUCE 8 develop (7ba1107d at the moment of writing), so even with the very latest changes, the problem still happens.

Steps to reproduce using REAPER 7.27 on Windows 10:

  1. Check that this setting in REAPER is unchecked, as per its default. If it was checked, uncheck it and restart REAPER because it doesn’t pick it up until the next app launch. If it is checked, it seems the hang is much less likely to happen.

  2. Unzip this attached plug-in project:
    ExperimentPlugin.zip (97.5 KB)

  3. Put a copy of latest JUCE 7 (commit “16dab”, but it also happens with latest JUCE 8, see caveat at the end) in a directory named “juce” that is a sibling of the “ExperimentPlugin” directory contained in the ZIP

  4. Build the VST3 with Visual Studio 2022, launch REAPER in the debugger and create one instance of the plug-in

  5. Save the REAPER project at this point, so it doesn’t ask for confirmation on exit (will explain this later)

  6. Click the button “start MyTimer in singleton” on the UI. Notice the counter starts counting

  7. Close REAPER

  8. At this point, observe that the Visual Studio debugger doesn’t exit, it remains in execution.

  9. Hitting the “pause” button of the debugger, reveals that execution is stalled on stopThread(-1) in ~TimerThread()

The hang may not happen 100% of the times but out of 10 tries, at least 7 times it does happen. It seems to me on my computer that the sooner REAPER is closed after the Timer has started, the more likely it is to fail to terminate properly.

I have caught it happening on video:

The fact that it doesn’t happen 100% of the time makes me think that there might be a race condition at play, but with my patch for Threads described in the first post, even this hang disappears. I’m sure that it’s the effect of it, because I see the output of the DBG() in my if being emitted.

This hang issue is reproduced the same way also with the lastest JUCE 8 (7ba1107d at the time of writing), with the caveat that the following jassert at the beginning of ~TimerThread must be commented out. When it is present, the debugger breaks on it (as expected) and that seems to suffice to defuse the race condition that makes the hang happen. With my patch in place, the jassert is no longer necessary because the TimerThread outliving the system event queue is supported.

Yes, please have a look at this with high priority. Some users on Windows already reported that our latest VST3 plugins do not pass the plugin scan anymore when they install our JUCE 8 updated plugins. Those plugins worked for years without any problems.
We start a timer in the main constructor of the plugin. We need this to process MIDI automation in the UI thread.

Unfortunately, it’s one of the bugs that is hard to reproduce and I’m not 100% sure that the new static timer code is the problem. But it makes absolute sense with the bug report from @yfede.

I also noticed the Timer singleton code and I was very disappointed that we go the singleton road here. So many side effects that can happen…

Edit: I need a fast solution here. Our plugins are already released and downgrading is not possible because of the new Font constructor. I will use @yfede fix as a workaround.

I am thoroughly testing my fix above and, while it is a step in the right direction in my opinion, there are cases in which it causes GetExitCodeThread() to reporting an error.
That happens when the handle was closed and set to nullptr by the thread beign stopped returning from run(), while the thread calling stopThread() is between the line where the thread handle is tested by isThreadRunning() and the line where it is used in GetExitCodeThread().

That’s a race condition that is not deepely problematic per se, but I’m working on a fix that also prevents this issue

1 Like

I haven’t been able to reproduce the issue myself but I think I’ve got an idea of what is happening in the Timer and I’ve managed to simulate the same issue.

What I think happens is

  • All the DeletedAtShutdown objects get deleted
  • The TimerThread is notified of the event via the ShutdownDetector
  • The TimerThread notifies the Thread to stop, this is asynchronous!
  • Windows gets to the point that it kills any threads still running, at this point the run() method of the TimerThread hasn’t quite finished so it’s forcibly killed before there is a chance to close the thread handle
  • Windows then calls the destructors and we’re stuck waiting for a thread handle to become nullptr that never does

I was able to simulate this by just adding an extra sleep at the end of the run() function in TimerThread.

So I suspect the race is between the thread completing by calling closeThreadHandle(), and Windows killing the thread. I could be wrong though.

It might be possible to stop the TimerThread synchronously from the callback but something in the back of my mind is telling me we did it asynchronous for a good reason.

I think fixing this issue in Thread is sensible and using GetExitCodeThread() certainly seems like it could be part of the solution, there’s a few considerations to take into account though.

One thing you might be able to do without changing JUCE is implement your singleton using the JUCE macros and have your singleton inherit from DeletedAtShutdown. That seems like a simple enough solution for now.

Another alternative have you considered using SharedResourcePointer instead? If the Processor had an instance it would still be kept alive for the lifetime of the plugin and as under the hood a static is involved I think it would also be shared across plugin instances if that’s required.

1 Like

IMO a plugin instance should be a closed independent unit. Not sharing anything with other plug-in instances. I think the complexity should be kept as low as possible.

The base UI and base Processor class could be capable of creating timer instances. Something like “createTimer()” and we would get a Timer interface and pass a timer reference to the object that needs it. The ITimer class could have a callback function and start-stop. This could be created parallel to the current static inherited timer for a start.

What did we win with that?

  • Simple lifetime handling.
  • No side effects when having multiple plugin instances (bugs, performance).
  • An independent timer runner for UI and Processor if we want.
  • Testability: pass a dummy implementation instead of a real timer for the test.

Just my thoughts. I also see the advantage and magic behind singletons. I just wanted to show what another solution could look like.

Edit: Or we get some kind of factory from the processor or UI base class that lets us create Timers, Action Listeners, Broadcasters, and Callbacks.

There are plugins that provide central control over instances, such as Virtual Sound Stage from
https://www.parallax-audio.com/

I am also working on applying this pattern to one of my commercial plugins, so I hope this will not be made more difficult somehow in Juce 8 (I am still using Juce 7).

In short, I do not agree with the closed independent unit idea.

I generally agree which is why I would encourage users do NOT use singletons wherever possible. Regardless there are times when users will need them and using a SharedResourcePointer is a slightly neater way to manage that IMO as the lifetime is much clearer.

To be clear the issue is that user code (not juce code) is keeping Threads alive with static lifetime duration. Doing this with a Thread in juce and calling stopThread (-1) in the destructor (which I strongly recommend) is what leads to the issue.

In the case of the TimerThread if there are no Timers, there is no TimerThread, as long as you don’t have static Timers you won’t have a problem! Even if we had a TimerThread per Timer (so no statics in juce code), the problem would still exist while we still call stopThread (-1) in the destructor.

If we don’t call stopThread (-1) in the destructor there are other things that are hard to guarantee, and we’ve seen crash logs that suggest these things are happening in very rare cases.

There are no plans to change the API, but as above, I strongly recommend you don’t have a static Timer or Thread! Use a SharedResourcePointer if you can. Have each plugin instance have one SharedResourcePointer and the lifetime will be correctly managed with a single shared instance that is destroyed when the last plugin goes out of scope.

2 Likes

Yes, but using a SharedResourcePointer has a significant drawback in terms of performance in our scenario.

The singleton we need, is slow to create and is only needed when the UI is displayed.
If we put SharedResourcePointers to it only in Editors, then the slow creation happens every time the user reopens the editor after having created it.
If we put one additional instance of the SharedResourcePointer in the Processor, to keep the pointer “alive” even when there are no Editors, that means that the slow creation happens every time a DAW project that contains the plug-in is loaded, even if the UI of the plug-in is not being displayed.

With static singletons we did get the best of both worlds in the sense that the slow creation only happens the first time the singleton is requested, and then it lives across Editor reopen operations without the need of being destroyed and then recreated each time.

Makes sense. Maybe use the built in juce Singleton macros so you can take advantage of DeletedAtShutdown? I realise this means getInstance() returns a pointer rather than a reference, you could possibly write a wrapper around that if you really need to? Not to say this won’t be fixed in JUCE but just to save you having to manage a fork.

Also does the slow object really need to be-a or have-a Timer? could there be something else that is-as or has-a Timer that references the slow object that is static?

One other thought is you could use the same trick TimerThread is currently using with the ShutdownDetector (this is currently private but you could have your own copy), that way you can be notified of a shutdown event and stop any threads.

FWIW, I ended up patching JUCE the following way, to avert the risk of static threads locking up on exit on Windows.
Sure, the best way to solve this would be to convert all our existing static singletons to be either DeletedAtShutdown or SharedResourcePointers, that is for sure.
But our codebase is large, and even after completing one such conversion, the “static thread issue” would remain an open problem. We would still be at risk of it to bite us back the moment we forget about the dogma of “don’t use threads or timers in static singletons” (or the broader “don’t use static singletons”, which feels a little too limiting honestly).

The outline of the changes to JUCE are as follows:

  1. I have encapsulated in a function named Thread::isNativeThreadRunning() the check for whether the native thread underlying the Thread object is still running. At the moment, this adds the GetExitCodeThread() trick on Windows, and on all other platforms it’s equivalent to isThreadRunning().
    This new function is private, as it only needs to be called from inside the Thread class in specific points: stopThread() and waitForThreadToExit(). Besides, using it as a “general purpose” replacement of isThreadRunning() is not recommended, because the latter is realtime safe while the new function is not (on Windows). The functions where the new function is called are already not realtime safe by themselves (they acquire locks or call system functions), so calling the new one doesn’t break any existing realtime safety guarantee of them.

  2. In the new function, I had to use a lock to guard agains changes to threadHandle between where it’s initially tested by isThreadRunning() and its subsequent use in GetExitCodeThread(). There’s a detailed comment in the code explaining why. This also means that I have added a ScopedLock to acquire the same lock in the (luckily few) places where there are assignments to threadHandle.
    I don’t like to add locks but this one seems inevitable. And I’m quite confident that it shouldn’t cause any deadlocks, because in the guarded code there are no loops that may end up waiting indefenetely, and this lock is always taken after the startStopLock, where that happens, so that it’s guaranteed that threads don’t deadlock by acquiring them in the inverse order.

Disclaimer: I only have tested it for a while these days, on Windows and macOS

The full code of the change is in this commit:

Yes, I agree with you that using the JUCE Singleton with its helper macros and SharedResourcePointer where possible is the right way to do it. The change above is only intended as a safety net agains hangs. And the specific project I’m working on at the moment will remain on JUCE 7 for at least the near future, so even a fix committed to JUCE 8 wouldn’t help us much here.

I have contemplated that, but since TimerThread itself is not immune to this issue by using ShutdownDetector, I have decided not to go that route for now.

The ShutdownDetector concept is intriguing though, I wish it will become a public class in the future because that would probably turn out to be helpful also for other use cases. I hope you will consider making it public anyway, regardless of the issue at hand.

I think the only issue here is that it’s trying to stop the thread asynchronously, if you stop your threads with stopThread (-1) as long as you don’t cause some kind of deadlock for yourself where your thread could in turn be waiting on something else I think you should be OK.

I certainly considered it but I wanted to make sure it had proved itself a little first.

1 Like

@yfede could you try the patch below please?

This hasn’t made it through an internal review so it could still change but I thought it worth sharing to make sure this is working for you.

Looking more closely at the docs I wasn’t sure GetExitCodeThread() was the right approach I think calling WaitForSingleObject() on the thread handle allows us to find out if the thread has finished without worrying about some other minor additional edge cases raised in the docs.

I also thought ideally we want isThreadRunning() to return the correct value that indicates the real state of the thread. Unfortunately because we need to be careful the handle isn’t closed while we’re calling these methods it introduces a lock into the method but the contention is only while starting/stopping a thread. Maybe I’ll switch away from this if it’s decided introducing a lock here is problematic for any reason.

I’ve updated Timer so it stops synchronously which I hope fixes the issues you’re seeing there?

JUCE-dev-7cc4770-3267298.patch (5.9 KB)

For stopping your own threads I’ve included another patch below for making the ShutdownDetector public. Again this isn’t approved internally and it may change.

JUCE-dev-6a5fefe-ShutdownDetector: Allow objects to listen for shutdown events.patch (37.5 KB)

Thanks for looking into this, I’m currently busy with other tasks at the moment but I’ll give you feedback ASAP

1 Like

@yfede did you ever get around to testing this? I’ve still got this on an internal MR but it would be good to get feedback that it’s working for you before merging

Sorry, I haven’t had the time to look into it. I’ll do it hopefully on monday and report back within the week

1 Like

Yes, I confirm that the patches you have attached seem to solve the issue on Windows.
I have checked with the test project with the patch applied to both JUCE 8 current develop and to latest JUCE 7 commit.

The lockup on REAPER exit never occurred after 20 attempts done with each JUCE version, whereas with the unpatched code the problem showed up an average ot 5 to 10 times for each round of 20 attempts.

The first patch alone, was also enough to avoid the lockup of a Thread (non-timer) that was running inside a static singleton, i.e. I didn’t even need to implement any custom logic using the ShutdownDetector made public (but that’s certainly a useful addition anyway).

On that subject, it seems that the second patch had a syntax error in that it passed listener references into listener list add() and remove(), which take pointers instead. It was an easy fix (see below) but I’m surprised that the code compiled on your end.

3 Likes