InterprocessConnection crashes occasionally during destruction

Here’s some very simple code that reproduces it:

class Connection : public juce::InterprocessConnection
{
public:
    Connection(bool isClient)
    {
        if (isClient)
        {
            connectToSocket("127.0.0.1", 1111, 1000);
            sendMessage(juce::MemoryBlock("1111", 4));
        }
    }

    void connectionMade() override{}
    void connectionLost() override {}

    void messageReceived(const juce::MemoryBlock &message) override
    {
        juce::Thread::sleep(50);
        sendMessage(message);
    }
};

class Server : public juce::InterprocessConnectionServer
{
public:
    Server()
    {
        beginWaitingForSocket(1111);
        clientConnection.reset(new Connection(true));
    }

    juce::InterprocessConnection * createConnectionObject() override
    {
        return serverConnections.add(new Connection(false));
    }

private:
    std::unique_ptr<Connection> clientConnection;
    juce::OwnedArray<Connection> serverConnections;
};

class TestThread : public juce::Thread
{
public:
    TestThread() : Thread("TestThread")
    {
        startThread();
    }

    ~TestThread()
    {
        stopThread(-1);
    }

    void run() override
    {
        while (!threadShouldExit())
        {
            Server server;
            sleep(1000);
        }
    }
};

After you create an instance of TestThread, sooner or later you’ll get a crash with the following stack trace:

|>|InterprocessConnectionCrashTest.exe!std::_Load_seq_cst_1(volatile unsigned char * _Tgt=0xddddddddddddddee) line 296|C++|
|---|---|---|
| |InterprocessConnectionCrashTest.exe!std::_Atomic_load_1(volatile unsigned char * _Tgt=0xddddddddddddddee, std::memory_order _Order=memory_order_seq_cst) line 338|C++|
| |InterprocessConnectionCrashTest.exe!std::atomic_load_explicit(const std::_Atomic_bool * _Atom=0xddddddddddddddee, std::memory_order _Order=memory_order_seq_cst) line 495|C++|
| |InterprocessConnectionCrashTest.exe!std::atomic_load(const std::_Atomic_bool * _Atom=0xddddddddddddddee) line 506|C++|
| |InterprocessConnectionCrashTest.exe!std::_Atomic_bool::operator bool() line 641|C++|
| |InterprocessConnectionCrashTest.exe!juce::StreamingSocket::write(const void * sourceBuffer=0x0000024a77077a00, int numBytesToWrite=12) line 487|C++|
| |InterprocessConnectionCrashTest.exe!juce::InterprocessConnection::writeData(void * data=0x0000024a77077a00, int dataSize=12) line 170|C++|
| |InterprocessConnectionCrashTest.exe!juce::InterprocessConnection::sendMessage(const juce::MemoryBlock & message={...}) line 162|C++|
| |InterprocessConnectionCrashTest.exe!Connection::messageReceived(const juce::MemoryBlock & message={...}) line 24|C++|
| |InterprocessConnectionCrashTest.exe!juce::DataDeliveryMessage::messageCallback() line 259|C++|
| |InterprocessConnectionCrashTest.exe!juce::InternalMessageQueue::dispatchMessage(juce::MessageManager::MessageBase * message=0x0000024a77090e10) line 202|C++|
| |InterprocessConnectionCrashTest.exe!juce::InternalMessageQueue::dispatchMessages() line 240|C++|
| |InterprocessConnectionCrashTest.exe!juce::InternalMessageQueue::dispatchNextMessage(bool returnIfNoPendingMessages=false) line 124|C++|
| |InterprocessConnectionCrashTest.exe!juce::MessageManager::dispatchNextMessageOnSystemQueue(bool returnIfNoPendingMessages=false) line 266|C++|
| |InterprocessConnectionCrashTest.exe!juce::MessageManager::runDispatchLoop() line 128|C++|
| |InterprocessConnectionCrashTest.exe!juce::JUCEApplicationBase::main() line 266|C++|
| |InterprocessConnectionCrashTest.exe!WinMain(HINSTANCE__ * __formal=0x00007ff6d9080000, HINSTANCE__ * __formal=0x0000000000000000, char * __formal=0x0000024a73dd42d9, int __formal=10) line 106|C++|
| |InterprocessConnectionCrashTest.exe!invoke_main() line 107|C++|
| |InterprocessConnectionCrashTest.exe!__scrt_common_main_seh() line 288|C++|
| |InterprocessConnectionCrashTest.exe!__scrt_common_main() line 331|C++|
| |InterprocessConnectionCrashTest.exe!WinMainCRTStartup() line 17|C++|
| |kernel32.dll!BaseThreadInitThunk()|Unknown|
| |ntdll.dll!RtlUserThreadStart()|Unknown|

If you comment out the sleep before sendMessage in Connection::messageReceived, a different failure appears after running for a while: the classic Microsoft Visual C++ Runtime Library Debug Error, “abort() has been called”.
If you then let the program continue to execute, another crash stack appears after a while:

|>|InterprocessConnectionCrashTest.exe!juce::InterprocessConnection::deliverDataInt(const juce::MemoryBlock & data={...}) line 267|C++|
|---|---|---|
| |InterprocessConnectionCrashTest.exe!juce::InterprocessConnection::readNextMessage() line 319|C++|
| |InterprocessConnectionCrashTest.exe!juce::InterprocessConnection::runThread() line 371|C++|
| |InterprocessConnectionCrashTest.exe!juce::InterprocessConnection::ConnectionThread::run() line 29|C++|
| |InterprocessConnectionCrashTest.exe!juce::Thread::threadEntryPoint() line 96|C++|
| |InterprocessConnectionCrashTest.exe!juce::juce_threadEntryPoint(void * userData=0x0000021a6cd8e6f0) line 119|C++|
| |InterprocessConnectionCrashTest.exe!juce::threadEntryProc(void * userData=0x0000021a6cd8e6f0) line 62|C++|
| |ucrtbased.dll!invoke_thread_procedure(unsigned int(*)(void *) procedure=0x00007ff64efa6f60, void * const context=0x0000021a6cd8e6f0) line 92|C++|
| |ucrtbased.dll!thread_start<unsigned int (__cdecl*)(void * __ptr64)>(void * const parameter=0x0000021a6cda9590) line 115|C++|
| |kernel32.dll!BaseThreadInitThunk()|Unknown|
| |ntdll.dll!RtlUserThreadStart()|Unknown|

By the way, I’m on Windows, compiling and running this code with Visual Studio 2017 under the Debug x64 configuration.

I hope the above information helps you track down this bug. Thank you.

Some additions:
To control the variables and narrow down the problem, I pulled the server out into a separate process. This server is only responsible for looping back the packets sent by the client. The code for this simple server looks like this:

class LoopbackConnection : public juce::InterprocessConnection
{
public:
    void connectionMade() override
    {
        juce::Logger::writeToLog("Connection made!");
    }

    void connectionLost() override
    {
        juce::Logger::writeToLog("Connection lost!");
        delete this;
    }

    void messageReceived(const juce::MemoryBlock& message) override
    {
        juce::Logger::writeToLog("msg: " + message.toString());
        sendMessage(message);
    }
};

class LoopbackServer : public juce::InterprocessConnectionServer
{
public:
    LoopbackServer()
    {
        juce::Logger::writeToLog("listen to port 1234 " + juce::String(beginWaitingForSocket(1234) ? "succeeded" : "failed"));
    }

    juce::InterprocessConnection* createConnectionObject() override
    {
        return new LoopbackConnection;
    }
};

At the same time, I simplified the code of the test client:

class ClientConnection : public juce::InterprocessConnection
{
public:
    ClientConnection()
    {
        if (connectToSocket("127.0.0.1", 1234, 1000))
            sendMessage(juce::MemoryBlock("1111", 4));
    }

    void connectionMade() override{}
    void connectionLost() override {}

    void messageReceived(const juce::MemoryBlock &message) override
    {
        juce::Thread::sleep(1);
        sendMessage(message);
    }
};

class TestThread : public juce::Thread
{
public:
    TestThread() : Thread("TestThread")
    {
        startThread();
    }

    ~TestThread()
    {
        stopThread(-1);
    }

    void run() override
    {
        while (!threadShouldExit())
        {
            ClientConnection conn;
            sleep(50);
        }
    }
};

With the test-client code above, the crashes mentioned before still occur.
This strongly suggests a bug inside juce::InterprocessConnection, and I hope it can be fixed.

It seems fishy to me that you receive a reference to a message and immediately send that same reference back. At some point the original message will probably be deleted? Try creating a copy of the message before sending it back (see the sketch below).
I’ve been using InterprocessConnection for years without issues, so I’m almost sure there is no bug in there.
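
Something along these lines, just as a sketch of the idea:

void messageReceived (const juce::MemoryBlock& message) override
{
    juce::MemoryBlock copy (message);   // take an explicit copy of the received block
    sendMessage (copy);                 // echo the copy back rather than the original reference
}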

The implementation of the sendMessage method has no asynchronous behavior: the message is sent and then the function returns. So as long as the referenced block is not destroyed while messageReceived is running, passing this const lvalue reference straight to sendMessage cannot cause problems. And if the block could become invalid in the middle of messageReceived, that would itself obviously be a bug.

Ok, while that might work: in your thread, you create a new connection every 50 milliseconds. I assume what you wanted is

void run() override
{
    ClientConnection conn;

    while (!threadShouldExit())
        sleep(50);
}

Well, maybe my title wasn’t accurate enough. What I want to express is that InterprocessConnection occasionally crashes during destruction, so “create a new connection every 50 milliseconds” is test code I wrote deliberately. I have corrected the title to a more accurate description.

The code above does reproduce the problem I described. You may not have encountered it because your program doesn’t destroy objects of this class often enough during a single run to trigger this occasional (possible) race condition.

So, is anyone going to look into fixing this problem?

Alright, sorry I misunderstood your intention. And you are right, I think: I could reproduce a race condition where your ClientConnection destructor had already been called, but the WeakReference to the base-class InterprocessConnection object that is used in InterprocessConnection.cpp (lines 211, 213 and 258 in the current master branch) to prevent calling pure virtual functions on a deleted object was still “alive”. I guess this is something the JUCE team needs to fix.
EDIT: what I’d propose is a boolean (something like “stopCallingVirtuals”) in the base InterprocessConnection class that is set to false at construction and to true at the beginning of the destructor, and to check that flag in the messageCallbacks on the lines mentioned above.
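
A rough sketch of what I mean, with invented names (this is not the actual JUCE source, just the shape of the idea):

#include <atomic>

// The base class sets the flag at the start of its destructor, and every queued
// message callback checks it before invoking any virtual function on the object.
struct ConnectionBase
{
    virtual ~ConnectionBase() { stopCallingVirtuals.store (true); }
    virtual void messageReceived() {}

    std::atomic<bool> stopCallingVirtuals { false };
};

// What a queued callback (e.g. DataDeliveryMessage::messageCallback) would then do;
// 'connection' stands in for the WeakReference the existing code already checks:
inline void runQueuedCallback (ConnectionBase* connection)
{
    if (connection != nullptr && ! connection->stopCallingVirtuals.load())
        connection->messageReceived();
}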

I tried this out, and was able to avoid the issue by adding a call to disconnect() in the destructor of ClientConnection.

    ~ClientConnection() override { disconnect(); }

After making this change, this ran on Windows for a few minutes without issues. On macOS, I built the client and server with thread sanitizer, and this didn’t trigger any issues.

Could you try making this change and see whether it fixes the issue for you? If it does, I’ll update the docs to make it clearer that connections need to be closed before the derived class is destroyed.

No, this is still not robust. I tested the method you mentioned and found that InterprocessConnection::disconnect is also unsafe: although problems occur less often than when it is not called, they still occur.

On the first run, it hit the “!! killing thread by force!!” assertion after about 29 seconds. Looking at the call stack, I found that “thread->stopThread (4000);” had been called from InterprocessConnection::disconnect. If the socket thread cannot be stopped within 4 seconds, I can only conclude that there is not just a race condition but also some form of deadlock.

On the second run, the “Microsoft Visual C++ Runtime Library Debug Error! abort() has been called” error was triggered after about 12 seconds.

On the third run, the same error was triggered after roughly 4 minutes 35 seconds.

Reducing the sleep time in TestThread::run also makes the problem reproduce sooner.

In addition, one of the runs triggered a new crash stack. The following is the complete call stack:

 	ntdll.dll!RtlRaiseStatus()	Unknown
 	ntdll.dll!RtlLeaveCriticalSection()	Unknown
>	InterprocessConnectionCrashTest.exe!juce::CriticalSection::exit() line 48	C++
 	InterprocessConnectionCrashTest.exe!juce::GenericScopedTryLock<juce::CriticalSection>::~GenericScopedTryLock<juce::CriticalSection>() line 228	C++
 	InterprocessConnectionCrashTest.exe!juce::SocketHelpers::waitForReadiness(std::atomic<int> & handle={...}, juce::CriticalSection & readLock={...}, bool forReading=true, int timeoutMsecs=100) line 327	C++
 	InterprocessConnectionCrashTest.exe!juce::StreamingSocket::waitUntilReady(bool readyForReading=true, int timeoutMsecs=100) line 496	C++
 	InterprocessConnectionCrashTest.exe!juce::InterprocessConnection::runThread() line 342	C++
 	InterprocessConnectionCrashTest.exe!juce::InterprocessConnection::ConnectionThread::run() line 29	C++
 	InterprocessConnectionCrashTest.exe!juce::Thread::threadEntryPoint() line 96	C++
 	InterprocessConnectionCrashTest.exe!juce::juce_threadEntryPoint(void * userData=0x0000027a401c5840) line 119	C++
 	InterprocessConnectionCrashTest.exe!juce::threadEntryProc(void * userData=0x0000027a401c5840) line 62	C++
 	ucrtbased.dll!invoke_thread_procedure(unsigned int(*)(void *) procedure=0x00007ff725076f60, void * const context=0x0000027a401c5840) line 92	C++
 	ucrtbased.dll!thread_start<unsigned int (__cdecl*)(void * __ptr64)>(void * const parameter=0x0000027a3cee5520) line 115	C++
 	kernel32.dll!BaseThreadInitThunk()	Unknown
 	ntdll.dll!RtlUserThreadStart()	Unknown

So calling disconnect before destruction still does not solve the problem. Please see whether there is another way to fix this bug, thank you.

Could you post the full source for your test cases please? I imagine my test harnesses are similar to yours, but reproducing the issue will be easier if everything is identical.

Of course, but LoopbackServer uses some user modules I wrote myself, mainly some Logger subclasses, and providing all of them would add more dependencies to deal with. Since the server program itself has always been stable and has never crashed, I will only provide the complete code of the test client (currently all the crashes occur on the client).

The test client is built based on the initial Hello World GUI project generated by Projucer.
I didn’t modify anything in Main.cpp or MainComponent.cpp, only added some very simple code to MainComponent.h. The following is the complete content of MainComponent.h:

#pragma once

#include <JuceHeader.h>

class ClientConnection : public juce::InterprocessConnection
{
public:
    ClientConnection()
    {
        if (connectToSocket("127.0.0.1", 1234, 1000))
            sendMessage(juce::MemoryBlock("1111", 4));
    }

    ~ClientConnection() override
    {
        disconnect();
    }

    void connectionMade() override{}
    void connectionLost() override {}

    void messageReceived(const juce::MemoryBlock &message) override
    {
        juce::Thread::sleep(1);
        sendMessage(message);
    }
};

class TestThread : public juce::Thread
{
public:
    TestThread() : Thread("TestThread")
    {
        startThread();
    }

    ~TestThread() override
    {
        stopThread(-1);
    }

    void run() override
    {
        while (!threadShouldExit())
        {
            ClientConnection conn;
            sleep(5);
        }
    }
};

//==============================================================================
/*
    This component lives inside our window, and this is where you should put all
    your controls and content.
*/
class MainComponent  : public juce::Component
{
public:
    //==============================================================================
    MainComponent();
    ~MainComponent() override;

    //==============================================================================
    void paint (juce::Graphics&) override;
    void resized() override;

private:
    //==============================================================================
    // Your private member variables go here...
    TestThread testThread;

    JUCE_DECLARE_NON_COPYABLE_WITH_LEAK_DETECTOR (MainComponent)
};

I hope the above helps you track down this bug.

From time to time I receive some reports from customers that my app crashes when reinitializing a client/server connection. I was thinking the problem was related to the buggy network driver on the embedded device where the app is running (on Yocto Linux, not Windows). But it looks similar to the problem described here…

It seems to me the issue is that although there are WeakReferences to the IPC object in the InterprocessConnection class’s DataDeliveryMessages and ConnectionStateMessages, the referenced object can be destroyed right after the WeakReference nullptr check, so that any IPC method calls after that will crash. The only way I could reliably fix this here was to add an atomic integer counter for the number of open messages, incremented with each posted DataDeliveryMessage and ConnectionStateMessage and decremented in their messageCallbacks, and then wait for that counter to reach zero before leaving the ClientConnection destructor (sketch below). Feels kind of hacky though.
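
Roughly what I did, stripped down to the idea (the names below are invented for this sketch; it is not JUCE code):

#include <atomic>
#include <chrono>
#include <thread>

// Each posted DataDeliveryMessage / ConnectionStateMessage increments the counter,
// each completed messageCallback decrements it, and the owning object's destructor
// blocks until the counter drains back to zero before the object goes away.
class PendingCallbackCounter
{
public:
    void messagePosted()    { pending.fetch_add (1); }
    void callbackFinished() { pending.fetch_sub (1); }

    // Called at the top of the ClientConnection destructor in my workaround.
    void waitUntilNoPendingCallbacks() const
    {
        while (pending.load() > 0)
            std::this_thread::sleep_for (std::chrono::milliseconds (1));
    }

private:
    std::atomic<int> pending { 0 };
};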

Also having issues with the IPC crashing on destruction.

Hi, I’ve pushed a potential fix to develop. I’ve tested it with thread sanitizer and it seems a lot more robust than the previous implementation. Please let us know if you run into any more issues.

This fix is indeed much more stable, but it still doesn’t seem to be a 100% perfect fix. In my test, after running for about 5 minutes it still produced a “Microsoft Visual C++ Runtime Library Debug Error! abort() has been called” error, although it only reproduced once. When I ran it again later, the problem never recurred.

We have judged that this probability may be acceptable, or perhaps we can work around the problem by not using the main thread to process connectionLost while making sure disconnect is called during destruction. Note that in the previous version, even if I didn’t use the main thread, the program could still misbehave if I didn’t call disconnect during destruction. The basic reason is that the destructor sets callbackConnectionState to false before trying to stop the thread; if the thread happens to be in the middle of deliverDataInt at that moment, problems occur. In the previous version, if I didn’t use the main thread and actively called disconnect during destruction, there were no problems.

However, this version still has the same logical inconsistency: the purpose of setting callbackConnectionState to false in the original destructor is to avoid triggering connectionLost during destruction. But if I have to actively call disconnect during destruction, then my connectionLost will always be called, and what InterprocessConnection does in its own destructor becomes meaningless. And if you want to avoid connectionLost being called during destruction, you can only rely on users implementing such logic in their own code. So forcing users to call disconnect manually may not be the best approach; at the very least something like a disconnectWithoutCallback method should be provided, or a mechanism for suppressing the callback when disconnect is called, much like the role NotificationType plays in the GUI module (rough sketch below).
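
What I have in mind is something along these lines; this is purely a hypothetical signature modelled on NotificationType, not an existing JUCE API:

// Hypothetical sketch only: let the caller decide whether disconnect()
// should fire the connectionLost() callback.
enum class DisconnectNotification { sendCallback, dontSendCallback };

class SketchConnection
{
public:
    virtual ~SketchConnection() = default;

    void disconnect (DisconnectNotification notification = DisconnectNotification::sendCallback)
    {
        closeSocketOrPipeAndStopThread();

        if (notification == DisconnectNotification::sendCallback)
            connectionLost();
    }

private:
    virtual void connectionLost() {}
    void closeSocketOrPipeAndStopThread() { /* teardown elided from the sketch */ }
};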

Going back to the crash problem we discussed at the beginning, what I want to know is: is there some reason this crash is unavoidable, so that its probability can only be reduced? If so, I would like to understand the underlying reasons. If not, I still hope we can find a solution that fixes the problem completely.

As for the crash that still occurs very occasionally, and the logical contradiction described above, I hope you can take a look and continue working on these two problems. Thank you very much.

I’ve tried leaving the client running on both Windows and Mac for upwards of 15 minutes and haven’t encountered any issues with the latest version. The Mac version was built with thread sanitizer, so I’d really expect it to warn if there were any remaining issues. The only potential issue I can think of would be if the background thread were to take longer than 4 seconds to stop. Perhaps we can provide an option to wait indefinitely instead. Could you try changing the timeout in the disconnect function to -1 and checking whether that resolves the issue?
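
(Concretely, that would mean changing the “thread->stopThread (4000);” call quoted earlier in the thread to something like the following; this is an approximation of the edit, not the exact library source.)

// Inside juce::InterprocessConnection::disconnect(): wait for the socket thread
// indefinitely instead of force-killing it after 4 seconds.
thread->stopThread (-1);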

Due to the design of the InterprocessConnection class, the disconnect call is strictly necessary. By the time execution reaches ~InterprocessConnection, the derived part of the class will have been destroyed, and any virtual calls to connectionMade, connectionLost or messageReceived at that point will lead to crashes. We can’t leave it to the base class destructor to cancel callbacks, as there may already be callbacks in flight at that point. I see that requiring the call to disconnect may end up triggering more callbacks than previously, though. I’ll add an option to disable the callbacks during disconnect.

Thanks for such a detailed reply. I’m a little busy today; I’ll test what you suggested tomorrow and let you know whether the problem is solved.

I’ve just discovered a new and more important crash bug. When using sockets, InterprocessConnection is now fairly stable. But when I switched to named pipes, I found that even with the new version there are occasional crashes during destruction, even though I don’t use the message thread and explicitly disconnect during destruction. The probability is much higher than with sockets, and I only need to establish a two-way connection, without sending any messages at all, to reproduce the crash.

The following is my modification of the previous test program. It no longer needs an independent server program; just running this program reproduces the crash in a very short time (usually within ten seconds).
The test program is still based on the Hello World GUI project, and again only adds something to MainComponent.h. The following is the entire content of the new MainComponent.h:

#pragma once

#include <JuceHeader.h>

class TestPipeConnection : public juce::InterprocessConnection
{
public:
    TestPipeConnection(bool isCreator) : InterprocessConnection(false)
    {
        if (isCreator) createPipe("1234", -1, true);
        else connectToPipe("1234", -1);
    }

    ~TestPipeConnection() override
    {
        disconnect();
    }

    void connectionMade() override{}
    void connectionLost() override {}

    void messageReceived(const juce::MemoryBlock &message) override
    {
        juce::ignoreUnused(message);
    }
};

class TestConnections
{
public:
    TestConnections()
    {
        creatorPipe.reset(new TestPipeConnection(true));
        connectorPipe.reset(new TestPipeConnection(false));
    }

private:
    std::unique_ptr<TestPipeConnection> creatorPipe, connectorPipe;
};

class TestThread : public juce::Thread
{
public:
    TestThread() : Thread("TestThread")
    {
        startThread();
    }

    ~TestThread() override
    {
        stopThread(-1);
    }

    void run() override
    {
        while (!threadShouldExit())
        {
            TestConnections conn;
            sleep(50);
        }
    }
};

//==============================================================================
/*
    This component lives inside our window, and this is where you should put all
    your controls and content.
*/
class MainComponent  : public juce::Component
{
public:
    //==============================================================================
    MainComponent();
    ~MainComponent() override;

    //==============================================================================
    void paint (juce::Graphics&) override;
    void resized() override;

private:
    //==============================================================================
    // Your private member variables go here...
    TestThread testThread;

    JUCE_DECLARE_NON_COPYABLE_WITH_LEAK_DETECTOR (MainComponent)
};

This code can trigger several different crash stacks at runtime; one of the most frequently seen is this:

 	ntdll.dll!RtlpFlsDataCleanup()	Unknown	Not user code. Symbols loaded.
 	ntdll.dll!LdrShutdownThread()	Unknown	Not user code. Symbols loaded.
 	ntdll.dll!RtlExitUserThread()	Unknown	Not user code. Symbols loaded.
 	KernelBase.dll!FreeLibraryAndExitThread()	Unknown	Not user code. Symbols loaded.
 	ucrtbased.dll!common_end_thread(const unsigned int return_code=0) line 276	C++	Not user code. Symbols loaded.
 	ucrtbased.dll!_endthreadex(unsigned int return_code=0) line 290	C++	Not user code. Symbols loaded.
>	InterprocessConnectionCrashTest.exe!juce::threadEntryProc(void * userData=0x000001a716d6ed10) line 63	C++	Symbols loaded.
 	ucrtbased.dll!invoke_thread_procedure(unsigned int(*)(void *) procedure=0x00007ff6793674a0, void * const context=0x000001a716d6ed10) line 92	C++	Not user code. Symbols loaded.
 	ucrtbased.dll!thread_start<unsigned int (__cdecl*)(void * __ptr64)>(void * const parameter=0x000001a716d77530) line 115	C++	Not user code. Symbols loaded.
 	kernel32.dll!BaseThreadInitThunk()	Unknown	Not user code. Symbols loaded.
 	ntdll.dll!RtlUserThreadStart()	Unknown	Not user code. Symbols loaded.

The strangest crash is in the “JUCE IPC” thread, where execution jumps to an address for which no complete stack frame can be displayed: the entire crash stack has only a single frame with a single address, and no symbol information is available.

These problems are giving me a real headache. Our applications need to use both sockets and pipes, and they all need to run for a long time, which means many of these objects are created and destroyed frequently. I sincerely hope you can help us solve these instability problems. Thank you very much.

Hi, is there any progress on the problem above? I’ve found that calling disconnect, letting the calling thread sleep for a short time (for example 5 milliseconds), and only then finishing the destructor and destroying the object temporarily alleviates the problem, but this workaround is obviously inelegant and hurts performance, and on a slower machine it may no longer be effective. A sketch of the workaround follows.
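
The stop-gap I’m currently using looks roughly like this (sketch only):

~TestPipeConnection() override
{
    disconnect();
    juce::Thread::sleep (5);   // arbitrary, machine-dependent pause to let in-flight callbacks drain
}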

So, will there be a more complete solution to this problem?