Possible crash in ThreadPool


#1

I’ve got a crash in runNextJob:

    if (job != 0)
    {
        JUCE_TRY
        {
            ThreadPoolJob::JobStatus result = job->runJob();

            lastJobEndTime = Time::getApproximateMillisecondCounter();

            const ScopedLock sl (lock);

            if (jobs.contains (job))
            {

The other thread is doing removeAllJobs
I’m not sure about the locking order here.


#2

Can’t see any errors… It marks a job as being active, releases the lock and runs it. removeAllJobs() will see that the job is active, and wait for it to finish. All looks ok AFAICT. The locking order looks fine.


#3

I think it’s a multithreading issue here.
On the line job->runJob(), can the “job” pointer points to garbish because the other thread called removeAllJobs between if (job != 0) and ScopeLock.


#4

…no, I can’t see a way that it could do that.


#5

I only have a backtrace, in which I have 3 threads, the juce Internal timer thread, the job thread, and the message thread.
It’s crashed in runNextJob, and the other thread is in the ThreadPool destructor / removeAllJobs.
From the instruction pointer (eip) it’s somewhere in the JUCE_TRY and catch(…) but I don’t have more information. I only have one job in the pool.
I can’t reproduce it here, seems like it happened once on a multicore test machine.

I can be the runJob that’s throwing something, but none of the my job use exceptions.


#6

Are you sure it wasn’t a crash in your own runJob() code?


#7

The backtrace doesn’t say so. But since I call libraries with frame pointer omission, it’s possible (but usually with FPO, I don’t get any backtrace at all).
What’s strange however, is that the backtrace show “abort” and “raise” just above/before the runNextJob call, which sound like it’s sent a signal by itself (I don’t know the signal code, can be SIGTRAP or any other one).

I’ll add a signal handler for all signal and dump the backtrace in there so I can have more information. This only happened once, so it’s quite hard to reproduce.


#8

Well, I can’t see any threading problems in the code, I don’t really know what else to suggest…


#9

Was able to reproduce, and here’s the complete GDB dump:

Program terminated with signal 6, Aborted.
#0  0x00007ff9915f2f45 in *__GI_raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
64      ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
        in ../nptl/sysdeps/unix/sysv/linux/raise.c
Current language:  auto
The current source language is "auto; currently c".
(gdb) bt
#0  0x00007ff9915f2f45 in *__GI_raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00007ff9915f5d80 in *__GI_abort () at abort.c:88
#2  0x00007ff993946131 in unwind_cleanup (reason=<value optimized out>, exc=<value optimized out>) at unwind.c:111
#3  0x00000000005360d6 in juce::ThreadPool::runNextJob (this=0xaba440) at ../../src/threads/juce_ThreadPool.cpp:389
#4  0x0000000000536366 in juce::ThreadPool::ThreadPoolThread::run (this=0x10d7da0) at ../../src/threads/juce_ThreadPool.cpp:87
#5  0x0000000000537178 in juce::Thread::threadEntryPoint (thread=0x10d7da0) at ../../src/threads/juce_Thread.cpp:65
#6  0x00000000005372f9 in juce::juce_threadEntryPoint (userData=0x10d7da0) at ../../src/threads/juce_Thread.cpp:88
#7  0x0000000000669ee6 in juce::threadEntryProc (value=0x10d7da0) at ../../src/native/linux/juce_linux_Threads.cpp:40
#8  0x00007ff99393f73a in start_thread (arg=<value optimized out>) at pthread_create.c:300
#9  0x00007ff99168c69d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#10 0x0000000000000000 in ?? ()
(gdb) i th
  2 Thread 3370  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:220
* 1 Thread 3374  0x00007ff9915f2f45 in *__GI_raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
(gdb) t 2
[Switching to thread 2 (Thread 3370)]#0  pthread_cond_timedwait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:220
220     ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S: No such file or directory.
        in ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S
Current language:  auto
The current source language is "auto; currently asm".
(gdb) bt
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:220
#1  0x00000000006708f1 in juce::WaitableEventImpl::wait (this=0x10d7d30, timeOutMillisecs=20)
    at ../../src/native/common/juce_posix_SharedCode.h:112
#2  0x000000000066671f in juce::WaitableEvent::wait (this=0xaba4a8, timeOutMillisecs=20)
    at ../../src/native/common/juce_posix_SharedCode.h:166
#3  0x0000000000535ca5 in juce::ThreadPool::removeAllJobs (this=0xaba440, interruptRunningJobs=true, timeOutMs=4000,
    deleteInactiveJobs=false, selectedJobsToRemove=0x0) at ../../src/threads/juce_ThreadPool.cpp:297
#4  0x0000000000535372 in ~ThreadPool (this=0xaba440, __in_chrg=<value optimized out>) at ../../src/threads/juce_ThreadPool.cpp:119
#5  0x00007ff9915f7412 in __run_exit_handlers (status=0, listp=0x7ff99190f4c8, run_list_atexit=true) at exit.c:78
#6  0x00007ff9915f7465 in *__GI_exit (status=17661236) at exit.c:100
#7  0x00007ff9915dfac4 in __libc_start_main (main=<value optimized out>, argc=<value optimized out>, ubp_av=<value optimized out>,
    init=<value optimized out>, fini=<value optimized out>, rtld_fini=<value optimized out>, stack_end=0x7fff1491c398)
    at libc-start.c:254
#8  0x000000000040ea69 in _start ()

Reading the unwind_cleanup code (very short), the comment, above the only abort() in the function, says this:

/* We're performing pthread_exit or something else that isn't normal cancellation, so just die.  */

I’m digging deeper to understand how to solve this.
Seems like the pthread_exit and C++ exception don’t work well together, please read: http://gcc.gnu.org/ml/gcc/2003-12/msg00743.html


#10

So this is all being called during a static object’s destructor, after main() has returned? You should have deleted any major objects like thread pools long before it gets that far. Use DeletedAtShutdown instead.


#11

Ok, can you be clearer here? What kind of objects have to be deleted before the GUI exits ?
From the static objects I’m using: String, ThreadPool, CriticalSection.
I didn’t thought that they were related to GUI in anyway, but I’m probably wrong.


#12

This is what I had, in my main component destructor I used to called ThreadPool::removeAllJobs.
One of the jobs was trying to take the message manager lock and deadlocked here. The thread pool instance is static (which seems bad from what you said).

Now, I’ve made this:

  1. I’ve created a ThreadPoolCleaner class which derive from DeletedAtShutdown
  2. In the destructor, I’m setting a flag “shuttingDown” and removing all jobs of the ThreadPool instance.
  3. The job taking the lock now check the flag, and avoid blocking in that case.

It’s working, but still ThreadPool instance is still deleted at static deletion stage (but it’s empty by that time).
The reason why the ThreadPool is static, is that the job comes from a database and starts as soon as the DB link is ready (somewhere on the startup stage, in initialise, but before the main window is created).
They are lengthy (non-gui related) jobs so they are executing in the background (hence the threadpool).
When a job finish, it reflect this by using the asynchronous sendChangeMessage() (for the GUI). I could have used a WaitableEvent instead and pooled this in a timer loop, but I thought the asynchronous stuff was simpler.

So the question is, since the job will exit without taking MML and calling the sendChangeMessage() function, is it ok to use ThreadPool that way ?


#13

Almost none of the Juce classes are designed to work outside of the juce_initialise/juce_shutdown period. You’d probably be ok to use a static String or CriticalSection (assuming the OS calls that they use will work ok in that situation), but threadpool? No chance! It uses Thread, which uses all kinds of static data that will have been wiped during the juce_shutdown call.

Basically: use singletons and make them DeletedAtShutdown. And that’s true in general for c++, not just for my code. It’s pretty much impossible to build complex sets of objects that init/destruct themselves statically in a safe way - the c++ std lib has to resort to all kinds of tricks to make it work. Better just to keep things simple and avoid it.


#14

[quote=“X-Ryl669”]I’ve got a crash in runNextJob:

    if (job != 0)
    {
        JUCE_TRY
        {
            ThreadPoolJob::JobStatus result = job->runJob();

            lastJobEndTime = Time::getApproximateMillisecondCounter();

            const ScopedLock sl (lock);

            if (jobs.contains (job))
            {

The other thread is doing removeAllJobs
I’m not sure about the locking order here.[/quote]

Parsing the code to understand the question, I can’t help but noticing that the ScopedLock is third in terms of code lines. It’s my understanding that it would actually be created first, before any code runs (or else it could be re-ordered to run anywhere, which is way worse).

Jules, if that’s the intent (lock the pool before running the job), you might want to consider doing what I’ve been doing - placing the ScopedLock definition right at the top of the block it affects, so it represents the actual execution order.

Bruce


#15

The code is correct, since the ThreadPool never delete a job that is marked as active (few lines above, with lock taken). It removes it from the array, but doesn’t delete it.
I agree that this code looks strange.


#16

Yeah, the code is correct - it has to unlock the pool before the job runs, otherwise nothing else would be able to access the pool while a job is running.


#17

Did you see my point?

Is run jobs definitely going to happen before the scopedlock gets created? I tend to put my scopedXXX as first line in a fresh code block so I can track where I think it happens, but you seem to have a different (deeper, I’d guess) understanding.

Bruce


#18

Yep. The compiler will keep the order of construction the same as the way it’s written in the code.


#19

Sorry for uping this thread but it seems to be always true.
Indeed, I have this type of code running :

  • Main thread creates a threadpool containing jobs
  • Jobs can create threadpool to run jobs
  • Each job ends with jobHasFinishedAndShouldBeDeleted status in order to be deleted from pool
  • When jobs end, parent job exits from runJob()
    At this point, when job is deleted (~job()), the threadpool is also deleted. But its threads keep running, and when they are asked to be deleted, they popup an assertion error because of they are killed by force.
    I do not know what’s going on internally, but when threadpool is deleted, its own threads are still running and do not want to exit…
    Does someone can help me ? Am I misleading something ?

NB. : excuse me for my poor english, but I am french… that does not excuse everything but I try to speak clearly :wink:

EDIT : Finally it seems that in case where there are many (many^10) threads, at the deletion time of threadpool, its threads are not priority so they are not deleted properly. I set priority to 1000 in order to try this theory, and its seems to work as this. Waiting for your thoughts about that.