ThreadPool::addJob nuisance asserts and synchronization with re-runnable ThreadPoolJob instances

I have a ThreadPoolJob that I am using for cache management. The need for management arrives asynchronously, and I do not want to consume threadpool time by leaving the job in the pool. So when activity arrives, I reschedule it (::addJob). The issue arises if the ThreadPoolJob is already running, and I have new work for it. There is a race.

If the job is exiting, whether it has asked to be rescheduled or not, the ‘active’ flag will be cleared. The ‘pool’ pointer may or may not be cleared.

The only observable state for the job is ‘active’ via the ::isActive method. And so my rescheduling code can not avoid occasionally hitting the assert at L132 in ThreadPool::addJob.

My solution is either to comment out this assert - and accept that I am calling ‘::addJob’ when it may be unnecessary or make the state I really need to see (ThreadPoolJob::pool) is observable.

My solution is to add ThreadPoolJob::isScheduled
juce_ThreadPool.h L100

/** Returns true if this job is currently attached to a thread pool. */
bool isScheduled() const noexcept					{ return (pool != nullptr); }

This avoids the nuisance assert and ensures that that ::addJob is only called if it is necessary. There is however still a race that will have rescheduling being missed when there is new work for the ThreadPoolJob to do, and the ThreadPoolJob has already exited and is yet to be removed from the thread pool (ThreadPoolJob::pool assigned to nullptr).

So - I think there may be a higher level design issue that this case exposes; that a ThreadPoolJob of this type (reschedulable) itself may not know that it is done when it finishes running due to asynchronous activity in another thread. In this case a better solution may be a ThreadPool::ensureReschedule call that guarantees the specified ThreadPoolJob is not removed from the pool before it has had one more kick of the can through its ::runJob method.

For my purposes - the ::isScheduled call is a minimally intrusive fix. The proposed ::ensureReschedule mechanism needs a bit more thought and more extensive changes.

Curious to hear other thoughts on this.

ThreadPoolJob::isRunning is returning the expected result here - this method only tells you if the job is currently running.

The ThreadPool methods (contains, isJobRunning and others) all are synchronised, so there is no race condition present.

Thanks for the response.

I am not suggesting there is a bug in the code - just that there is a higher level synchronization requirement that I cannot see how to satisfy with the current mechanisms.

Consider a job A that has been scheduled, has run to completion, returned ::jobHasFinished and now is being processed by the tail of ThreadPool::runNextJob. A is an object that has a longer lifetime than a single run through a threadpool - ie. is started with ThreadPool::addJob(A, false).

Asynchronously, new work has arrived that A must process. When A exited its ::runJob method it did not know of the new work, and hence exited with ::jobHasFinished.

The problem is that the managing code must now ensure that A is rescheduled.

The sequence of operations in ThreadPool ::runJob after job completion is:
1: job->isActive → false
2: job is removed from jobs
3: job->shouldStop → true
4: job->pool = nullptr

Only after 4 has occurred will ThreadPool::addJob reschedule the job, otherwise it will be a no-op.

So ::contains and ::isJobRunning are both inadequate as they only give me partial information on where the locus of execution is before 4 and thus can give no guarantee that ::addJob will reschedule the job.

::waitForJobToFinish only guarantees we have completed 2, and hence ::addJob could still end up being a no-op. (And, as a matter of principle, I do not want to wait…)

So the race exists at minimum between the completion of 2 and completion of 4, where the state is unobservable and hence ::addJob may or may not do what is intended - ie. reschedule the job.

Does this make it clearer?

My work-around, to wrap the invocation of ‘A’ inside an object which can be fired/forgotten into the threadpool. This mechanism uses an atomic to observe if an instance of ‘A’ is available to do the work (has not yet exited) and schedules a new instance of the wrapped ‘A’ if not.

If you have another solution - I am wide open. It is possible that I am missing something.

Both contains and isJobRunning use a CriticalSection to synchronise their result with runNextJob. All of the steps 1 to 4 are performed whilst a corresponding lock is held, so I can’t see how you would run into the problem you are describing.

Got it. I missed that line in inspecting the code. Thanks.

So to summarize all of this - looking from the ThreadPoolJob:: perspective, I cannot get the information I need to ensure correct rescheduling. Looking from the ThreadPool:: perspective, I can.

Many thanks for sticking with me through this - my first solution was causing a subtle bug in my application which took me a while to sift through.

Cheers
Don