Possible hang when waiting for an aborted ThreadPoolJob

Consider this sequence of events on a ThreadPool with one thread:

  • ThreadPool::addJob successfully adds a job to the array of jobs and starts the thread
  • before ThreadPool::runNextJob gets to run, someone calls ThreadPoolJob::signalJobShouldExit on the just-added job
  • when ThreadPool::runNextJob gets to run, the first loop where it iterates to find a job doesn’t come up with anything because job->shouldStop is true

So far this seems fine, the job never runs. I don’t see that it ever gets removed from the jobs array but perhaps that’s not so bad.

The struggle comes when someone calls ThreadPool::waitForJobToFinish waiting for the aborted job to finish, potentially waiting forever. jobFinishedSignal is never signalled so it hangs.

I don’t see a particularly easy way around this…or at least one I can guarantee is less racy than what the code currently does. Any ideas, or at least confirmation that I’ve found a hole?

Thanks much.

-DB

Umm…I haven’t looked at the ThreadPoolJob code but one solution is to change job identifiers from a small integer into a reference counted object pointer. This way a job persists until the last reference is gone. So after a job is finished or gets signaled to exit, it can be put into a state suitable for inspection (i.e. usable in a call to waitForJobToFinish).

I believe that is the solution used by boost::thread and std::thread.

Hmm, this will require me putting on my threading-head and giving it some hard thought. Please remind me again if I don’t look at this soon!

Ping?

A workaround is to use boost::thread and re-implement ThreadPool yourself. It should take less than a day.

The ThreadPool class needed a bit of a spring-clean… I’ve been through and tidied it up now, hopefully sorting out this problem in the process. Let me know if you still have any trouble with it!

Interesting, a quick and pain-free solution. But how do you know when the right time to call removeAllJobs () with deleteInactive==true is?

That method no longer has a deleteInactive argument. Each job now knows whether it should be deleted, so there’s no need to tell it what to do. (God knows why I didn’t write it that way in the first place!)