Wait for all jobs to finish in ThreadPool

Hi guys, I'm trying to multi-threadify a plugin I'm working on. Basically it's got a load of 'sound particles' flying around and I'm trying to spread their processing over all cores in order to up the maximum number you can have. Below is the code I'm using. Each particle operates completely independently from the others. Their output is accumulated into a buffer that is zeroed at the start. The while loop after supposedly waits for all the thread jobs to finish before allowing the main audio thread to continue - and indeed it does because I've checked and after the while loop the number of jobs in the threadpool is always zero.

Then why am I getting crackling in the output? It sounds like the while loop isn't waiting for them to finish. It's definitely not waiting too long and causing buffer underruns, because the reverb that gets applied later reverbs the crackles. Also, if I set the ThreadPool to 1 thread, all is dandy.

Any ideas?

Thanks.


//clear buffer ready for summing
memset(outBlock[0], 0, sizeof(outBlock[0]));
memset(outBlock[1], 0, sizeof(outBlock[1]));

// process the audio of each sound particle
for(int i = 0; i < processor->particleSystem->soundParticles.size(); i++)
{
    SoundParticle* soundParticle = processor->particleSystem->soundParticles[i];
    ParticleJob* job = new ParticleJob(*this, *soundParticle, outBlock);
    particleThreadPool.addJob(job, true);
}

while (particleThreadPool.getNumJobs() > 0)
{
    // wait
    particleWait.wait(2);               
}

There are lots of problems with this approach. Most notably, you're allocating the jobs on the heap in the audio callback. More importantly, you're waiting for 2 ms each time the jobs haven't finished. That's a crazy amount of time in the audio domain (a 512-sample block at 44.1 kHz lasts only about 11.6 ms), and those milliseconds are wasted completely if a job finishes just after the wait is called.

What you need to do is create a list of the jobs to be run that is reused (i.e. no allocating) then pass that to a list of threads that pull the jobs off in a lock-free way and process them. You can then have a WaitableEvent that signals when all the jobs are finished to return from the calling thread.
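
To make the suggestion above a bit more concrete, here is a minimal sketch of the "reusable job list plus lock-free pulling" idea in plain standard C++. None of it is JUCE or Tracktion code: ParticleTask, JobList and renderBlockDemo are made-up names, and the per-worker scratch buffers are an assumption added so that no two threads ever accumulate into the same memory (unsynchronised += into one shared output buffer from several threads is a data race). The demo spawns and joins its threads inside the function only to stay self-contained. In a real callback the threads would already be running, and something like a WaitableEvent would signal completion instead of join().

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for one sound particle's work.
struct ParticleTask
{
    float gain;

    void render (float* dest, int numSamples) const
    {
        for (int i = 0; i < numSamples; ++i)
            dest[i] += gain;                       // accumulate into the caller's buffer
    }
};

// Job list that is filled once (outside the callback) and reused for every block.
struct JobList
{
    std::vector<ParticleTask> tasks;
    std::atomic<std::size_t> next { 0 };           // index of the next unclaimed task
    std::atomic<std::size_t> remaining { 0 };      // tasks not yet completed

    void beginBlock()
    {
        remaining.store (tasks.size(), std::memory_order_relaxed);
        next.store (0, std::memory_order_release);
    }

    // Claims tasks one at a time with fetch_add: no mutex, no allocation.
    void drain (float* scratch, int numSamples)
    {
        for (;;)
        {
            const std::size_t i = next.fetch_add (1, std::memory_order_acq_rel);
            if (i >= tasks.size())
                return;

            tasks[i].render (scratch, numSamples);
            remaining.fetch_sub (1, std::memory_order_release);
        }
    }

    bool finished() const { return remaining.load (std::memory_order_acquire) == 0; }
};

// Demo only: returns the mixed value of sample 0 after every task has run once.
float renderBlockDemo (int numTasks, int numWorkers, int numSamples)
{
    JobList jobs;
    jobs.tasks.assign (static_cast<std::size_t> (numTasks), ParticleTask { 1.0f });

    // One scratch buffer per worker, so workers never write to the same memory.
    std::vector<std::vector<float>> scratch (static_cast<std::size_t> (numWorkers),
                                             std::vector<float> (static_cast<std::size_t> (numSamples), 0.0f));
    jobs.beginBlock();

    std::vector<std::thread> workers;
    for (std::size_t w = 0; w < scratch.size(); ++w)
        workers.emplace_back ([&jobs, &scratch, w, numSamples]
                              { jobs.drain (scratch[w].data(), numSamples); });

    for (auto& t : workers)
        t.join();

    assert (jobs.finished());

    float mixed = 0.0f;                            // single-threaded mix-down at the end
    for (const auto& s : scratch)
        mixed += s[0];
    return mixed;
}
```

Calling renderBlockDemo (64, 4, 16) runs 64 tasks across 4 workers. The point of the atomic index is that a fast worker simply claims more tasks, so nothing has to be divided up in advance.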

Yes, Dave is speaking from experience here - in Tracktion we have a similar situation in our parallel mixer, and it took days of painful lock-free tweaking to create an algorithm that actually works in real-time. Honestly, it's very non-trivial to implement this correctly, and I doubt whether the ThreadPool could be used for it, as it uses mutexes internally.

Yes, to further what Jules said, be aware that in a plugin, multi-threading your code may not make much of a performance difference. Hosts can split up activity over cores fairly well because they need to process lots of tracks, which can be done in parallel. However, in a plugin, if you take up all those cores you're effectively undoing all the hard work the host does to multi-thread the tracks. You also introduce a lot more context switches, cache misses etc. as you're hopping all over the CPU.

I'm not saying that it won't be faster (it almost certainly will if you only have a single instance of it in a session), but be aware that this sort of thing is unlikely to scale linearly and is very tough to get right. There may be other, simpler places you can optimise first.

+1

For ROLI Equator, we also hit this exact same thing, and ended up multi-threading its voice engine in the standalone app version, but running it single-threaded in the plugin version.


Thanks for the replies. Right, ok. Well, maybe I'll leave it single-threaded. But ignoring the hideous crackles, I can basically octuple (on my machine) the max number of particles since it's not rinsing a single core. So it's quite desirable, as I've optimised the s##t out of everything else.

Also, it's a 'big' plugin and you wouldn't really expect many instances of it in the DAW. 

So if I was going to try it (against two expert opinions), you'd say firstly not use ThreadPool, and basically fire off a bunch of threads and keep track of when they finish? Basically, ThreadPool but not?

 

If I were you, I would check 

https://www.threadingbuildingblocks.org/

 

Some general tips

- keep dependencies between the threads to a minimum

- start/stop the threads before the audio processing and reuse them

- don't use more threads than there are physical processor cores

- use the audio callback thread itself for calculation (instead of waiting)

- use a hard spin lock in the audio callback thread (not the one from the JUCE library) if there is nothing more to do and there are still threads that haven't finished.

Ideally the audio callback thread should always calculate the biggest parts, so that the other threads will presumably finish earlier.
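
For what it's worth, the tips above could be sketched roughly like this in standard C++. This is only an illustration under assumptions, not a tested real-time implementation: HelperThreads, RangeFn, renderRange and demoBlocks are hypothetical names, the "two shares for the audio thread" split is an arbitrary choice, and idle helpers simply poll with yield(), which burns CPU between blocks. The helper threads are created once, each one gets its own slot so the threads share as little as possible, the calling thread renders the largest range itself, and it then hard-spins only on helpers that are still busy.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

using RangeFn = void (*) (void* context, std::size_t begin, std::size_t end);

class HelperThreads
{
public:
    explicit HelperThreads (std::size_t numHelpers) : slots (numHelpers)
    {
        for (auto& s : slots)                       // started once, reused for every block
            s.thread = std::thread ([this, &s] { helperLoop (s); });
    }

    ~HelperThreads()
    {
        quit.store (true, std::memory_order_release);
        for (auto& s : slots)
            s.thread.join();
    }

    // Call from the audio thread. fn must be safe to run concurrently on disjoint ranges.
    void run (RangeFn fn, void* context, std::size_t count)
    {
        assert (fn != nullptr);
        ++block;
        const std::size_t share = count / (slots.size() + 2);   // audio thread keeps two shares
        std::size_t begin = 0;

        for (auto& s : slots)
        {
            s.fn = fn;  s.context = context;
            s.begin = begin;  s.end = begin + share;
            begin += share;
            s.posted.store (block, std::memory_order_release);  // hand the range over
        }

        fn (context, begin, count);                 // the audio thread does the biggest part

        for (auto& s : slots)                       // hard spin on any helper still working
            while (s.finished.load (std::memory_order_acquire) != block) {}
    }

private:
    struct Slot
    {
        std::atomic<unsigned> posted { 0 };         // last block handed to this helper
        std::atomic<unsigned> finished { 0 };       // last block this helper completed
        RangeFn fn = nullptr;
        void* context = nullptr;
        std::size_t begin = 0, end = 0;
        std::thread thread;
    };

    void helperLoop (Slot& s)
    {
        unsigned seen = 0;
        for (;;)
        {
            const unsigned posted = s.posted.load (std::memory_order_acquire);
            if (posted == seen)
            {
                if (quit.load (std::memory_order_acquire))
                    return;
                std::this_thread::yield();          // nothing to do yet
                continue;
            }
            seen = posted;
            s.fn (s.context, s.begin, s.end);
            s.finished.store (seen, std::memory_order_release);
        }
    }

    std::atomic<bool> quit { false };
    std::vector<Slot> slots;
    unsigned block = 0;                             // only touched by the audio thread
};

// Stand-in for per-particle DSP: every index writes only to its own element.
static void renderRange (void* context, std::size_t begin, std::size_t end)
{
    auto* perParticle = static_cast<float*> (context);
    for (std::size_t i = begin; i < end; ++i)
        perParticle[i] += 1.0f;
}

float demoBlocks (std::size_t numHelpers, std::size_t numParticles, int numBlocks)
{
    HelperThreads helpers (numHelpers);
    std::vector<float> perParticle (numParticles, 0.0f);

    for (int b = 0; b < numBlocks; ++b)
        helpers.run (renderRange, perParticle.data(), numParticles);

    float total = 0.0f;
    for (float v : perParticle)
        total += v;
    return total;
}
```

With demoBlocks (3, 100, 10), each helper gets 20 particles per block and the calling thread takes the remaining 40. How many helpers to create is left to the caller. Note that std::thread::hardware_concurrency() reports logical rather than physical cores, so it is only a starting point for that decision.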

 

 

Great, thanks for all the tips guys!