Multi-threaded-core mixer

I have been working on a mixer for live use.  It is open-ended in the number of channels and the number of mixes.  Each channel strip is open-ended in the number of EQ filters that can be applied.  It also has VST plugin capability.  So I am trying to get it to work on more than one core.

 

I have an example setup for testing, with 32 channels, 6 filters, and 10 stereo mixes. My hardware is Sonic Core Scope running at a 44.1 kHz sample rate with 3 ms latency.

Running this configuration in a single thread (compiled in debug mode on Windows), I see a CPU load of about 50%.  This is measured using the performance counters, comparing the time used with the time between ASIO callbacks.
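A minimal sketch of that load measurement, for anyone who wants to reproduce it. The class name `CallbackLoadMeter` is hypothetical, and `std::chrono::steady_clock` is used as a portable stand-in for the Windows performance counters mentioned above; the original code is not shown in this thread, so this is only one way it might look.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: fraction of the callback period that was spent processing.
class CallbackLoadMeter
{
public:
    using Clock = std::chrono::steady_clock;

    // Call at the top of the audio callback.
    void callbackStarted (Clock::time_point now)
    {
        if (hasPrevious)
            period = now - previousStart;   // time between two callbacks

        previousStart = now;
        hasPrevious = true;
    }

    // Call just before the audio callback returns.
    void callbackFinished (Clock::time_point now)
    {
        assert (now >= previousStart);
        busy = now - previousStart;         // time used by this callback
    }

    // 0.0 = idle, 1.0 = the whole period was used.
    // Returns 0.0 until two callbacks have been seen.
    double load() const
    {
        if (period.count() <= 0)
            return 0.0;

        return std::chrono::duration<double> (busy).count()
             / std::chrono::duration<double> (period).count();
    }

private:
    Clock::time_point previousStart {};
    Clock::duration period { 0 };
    Clock::duration busy { 0 };
    bool hasPrevious = false;
};
```

In a real callback you would pass `CallbackLoadMeter::Clock::now()` to both calls. A callback that takes 1.5 ms out of a 3 ms period gives a load of 0.5.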

 

But I am testing on a 4-core CPU and it will be used on a 6-core system later, so I decided to split the work between threads.  This has been an interesting exercise that I have not completely solved yet.

What I did was create cpu-1 threads instead of one (so with 4 cores I get 2 extra threads).  The ASIO callback thread is the controlling thread.  It sets a ThreadLock that the other 2 threads block on.  When ASIO calls back, the callback thread releases the lock and the 2 waiting threads wake up.  I use atomic increments and decrements for the threads to control the tasks.  If the ASIO callback thread finishes first, it must wait for the other threads.  I am using a spin lock for this.  When all tasks are complete, the 2 extra threads block on the mutex again and the ASIO thread returns.  (There is some intermediate synchronization involved too.)
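The scheme described above might be sketched roughly as follows. This is an illustration under assumptions, not the original code: `BlockDispatcher` and its members are hypothetical names, a `std::condition_variable` stands in for the ThreadLock, and the "intermediate synchronization" is left out. Note that the callback thread takes a mutex to wake the workers, which is exactly the kind of call that real-time guidelines warn about.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical class: persistent workers that share one block's tasks with the callback thread.
class BlockDispatcher
{
public:
    explicit BlockDispatcher (unsigned numWorkers)
    {
        for (unsigned i = 0; i < numWorkers; ++i)
            workers.emplace_back ([this] { workerLoop(); });
    }

    ~BlockDispatcher()
    {
        {
            std::lock_guard<std::mutex> lock (mutex);
            quit = true;
        }
        wake.notify_all();
        for (auto& w : workers)
            w.join();
    }

    // Called once per audio block; returns when task(0) .. task(numTasks - 1) have all run.
    void run (int numTasks, const std::function<void (int)>& task)
    {
        assert (numTasks >= 0);
        {
            std::lock_guard<std::mutex> lock (mutex);
            currentTask = &task;
            taskCount = numTasks;
            nextTask.store (0, std::memory_order_relaxed);
            workersBusy.store (static_cast<int> (workers.size()), std::memory_order_relaxed);
            ++generation;
        }
        wake.notify_all();   // release the sleeping workers
        claimTasks();        // the callback thread takes tasks too

        // Spin until every worker has finished with this block.
        while (workersBusy.load (std::memory_order_acquire) != 0) {}
    }

private:
    // Each thread claims the next unclaimed task index with an atomic increment.
    void claimTasks()
    {
        for (;;)
        {
            const int i = nextTask.fetch_add (1, std::memory_order_relaxed);
            if (i >= taskCount)
                return;
            (*currentTask) (i);
        }
    }

    void workerLoop()
    {
        unsigned seen = 0;
        for (;;)
        {
            {
                std::unique_lock<std::mutex> lock (mutex);
                wake.wait (lock, [&] { return quit || generation != seen; });
                if (quit)
                    return;
                seen = generation;
            }
            claimTasks();
            workersBusy.fetch_sub (1, std::memory_order_release);
        }
    }

    std::mutex mutex;
    std::condition_variable wake;
    std::vector<std::thread> workers;
    const std::function<void (int)>* currentTask = nullptr;
    int taskCount = 0;
    unsigned generation = 0;
    bool quit = false;
    std::atomic<int> nextTask { 0 };
    std::atomic<int> workersBusy { 0 };
};
```

The callback waits for the workers themselves to check in, not just for the task count to reach zero. That way no worker can still be claiming tasks from the previous block when the next block's counters are reset.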

This actually works quite well.  I see the load drop to about 20%, which is about what I expected.  However, I get occasional dropouts, whereas single-threaded operation is rock solid.

I think my problem is that the threads may not be assigned to different CPUs, depending on current system demand.  For multithreading to be really useful, I think the threads must be on different cores, or all bets are off as to whether this will work well.  Everything I read about thread affinity says "don't do it".  Also, I am not sure how much latency is involved with thread locks.  I assume they are built on standard Windows messages.

I did not use the Juce thread classes (I was lazy and used standard Windows stuff I was familiar with).

Here is the question.  If you have many tasks and multiple threads, and the tasks must be done in groups sequentially (so all threads work on channel processing, then wait for each other, then all do mixes and wait, then do VSTs), how should this be done?  Any suggestions?
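One common shape for "work in groups, wait between groups" is a reusable barrier between the stages. The sketch below is only an illustration of that idea: `SpinBarrier` and `processBlock` are hypothetical names, a single float stands in for each audio buffer, and the threads are created per call purely to keep the example short (a real mixer would keep them alive between blocks).

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical reusable spin barrier: nobody passes until all numThreads have arrived.
class SpinBarrier
{
public:
    explicit SpinBarrier (int numThreads) : total (numThreads), waiting (numThreads) {}

    void arriveAndWait()
    {
        const unsigned gen = generation.load (std::memory_order_acquire);

        if (waiting.fetch_sub (1, std::memory_order_acq_rel) == 1)
        {
            // Last thread to arrive: reset the count, then open the barrier.
            waiting.store (total, std::memory_order_relaxed);
            generation.fetch_add (1, std::memory_order_release);
        }
        else
        {
            while (generation.load (std::memory_order_acquire) == gen) {}
        }
    }

private:
    const int total;
    std::atomic<int> waiting;
    std::atomic<unsigned> generation { 0 };
};

// Runs the three stages on numThreads threads (the calling thread is one of them).
std::vector<float> processBlock (int numThreads, int numChannels, int numMixes)
{
    assert (numThreads > 0);
    std::vector<float> channelOut (numChannels, 0.0f);
    std::vector<float> mixOut (numMixes, 0.0f);
    SpinBarrier barrier (numThreads);

    auto worker = [&] (int t)
    {
        // Stage 1: channel strips (stand-in for EQ filtering), interleaved across threads.
        for (int c = t; c < numChannels; c += numThreads)
            channelOut[c] = float (c + 1);

        barrier.arriveAndWait();   // every channel must be finished before mixing

        // Stage 2: each mix reads all channels.
        for (int m = t; m < numMixes; m += numThreads)
        {
            float sum = 0.0f;
            for (float v : channelOut)
                sum += v;
            mixOut[m] = sum * float (m + 1);
        }

        barrier.arriveAndWait();   // every mix must be finished before the last stage

        // Stage 3: stand-in for plug-in processing, using a different split of the mixes.
        for (int m = numMixes - 1 - t; m >= 0; m -= numThreads)
            mixOut[m] *= 0.5f;
    };

    std::vector<std::thread> threads;
    for (int t = 1; t < numThreads; ++t)
        threads.emplace_back (worker, t);

    worker (0);

    for (auto& th : threads)
        th.join();

    return mixOut;
}
```

The barrier spins, so it only makes sense when each thread really has a core to itself; with more threads than free cores a blocking barrier would be the safer choice.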

I know this is a long post, but not sure how much info I need to get the question asked properly.

 

http://www.rossbencina.com/code/real-time-audio-programming-101-time-waits-for-nothing

Thanks, I had already read this.  But it is not really helpful.  When you have multiple threads working together, there is no getting around the fact that someone will have to wait for something.  This article basically says it is impossible, and maybe that is the answer.  If the program had complete control of a core (i.e., Windows could not take it away), it could be done.

It seems like this would be possible...  Say you have 6 cores.  Create 4 threads, and use thread affinity to force them onto different CPUs.  Then have each thread run at 100%.  Never do a system call; only do the processing and spin forever waiting for the controlling thread.  You could guarantee the controlling thread never blocked longer than the given time (assuming there are enough cycles to finish all the tasks).  As long as there is one extra core for the OS to use, this should work.

This would obviously be wasting lots of CPU, but if the system is dedicated to that functionality only, it doesn't matter.  After all, that's what the OS does with idle time: burn CPU cycles.
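A rough sketch of the pinned, always-spinning worker idea, under stated assumptions: `pinCurrentThreadToCore` and `runBlocks` are hypothetical names, the pinning uses `SetThreadAffinityMask` on Windows and `pthread_setaffinity_np` on Linux, and on other platforms it simply reports failure. Pinning does not stop the OS from scheduling something else on that core, so this shows the mechanics only, not a guarantee.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

#if defined(_WIN32)
 #include <windows.h>
#elif defined(__linux__)
 #include <pthread.h>
 #include <sched.h>
#endif

// Hypothetical helper: pin the calling thread to one core.
// Returns false if the platform is unsupported or the request was refused.
bool pinCurrentThreadToCore (unsigned core)
{
#if defined(_WIN32)
    return SetThreadAffinityMask (GetCurrentThread(), DWORD_PTR (1) << core) != 0;
#elif defined(__linux__)
    cpu_set_t set;
    CPU_ZERO (&set);
    CPU_SET (core, &set);
    return pthread_setaffinity_np (pthread_self(), sizeof (set), &set) == 0;
#else
    (void) core;
    return false;
#endif
}

// Demo: hand numBlocks blocks to one pinned, spinning worker and wait for each in turn.
// Returns the number of blocks the worker processed.
long runBlocks (int numBlocks, unsigned core)
{
    assert (numBlocks >= 0);
    std::atomic<int> command { 0 };      // 0 = idle, 1 = process one block, 2 = quit
    std::atomic<long> blocksDone { 0 };

    std::thread worker ([&]
    {
        pinCurrentThreadToCore (core);   // best effort; the loop works either way

        for (;;)                         // never sleeps, never makes a system call
        {
            const int c = command.load (std::memory_order_acquire);
            if (c == 2)
                return;

            if (c == 1)
            {
                // ... process this thread's share of the block here ...
                command.store (0, std::memory_order_relaxed);
                blocksDone.fetch_add (1, std::memory_order_release);
            }
        }
    });

    for (int i = 0; i < numBlocks; ++i)
    {
        command.store (1, std::memory_order_release);                 // "start" flag
        while (blocksDone.load (std::memory_order_acquire) <= i) {}   // controlling thread spins
    }

    command.store (2, std::memory_order_release);
    worker.join();
    return blocksDone.load();
}
```

In the demo the controlling thread spins as well; in the setup described above that would be the ASIO callback thread waiting at the end of the block.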

It's certainly possible - we did it for tracktion's rendering engine and it works really well.

I don't think there's any need to worry about affinities as long as you use a thread pool of an appropriate size, but using lock-free data structures is the important bit. You mentioned using a spin-lock, and I'd recommend against that - WaitableEvents or mutexes are generally better for this kind of thing.


In my multiprocessor version of AudioProcessorGraph I use WaitableEvents to signal when a thread has finished, but there are situations where the audio callback thread itself has to wait, and using a spinlock gives me much better results and lower possible latencies.

I have not finished the MIDI part, but as long as you only use audio connections, this should work well :)

https://github.com/jchkn/ckjucetools/tree/master/AudioProcessorGraphMultiThreaded

More Info:

http://www.juce.com/forum/topic/multithreaded-audioprocessorgraph-source-code

Would be great if something like this will be added to juce...  

 

 

 

 

To the OP : I'd certainly be interested to know how you progress with this, because it's something I've thought about but never really got round to trying.  My interest would probably be more on iOS than Windows but the same principles apply.

To confirm we're talking about the same thing, I gather you want to do this:

1- At the beginning of a processing block, fork to several parallel tasks, each on its own thread (depending upon available cores)

2- Let each task run to completion without locks or waits

3- Join the tasks.  This will involve waiting for the task which takes the longest.

4- Do all of the above in a short timeframe (eg. 3ms) and never ever go over!

 

Number 2 should be the 'easy' bit - as long as each task is given a local copy of the data it needs, it should be able to run to completion without a single lock or wait.

It's the forking/joining method and the need to do it reliably in 3ms I wasn't sure about.  Take the 'fork' for example - if a task thread is sleeping (eg. waiting on an event), and the event is signalled, how quickly will the task start?  Does it wait until the next scheduler quantum (which I think is >5ms on Windows) or is it much more instant?  If it's much more instant, what sort of times are we talking about and can they actually be quantified?  If it's not instant, an alternative would be for the worker threads to stay awake and poll in a tight loop for a 'start' flag, but that's going to burn CPU like nobody's business ...
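The wake-up time can at least be measured on a given machine. Here is a small sketch, assuming a `std::condition_variable` as the "event" (a JUCE WaitableEvent or a Windows event may behave differently) and a hypothetical function name `worstWakeLatencyMicros`. The figure it returns depends entirely on the OS, the load and the thread priorities, so treat it as a probe, not as a specification.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Signals a sleeping thread `iterations` times and returns the worst observed delay,
// in microseconds, between the signal being set and the thread running again.
double worstWakeLatencyMicros (int iterations)
{
    assert (iterations > 0);
    using Clock = std::chrono::steady_clock;

    std::mutex m;
    std::condition_variable cv;
    int signalled = 0;            // number of signals sent so far
    int handled = 0;              // number of signals the worker has seen
    Clock::time_point signalTime;
    double worst = 0.0;

    std::thread worker ([&]
    {
        for (int i = 1; i <= iterations; ++i)
        {
            std::unique_lock<std::mutex> lock (m);
            cv.wait (lock, [&] { return signalled >= i; });

            const auto woke = Clock::now();
            worst = std::max (worst,
                              std::chrono::duration<double, std::micro> (woke - signalTime).count());
            handled = i;
            lock.unlock();
            cv.notify_all();      // acknowledge, so the next round can start
        }
    });

    for (int i = 1; i <= iterations; ++i)
    {
        // Give the worker time to actually go to sleep on the condition variable.
        std::this_thread::sleep_for (std::chrono::milliseconds (1));
        {
            std::lock_guard<std::mutex> lock (m);
            signalTime = Clock::now();
            signalled = i;
        }
        cv.notify_all();

        std::unique_lock<std::mutex> lock (m);
        cv.wait (lock, [&] { return handled >= i; });
    }

    worker.join();
    return worst;
}
```

Running it with a few thousand iterations while the machine is under typical load gives a feel for whether the worst case fits inside a 3ms budget.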

Maybe this would be of interest ? http://calvados.di.unipi.it/dokuwiki/doku.php?id=ffnamespace:about

Looks interesting.  Have you used it?

Have a look at the PPL (bundled with Visual Studio 2010 and later), or the portable variant pplx in the Casablanca REST framework.

There is also a pplpp (PPL Power Pack) project on Codeplex that brings in very useful new functionality.

You should get by with the Task<> classes, and for synchronisation, use the when_all() continuation (in pplpp).

You will get wait-free code, expressed as Tasks rather than threads, running on a thread pool with work stealing etc, that will in addition be asynchronous in nature and allows for the proper sequence of processing, while being able to parallelize what can be parallelized at a given point in time.

Task<> is an extension to std::future<>, and it is expected the PPL model of task-based execution will be in C++14.

http://msdn.microsoft.com/en-us/library/dd504870.aspx

http://msdn.microsoft.com/en-us/library/jj987780.aspx

http://pplpp.codeplex.com/

And, just for clarification, the Task<> model, just as the std::future<> model, takes in the actual task code to execute as function objects, i.e. Functors, Lambdas or raw function pointers. I personally like to use Lambdas.
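To illustrate the "task code as a lambda" style without depending on PPL, here is a sketch using only the standard library. It is an analogue, not PPL itself: `std::async` and `std::future` stand in for the Task<> classes, the loop over `get()` stands in for a when_all() continuation, and `sumChannels` is a hypothetical name. Unlike the thread-pool model described above, `std::async` with `std::launch::async` typically starts a new thread per task and `get()` blocks, so this shows the programming model only and is not suitable for an audio callback as written.

```cpp
#include <cassert>
#include <future>
#include <numeric>
#include <vector>

// Sums each channel in its own task, then joins all tasks.
std::vector<float> sumChannels (const std::vector<std::vector<float>>& channels)
{
    std::vector<std::future<float>> pending;

    // "Fork": one task per channel, with the work given as a lambda.
    for (const auto& ch : channels)
        pending.push_back (std::async (std::launch::async, [&ch]
        {
            return std::accumulate (ch.begin(), ch.end(), 0.0f);
        }));

    // "Join": collect every result, in channel order.
    std::vector<float> sums;
    for (auto& f : pending)
        sums.push_back (f.get());

    assert (sums.size() == channels.size());
    return sums;
}
```

For example, two channels holding {1, 2, 3} and {4, 5} give the sums 6 and 9.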

I am personally building a multithreaded plugin host on top of Juce with that, and so far it looks quite promising.

 

It's certainly possible - we did it for tracktion's rendering engine and it works really well.

Jules, did you decide not to share the code (which is perfectly understandable) or did you just not have time to retrofit it in Juce ? 

Yeah, the tracktion rendering graph bears no resemblance at all to the juce classes, so converging the two things isn't really an option.