Recommended way to offload DSP on multiple cores


#1

Hi all,

I’m working on a realtime audio application running on Linux. I’m using JUCE with ALSA directly, no JACK.

I have 4 cores and would like to spread the DSP I need to do across them. What would be the recommended way to do this? Right now I have 4 threads running at priority 10 (the ALSA thread seems to have priority 9). I signal the threads from the audio callback to start calculating audio samples and wait for them to finish, after which I do some minimal additional processing and then pass the samples to the callback buffer. In the DSP threads I have a sleep(1) call to keep the threads from consuming too much CPU.

This works pretty well but I do have occasional dropouts that sound like the threads were not ready in time. Is the above approach OK or should I be doing this differently?

I’m using a plain mainline 4.10 kernel with the I/O scheduler set to noop and the preemption model set to low-latency desktop. I have the scaling governor set to performance.

Thanks for helping!
B


#2

I think you should use only 3 worker threads, with the ALSA thread itself being used as a worker thread. And the sleep(1) seems dangerous; I would use pthread_cond_wait & co instead.


#3

I’ve written that same thing twice, once for Tracktion’s parallel mixer, and once for Equator (on realtime linux in the Seaboard grand, very much like what you’re doing).

It’s really really hard to get it to work without edge-cases causing glitches! The tricky bit is that it has to be completely lock-free. But yes, use one thread per core, no more, including your main ALSA thread, which should be the master, and don’t use sleep(), don’t do any allocation. Good luck…


#4

Hi @bschiett,

You are definitely on the right track. As others mentioned, I recommend that you synchronise your threads using a pthread condition variable.

Consider that sleep() is ALWAYS a bad idea, especially in application code. If you need to use sleep(), something is wrong. :slight_smile:

Like Jules said, we solved this on the Seaboard GRAND, but it took a lot of profiling and debugging to figure out how to optimise it so we don’t miss audio frames. I can tell you that, like any optimisation, it is very system-specific.

Why did you change the I/O Scheduler?

Have you noticed any difference after setting the preemption model to low-latency?

That being said, a normal JUCE application should run just fine on vanilla Linux without any specific configuration.


#5

Thanks, do you mean pthread_cond_wait on the side of the audio callback and pthread_cond_signal on the side of the worker threads?


#6

Thanks, it was working pretty well on mainline 4.8 but now on 4.10 these problems came up. So I first looked at changing kernel settings. Changing the IO scheduler to noop improved things noticeably but changing the preemptive model to low latency didn’t do much for me.

I had a call to set the process to realtime priority at the beginning of my code, so I removed that, thinking it was a bad idea to boost the entire application, and added calls to startThread(10) for my worker threads instead of just calling startThread(). I checked the JUCE code and found that startThread(9) is called when the device is opened, so right now my worker threads have slightly higher priority than the ALSA thread. Maybe it should be the other way around, with the ALSA thread having the highest priority?


#7

I have it lock free right now, but don’t I have to use a mutex if I want to use pthread_cond_wait / signal? At least that is what I see in the examples for those functions when I google around.


#8

and also pthread_cond_signal at the beginning of the audio callback to wake up the worker threads and tell them there is work to do


#9

Yes, you do have to use a mutex, and you also have to maintain an external flag (the predicate) alongside the pthread condition variable. From the manual:

When using condition variables there is always a Boolean predicate involving shared variables associated with each condition wait that is true if the thread should proceed. Spurious wakeups from the pthread_cond_timedwait() or pthread_cond_wait() functions may occur. Since the return from pthread_cond_timedwait() or pthread_cond_wait() does not imply anything about the value of this predicate, the predicate should be re-evaluated upon such return.
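For reference, a minimal sketch of what that predicate loop can look like. The names `WorkSignal`, `postWork` and `waitForWork` are made up for illustration and are not from JUCE or from any of the products mentioned above. Note that the signalling side takes the mutex briefly, so this is not lock-free in the sense discussed earlier in the thread.

```cpp
#include <pthread.h>

// Hypothetical helper type: a counter guarded by a mutex and a condition variable.
struct WorkSignal
{
    pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
    pthread_cond_t  cond  = PTHREAD_COND_INITIALIZER;
    int pendingBlocks = 0;   // the shared predicate: "is there work to do?"
};

// Audio callback side: publish one block of work and wake one worker.
void postWork (WorkSignal& s)
{
    pthread_mutex_lock (&s.mutex);
    ++s.pendingBlocks;
    pthread_mutex_unlock (&s.mutex);
    pthread_cond_signal (&s.cond);
}

// Worker side: sleep until work is available, then claim one block.
void waitForWork (WorkSignal& s)
{
    pthread_mutex_lock (&s.mutex);

    // Re-check the predicate after every wakeup, because
    // pthread_cond_wait() may return spuriously.
    while (s.pendingBlocks == 0)
        pthread_cond_wait (&s.cond, &s.mutex);

    --s.pendingBlocks;
    pthread_mutex_unlock (&s.mutex);
}
```

A worker thread would call `waitForWork()` at the top of its loop in place of sleep(1), and the audio callback would call `postWork()` once per worker at the start of each block.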

About the threading I recommend the following:

  • Main application and misc threads run at non-RT priority.
  • Main audio thread (the ALSA thread created by JUCE) is already RT (SCHED_RR) at a high priority.
  • Other audio processing threads should run as RT at the same priority as the main audio thread, because JUCE uses SCHED_RR, under which the scheduler preempts a thread when a higher-priority thread wants to run and time-slices between threads of the same priority.

With all that said, I recommend you change your JUCE code to use SCHED_FIFO instead.
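As a rough illustration of the SCHED_FIFO switch, here is a sketch using the plain pthread calls rather than anything JUCE-specific. The function names `setThreadFifo` and `threadIsFifo` are hypothetical, and where exactly you would call this in your own code or in JUCE is left open.

```cpp
#include <algorithm>
#include <cerrno>
#include <pthread.h>
#include <sched.h>

// Hypothetical helper: switch a thread to SCHED_FIFO at the given priority.
// Returns 0 on success, or an error number (typically EPERM when the
// process is not allowed to use realtime scheduling).
int setThreadFifo (pthread_t thread, int priority)
{
    const int lowest  = sched_get_priority_min (SCHED_FIFO);
    const int highest = sched_get_priority_max (SCHED_FIFO);

    sched_param param {};
    // Keep the requested priority inside the range the system accepts.
    param.sched_priority = std::max (lowest, std::min (priority, highest));

    return pthread_setschedparam (thread, SCHED_FIFO, &param);
}

// Hypothetical helper: report whether a thread is currently SCHED_FIFO.
bool threadIsFifo (pthread_t thread)
{
    int policy = 0;
    sched_param param {};
    return pthread_getschedparam (thread, &policy, &param) == 0
        && policy == SCHED_FIFO;
}
```

Calling `setThreadFifo (pthread_self(), 10)` from inside each worker is one way to use it. On most systems this needs root, CAP_SYS_NICE, or a suitable rtprio limit for the user, so it is worth checking the return value.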

On the low-latency policy, I don’t see any benefits either. The normal scheduler policy already does very well most of the time, hence its soft real-time characteristic. If you really want low latency with no deadline misses, you need to use the PREEMPT_RT patch with the full real-time scheduler policy.


#10

Thanks for the explanation. Using the pthread condition wait functions still seems to perform worse than just using an Atomic and checking it in a while loop to block in the worker threads. Isn’t it a better idea to use CPU affinities and isolcpus at boot time to prevent the threads from being preempted by the scheduler? Of course this is not an option when running a synth on a PC or Mac, but on embedded Linux it should be possible?
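To make the "Atomic in a while loop" idea concrete, here is one possible shape for it using std::atomic. This is only a guess at the approach, with hypothetical names (`SpinGate`, `releaseWorkers`, `waitForNextBlock`), and not the actual code from this project. A busy-wait like this burns a full core while waiting, so it really only makes sense on cores reserved for the audio threads.

```cpp
#include <atomic>

// Hypothetical gate: the audio callback bumps a counter once per block,
// and each worker spins until the counter moves past the value it last saw.
struct SpinGate
{
    std::atomic<unsigned> generation { 0 };
};

// Audio callback side: announce that a new block is ready.
void releaseWorkers (SpinGate& gate)
{
    gate.generation.fetch_add (1, std::memory_order_release);
}

// Worker side: busy-wait until a new block has been announced.
// Returns the new generation, which the worker passes in next time.
unsigned waitForNextBlock (SpinGate& gate, unsigned lastSeen)
{
    unsigned current = gate.generation.load (std::memory_order_acquire);

    while (current == lastSeen)
        current = gate.generation.load (std::memory_order_acquire);

    return current;
}
```

Each worker keeps its own `lastSeen` value, starting at 0, and updates it with the return value after every block.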


#11

Yes, CPU affinity is good because it optimises the scheduler and also ensures that as much of that particular CPU’s cache as possible gets used. It won’t work any miracles, though.
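A small sketch of pinning a thread to a core, assuming Linux with glibc (pthread_setaffinity_np is a non-portable GNU extension). The helper names `pinThreadToCpu` and `firstAllowedCpu` are made up for this example.

```cpp
#include <pthread.h>
#include <sched.h>

// Hypothetical helper: restrict a thread to a single CPU.
// Returns 0 on success or an error number.
int pinThreadToCpu (pthread_t thread, int cpu)
{
    cpu_set_t set;
    CPU_ZERO (&set);
    CPU_SET (cpu, &set);
    return pthread_setaffinity_np (thread, sizeof (set), &set);
}

// Hypothetical helper: lowest-numbered CPU the thread may run on, or -1.
int firstAllowedCpu (pthread_t thread)
{
    cpu_set_t set;
    CPU_ZERO (&set);

    if (pthread_getaffinity_np (thread, sizeof (set), &set) != 0)
        return -1;

    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
        if (CPU_ISSET (cpu, &set))
            return cpu;

    return -1;
}
```

With 4 cores, each worker could call `pinThreadToCpu (pthread_self(), n)` with its own core number n at startup. Combined with isolcpus on the kernel command line, that keeps other tasks off those cores.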

The idea is that step by step you can get to the performance and latency you want.