Multi-threaded processing, call for help


#1

Hi
I have been trying to improve performance by rendering my audio processors on multiple threads, but I’m running into some issues that I hope someone with more experience can help me figure out.
Here is a simplified explanation of the setup:
I have, let’s say, between 4 and 64 AudioProcessors that need to have their ProcessBlock called.
When not using my “multi thread setup” I would just loop over these in the AudioDevice callback and process each of them in turn.

Then I was thinking: what if I create threads (as many as there are CPU cores), divide the processors between them, and make them run in parallel?

The way I set it up is: “jobs” are pre-sorted to avoid conflicts/waiting, each thread loops through its jobs and then goes into wait(x).
The main callback loops over all the threads and polls whether they are done, and once all are done it moves on.
On the next callback it calls notify() on the threads to make them wake up again and re-run their jobs.
No critical sections are involved (from my side), just some atomics to get “job count” and “done count”.
It is working fine, but when pushing down the buffer size, dropouts occur much sooner than when just running them all in a single loop in the main callback. (There is a big improvement at higher buffer sizes, so there is still something to this.)
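
For readers who want something concrete to reason about, here is a minimal sketch of the setup described above, written with standard C++ threads rather than the JUCE Thread class. All names (BlockWorkers, Job, processBlock) are hypothetical, and the wake-up uses a std::condition_variable, which itself takes a mutex. That is an assumption about how wait()/notify() behave underneath, not a copy of the actual code.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical sketch: one worker thread per pre-sorted job list.
class BlockWorkers
{
public:
    using Job = std::function<void()>;

    // jobsPerWorker[i] is the job list owned exclusively by worker i.
    explicit BlockWorkers (std::vector<std::vector<Job>> jobsPerWorker)
        : jobs (std::move (jobsPerWorker))
    {
        assert (! jobs.empty());
        for (std::size_t i = 0; i < jobs.size(); ++i)
            threads.emplace_back ([this, i] { run (i); });
    }

    ~BlockWorkers()
    {
        {
            std::lock_guard<std::mutex> lock (mutex);
            shouldExit = true;
        }
        wake.notify_all();
        for (auto& t : threads)
            t.join();
    }

    // Called once per audio callback: wake every worker, then poll the done count.
    void processBlock()
    {
        doneCount.store (0, std::memory_order_relaxed);
        {
            std::lock_guard<std::mutex> lock (mutex);
            ++generation; // tells the workers that a new block has started
        }
        wake.notify_all();

        while (doneCount.load (std::memory_order_acquire) < static_cast<int> (jobs.size()))
            std::this_thread::yield();
    }

private:
    void run (std::size_t index)
    {
        unsigned seen = 0;
        for (;;)
        {
            {
                std::unique_lock<std::mutex> lock (mutex);
                wake.wait (lock, [&] { return shouldExit || generation != seen; });
                if (shouldExit)
                    return;
                seen = generation;
            }
            for (auto& job : jobs[index])
                job();
            doneCount.fetch_add (1, std::memory_order_release);
        }
    }

    std::vector<std::vector<Job>> jobs;
    std::vector<std::thread> threads;
    std::mutex mutex;
    std::condition_variable wake;
    unsigned generation = 0;
    bool shouldExit = false;
    std::atomic<int> doneCount { 0 };
};
```

Even in this small version the audio callback touches a mutex and relies on the OS scheduler to wake each worker in time, and both happen once per block regardless of buffer size.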

  • Are there any performance issues with wait() / notify()?

  • I see a lot of implementations don’t use notify() but rather just wait(1) or wait(3) and then keep the run loop going, but that won’t work, as 1 or 3 milliseconds is way too long to be sleeping if the buffer size is 16 / 32 / 64 etc.

  • Are there any other genius techniques I don’t know about?
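
To put numbers on the wait(1) / wait(3) concern, here is a small sketch of the arithmetic. The 48 kHz sample rate is an assumption for illustration (the post does not state one), and the function names are hypothetical.

```cpp
#include <cassert>

// Duration of one audio callback in microseconds.
inline double callbackPeriodMicros (int bufferSize, double sampleRate)
{
    assert (bufferSize > 0 && sampleRate > 0.0);
    return 1.0e6 * bufferSize / sampleRate;
}

// True if sleeping for sleepMicros would use up the whole callback period or more.
inline bool sleepOverrunsCallback (double sleepMicros, int bufferSize, double sampleRate)
{
    return sleepMicros >= callbackPeriodMicros (bufferSize, sampleRate);
}
```

At 48 kHz, buffers of 16, 32 and 64 samples last roughly 333, 667 and 1333 microseconds. A 1 ms sleep therefore overruns the first two outright and leaves only about a quarter of the period at 64 samples.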

I have used the PerformanceCounter in various places to see where the time is spent, and it reports something like:
Average = 90 microsecs, minimum = 31 microsecs, maximum = 4804 microsecs, total = 90
I guess the dropouts happen the few times it reaches 4804 microseconds.
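
Since the average hides exactly the spikes that matter, it may help to count how often a measurement exceeds the callback deadline rather than only tracking min/max/average. Below is a minimal sketch using std::chrono; TimingStats and timeInto are hypothetical helpers, not part of JUCE's PerformanceCounter, and the deadline value is whatever the caller supplies.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <limits>

// Hypothetical helper: accumulates timings and counts deadline overruns.
struct TimingStats
{
    explicit TimingStats (double deadline) : deadlineMicros (deadline) {}

    void add (double micros)
    {
        minMicros = std::min (minMicros, micros);
        maxMicros = std::max (maxMicros, micros);
        totalMicros += micros;
        ++runs;
        if (micros > deadlineMicros)
            ++overruns; // these are the runs likely to cause dropouts
    }

    double averageMicros() const
    {
        return runs == 0 ? 0.0 : totalMicros / static_cast<double> (runs);
    }

    double deadlineMicros;
    double minMicros = std::numeric_limits<double>::max();
    double maxMicros = 0.0;
    double totalMicros = 0.0;
    std::uint64_t runs = 0;
    std::uint64_t overruns = 0;
};

// Runs work() once and records how long it took.
template <typename Work>
void timeInto (TimingStats& stats, Work&& work)
{
    const auto start = std::chrono::steady_clock::now();
    work();
    const std::chrono::duration<double, std::micro> elapsed
        = std::chrono::steady_clock::now() - start;
    stats.add (elapsed.count());
}
```

Wrapping the "wait for workers" part of the callback in timeInto() would show how many blocks actually miss the deadline, which a single maximum value cannot show.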

I also tried running with only one thread in my “pool”, and that seems to work just as well as doing it in the main callback. More threads = more dropouts.

Threads are running with real-time priority, of course.
Same issue on OS X and Windows.
I can’t get much info out of the profilers either, as these are just short spikes when the threads take too long, not really a CPU usage issue.

Any ideas are VERY welcome.


#2

Thread::notify() calls WaitableEvent::signal() and Thread::wait() calls WaitableEvent::wait(). On POSIX/Mac these calls are protected internally with a mutex. I wonder what the Windows equivalents SetEvent()/WaitForSingleObject() do internally…
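
To illustrate the point about the mutex, here is how an auto-reset event is commonly built on top of a mutex and a condition variable. This is a generic sketch in standard C++ with a hypothetical class name, not the actual WaitableEvent source.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical auto-reset event built from a mutex and a condition variable.
class SimpleEvent
{
public:
    // Blocks until signal() has been called or the timeout elapses.
    // Returns true if the event was signalled.
    bool wait (std::chrono::milliseconds timeout)
    {
        std::unique_lock<std::mutex> lock (mutex);
        if (! condition.wait_for (lock, timeout, [this] { return signalled; }))
            return false;
        signalled = false; // auto-reset so the next wait() blocks again
        return true;
    }

    void signal()
    {
        {
            std::lock_guard<std::mutex> lock (mutex); // the caller takes a lock here
            signalled = true;
        }
        condition.notify_one();
    }

private:
    std::mutex mutex;
    std::condition_variable condition;
    bool signalled = false;
};
```

In a design like this, every signal() from the audio callback takes a lock, so the callback can briefly contend with a worker that is entering or leaving wait().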


#3

Hi Martin, thanks for looking at this.
Well, I meant that I have not explicitly set up any critical sections, for example to move data between my threads.
But you have a point about looking at the inner workings of these functions.


#4

Well, clearly, you don’t know about the issues of multithreading. The smaller the chunk, the more overhead you have. All these notify calls also cost up to a few microseconds, and you may end up with the congestion you are talking about, with too many threads competing for the same resources.
Unless you are developing an independent app, don’t multi-thread: in a plugin you will get dropouts because you will be competing with the DAW’s threads (and perhaps with other plugins that are just as bad at it as yours).


#5

Well, this is not a plugin but a standalone app, so my threads are "the DAW's threads".
But you are right, there are many things I don’t know.
Would it be better to measure the time spent in the callback, and then spin up threads once I get close to the callback interval?
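
One way the idea in that question could look, purely as a sketch: smooth the ratio of measured callback time to the callback interval, and switch modes with some hysteresis so it does not flip back and forth on every block. The class name, thresholds and smoothing factor are all hypothetical choices, not recommendations.

```cpp
#include <cassert>

// Hypothetical helper: decides whether the next block should use worker threads.
class LoadSwitch
{
public:
    LoadSwitch (double periodMicros, double enterRatio, double leaveRatio)
        : period (periodMicros), enter (enterRatio), leave (leaveRatio)
    {
        assert (periodMicros > 0.0 && leaveRatio < enterRatio);
    }

    // Feed the measured duration of the last callback.
    // Returns true if worker threads should be used for the next one.
    bool update (double elapsedMicros)
    {
        const double load = elapsedMicros / period;
        smoothed += smoothing * (load - smoothed); // simple one-pole average

        if (! useThreads && smoothed > enter)
            useThreads = true;
        else if (useThreads && smoothed < leave)
            useThreads = false;

        return useThreads;
    }

    double currentLoad() const { return smoothed; }

private:
    double period, enter, leave;
    double smoothing = 0.1;
    double smoothed = 0.0;
    bool useThreads = false;
};
```

For example, LoadSwitch (1333.0, 0.7, 0.4) would turn threading on once the smoothed load passes 70% of a 1333 microsecond period and off again below 40%. Note that the worker threads would still need to exist already, since creating them at the moment of the switch would cost far more than one callback.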


#6

Low latency and parallel processing with multiple threads are just not a good combination to attempt. Perhaps make it optional for the user? (Multithreaded processing with more buffering involved, or a low-latency mode that mostly runs on one CPU?)


#7

In that case, you may want to use a thread pool like Intel TBB, which is far better suited to task-based parallelism than a custom thread pool if you don’t know low-latency multithreading well enough.