Multicore Support

Hi Jules,

your AudioProcessorGraph really is a great tool. With some 20-50 or more plugins being hosted, I wonder whether it would make sense to distribute the render work of a graph across multiple CPU cores.

As it is implemented now, the entire chain of render ops is executed on a single thread. For a big graph with many plugins, this could get tight, i.e. the whole chain might not complete within the render block time slice.

In theory, the graph could be split into parallel chains, each of which is rendered on a separate thread (thread pool) running on a different CPU core. The final result is then merged from the output buffers of each thread.

As straightforward as this sounds, it is likely a lot more complex. I would expect the hardest part to be the intelligent splitting of the graph, but that could be done manually (programmatically) to some extent. In a typical mixer metaphor, one could split the graph roughly based on channel strips, for example.
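To make the idea concrete, here is a minimal sketch in plain C++ (no JUCE; all names like `renderChainsInParallel` are invented for the example): each independent chain fills its own buffer on a worker thread, and the results are summed at the end, like a mixer's summing bus.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

using Buffer = std::vector<float>;
using RenderFn = std::function<void(Buffer&)>; // renders one block into a buffer

Buffer renderChainsInParallel(const std::vector<RenderFn>& chains,
                              std::size_t blockSize)
{
    std::vector<Buffer> results(chains.size(), Buffer(blockSize, 0.0f));
    std::vector<std::thread> workers;

    // Fan out: one worker thread per independent chain.
    for (std::size_t i = 0; i < chains.size(); ++i)
        workers.emplace_back([&, i] { chains[i](results[i]); });

    // Wait at the "output node" for all partial results to arrive.
    for (auto& w : workers)
        w.join();

    // Merge: sum the per-chain buffers into the final output block.
    Buffer out(blockSize, 0.0f);
    for (const auto& r : results)
        for (std::size_t s = 0; s < blockSize; ++s)
            out[s] += r[s];

    return out;
}
```

A real implementation would of course reuse a thread pool instead of spawning threads per block, but the fan-out/join/merge shape is the same.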

Questions:

  • Are sub-graphs the best way to implement this, or are there performance penalties that suggest it could be better to implement a specialized graph class?

  • How would I make sure that threads in a pool are equally distributed across CPU cores?

  • Am I completely idiotic, missing the most basic things and dreaming of impossible things?

Any thoughts are welcome!

Andre

It’s a complicated problem, and I’ve never had the chance to think it through very carefully.

The OS will probably do a pretty good job of that, but you could play around with affinity settings if necessary.

No, it’s definitely harder than it sounds, but not impossible!

The simplest solution would be to run every plugin’s processing routine in its own separate thread and connect the audio streams together via FIFO buffers.

Another option is to find an algorithm that creates a plan describing which plugins can run serially or in parallel with others, and then run that plan for every block.
(Somehow I like the first idea because it’s so simple, but maybe it has a little overhead.)

And of course, at junction points the FIFO has to wait until all samples for a given position have arrived before proceeding.
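A rough sketch of that FIFO hand-off in plain C++ (names like `BlockFifo` and `mixJunction` are invented; a production version would be lock-free, but a mutex plus condition variable shows the waiting behaviour at junction points):

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

// One FIFO per connection between plugin threads.
struct BlockFifo
{
    void push(std::vector<float> block)
    {
        { std::lock_guard<std::mutex> lock(m); q.push_back(std::move(block)); }
        cv.notify_one();
    }

    std::vector<float> pop() // blocks until a whole block has arrived
    {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this] { return !q.empty(); });
        auto b = std::move(q.front());
        q.pop_front();
        return b;
    }

    std::mutex m;
    std::condition_variable cv;
    std::deque<std::vector<float>> q;
};

// A junction node mixing two upstream FIFOs: it has to wait until both
// inputs have delivered their samples for the current position.
std::vector<float> mixJunction(BlockFifo& a, BlockFifo& b)
{
    auto x = a.pop();
    auto y = b.pop();
    for (std::size_t i = 0; i < x.size(); ++i)
        x[i] += y[i];
    return x;
}
```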

Each AudioProcessor on its own thread is certainly simple but not optimal, because the communication and synchronisation overhead would be huge.

The graph is a pull model. It should not be too difficult to come up with an algorithm that divides a graph into subgraphs that may run in parallel. I have a mixer metaphor in mind, so my thinking might be too simple, but it is definitely a solvable problem. One might run into issues with a lot of side chains and such.

It boils down to “find the maximum number of distinct paths (partial, the longer the better) backwards from the output that do not meet”. Render everything else, starting from the graph input, on the main render thread first, then fan out to the other threads and wait at the output for all results to arrive - done.
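The "do not meet" test can be sketched as a backwards reachability check (plain C++, invented names, not a JUCE API): from each direct input of the output node, collect the whole upstream cone; two cones that share no node are distinct paths that could render on separate threads.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Reverse adjacency: node -> its inputs.
using RevGraph = std::map<std::string, std::vector<std::string>>;

// All nodes reachable by walking backwards from `node` (inclusive).
std::set<std::string> upstreamCone(const RevGraph& g, const std::string& node)
{
    std::set<std::string> seen;
    std::vector<std::string> stack { node };
    while (!stack.empty())
    {
        auto n = stack.back(); stack.pop_back();
        if (!seen.insert(n).second)
            continue; // already visited
        auto it = g.find(n);
        if (it != g.end())
            for (const auto& input : it->second)
                stack.push_back(input);
    }
    return seen;
}

// True if the two branches never meet (no shared upstream node),
// i.e. they may be rendered in parallel.
bool branchesAreDisjoint(const RevGraph& g,
                         const std::string& a, const std::string& b)
{
    auto ca = upstreamCone(g, a);
    for (const auto& n : upstreamCone(g, b))
        if (ca.count(n))
            return false;
    return true;
}
```

A shared side-chain source is exactly the case where two channel strips stop being disjoint, which matches the concern above.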

Of course, this may result in a network that costs more performance than a straight render sequence, so there must also be a validation function that estimates the total cost of each suggested solution. If there is any solution that is better than the straight render sequence, it is taken.

A pretty straightforward generate-and-test decision-making algorithm.
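The generate-and-test step might look something like this sketch (plain C++; the names and the fixed synchronisation-overhead constant are assumptions, and real per-chain costs would come from measured per-plugin CPU load): a candidate plan's parallel cost is its slowest chain plus a synchronisation penalty, and a plan is only taken if it beats the straight render sequence.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// A "plan" is a list of per-chain cost estimates (e.g. milliseconds).
double serialCost(const std::vector<double>& plan)
{
    return std::accumulate(plan.begin(), plan.end(), 0.0);
}

double parallelCost(const std::vector<double>& plan, double syncOverhead)
{
    // The block is done when the slowest chain finishes.
    return *std::max_element(plan.begin(), plan.end()) + syncOverhead;
}

// Returns the index of the cheapest candidate plan, or -1 if the straight
// serial render sequence is still the best option.
int pickBestPlan(const std::vector<std::vector<double>>& candidates,
                 double straightCost, double syncOverhead)
{
    int best = -1;
    double bestCost = straightCost;
    for (int i = 0; i < (int) candidates.size(); ++i)
    {
        double c = parallelCost(candidates[i], syncOverhead);
        if (c < bestCost) { bestCost = c; best = i; }
    }
    return best;
}
```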

He he, sounds tempting, doesn’t it? :twisted:

Ah, btw: The audio buffers are so small (512 samples, typically), they can be passed between threads in one piece. I would not use sample queues.

Yes, or just switch the pointer, though that won’t work if you do something like sample-accurate automation.
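The pointer-switching hand-off could be as simple as this sketch (plain C++; `BlockMailbox` is an invented name): the producer publishes a filled block with an atomic pointer exchange, and the consumer takes the whole 512-sample block in one piece instead of queueing individual samples.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Single-slot mailbox for passing one audio block between two threads.
struct BlockMailbox
{
    std::atomic<std::vector<float>*> slot { nullptr };

    void publish(std::vector<float>* block)
    {
        slot.store(block, std::memory_order_release);
    }

    std::vector<float>* take() // spins until the producer has published
    {
        std::vector<float>* b = nullptr;
        while ((b = slot.exchange(nullptr, std::memory_order_acquire)) == nullptr)
            std::this_thread::yield();
        return b;
    }
};
```

In a real render callback you would block-wait on a semaphore rather than spin, but the point stands: only a pointer changes hands, never the samples themselves.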

Another thing: if we use one thread per plugin, serial plug-in chains would also benefit from multi-core processing :wink:

It might be instructive to consider how Reaper implements anticipatory FX processing. I haven’t been able to track down much info on it other than this quote from SoS magazine:

[quote=“ans”]Each AudioProcessor on its own thread is certainly simple but not optimal, because the communication and synchronisation overhead would be huge.
[/quote]

I can confirm that too many threads would be counterproductive. A pool approach is best, especially if you fine-tune the number of simultaneous threads to the hardware architecture: ideally one thread per core, keeping one core available for synchronisation work.
Thread affinity is most efficient if you have a NUMA (http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access) compliant machine. I think OSes like Solaris support that by default; Linux needs libnuma. I don’t know about the other systems, but seeing OSX’s Grand Central Dispatch, I’d say you’d better let the OS decide.
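The pool-sizing rule above can be sketched in a few lines of standard C++ (function names invented): one worker per core, minus cores reserved for the audio callback and synchronisation, with a conservative fallback because `std::thread::hardware_concurrency()` is allowed to return 0 on some platforms.

```cpp
#include <cassert>
#include <thread>

// Workers to use given a reported core count and how many cores we
// want to keep free (audio callback, message thread, ...).
unsigned workerCountFor(unsigned reportedCores, unsigned reservedCores)
{
    // Unknown topology, or fewer cores than we want to reserve:
    // fall back to a single worker rather than underflowing.
    if (reportedCores <= reservedCores)
        return 1;
    return reportedCores - reservedCores;
}

unsigned workerCount(unsigned reservedCores)
{
    return workerCountFor(std::thread::hardware_concurrency(), reservedCores);
}
```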

I like this idea.
What if the paths meet? Would you process everything on the main thread, then only split once you passed the meeting node?

What if graph topology changes during playback? I guess the heuristic computation shouldn’t be triggered in such cases?

And if different threads work on different buffers, we wouldn’t even need any locking, would we?

I like such crazy ideas! :twisted:

That would for sure be a must-have for everybody developing hosts (as opposed to developing plugins).
I suppose Jules doesn’t have time at all to design something like this, but there are quite a few host developers on this forum. What if we tried to collaborate, under the supervision of our fearless leader (just to make sure we aren’t changing code he is also changing), and came up with a working solution?

We could try to design, code and test a MultithreadedAudioProcessorGraph and submit it to Jules for inclusion in Juce, maybe also addressing other issues in the AudioProcessorGraph, for example (correct me if I’m wrong) the fact that it doesn’t take plugin latencies into account?

We would have an AudioProcessorGraph on steroids then, multi-core ready, and that’d be a F**** good audio engine!
What do you guys think?

I totally agree with you about the pool approach, but I’m not sure about the number of threads in the pool. There is the high-priority audio thread, but also the message thread, so that leaves us numberOfCores-2 threads for plugin processing. Starting from 4 cores, that would already be interesting!

[quote]We would have an AudioProcessorGraph on steroids then, multi-core ready, and that’d be a F**** good audio engine!
What do you guys think?[/quote]

I think it’s a great idea!

The main problem I see is that, if we want it to be integrated in Juce someday, we can’t use Boost or TBB or any 3rd party lib …
It should be doable in 100% juce though. Who’s in? (The vinn: hint :wink: )

Unfortunately my time is very limited, but I’d love to participate in the discussion for sure. Basically the problem is dividable into three areas:

  1. Dividing a graph into parallel subgraphs, both algorithmic and manual. IMO it is very important to offer a manual API, if only for testing.
  2. An efficient mechanism for passing buffers between threads (lock-free, allocation-free, whatnot…)
  3. Ensuring that all threads really stay separate and do not get in each other’s way by using shared data that spoils the parallelism.
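For point 1, the manual API could have roughly this shape (a speculative sketch in plain C++; none of these names exist in JUCE): the host explicitly assigns each node to a "lane", nodes sharing a lane render serially on one thread, and the partitioner just groups them, which is trivially testable.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// A manual assignment of one graph node to a rendering lane.
struct NodeAssignment
{
    std::string nodeName;
    int lane; // nodes sharing a lane render serially on one thread
};

// Groups node names by lane; assumes 0 <= lane < numLanes.
std::vector<std::vector<std::string>>
partitionByLane(const std::vector<NodeAssignment>& assignments, int numLanes)
{
    std::vector<std::vector<std::string>> lanes((std::size_t) numLanes);
    for (const auto& a : assignments)
        lanes[(std::size_t) a.lane].push_back(a.nodeName);
    return lanes;
}
```

An algorithmic partitioner could then produce the same `NodeAssignment` list automatically, so the manual and automatic paths share one render-side implementation.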

I would also not expect too much of a performance gain except for very large graphs and 4 or more CPUs, but I may be totally wrong. My 8-core Mac would certainly be very happy at least :wink:

In the short term, however, I’d be more interested in delay compensation. Timing issues are more obvious, because they already occur with a minimal number of plugins and can be quite confusing for the end user.

Maybe this thread should be merged with that one: http://www.rawmaterialsoftware.com/viewtopic.php?f=2&t=7344&p=41380&hilit=multicore#p41380 and (the end of) that one: http://www.rawmaterialsoftware.com/viewtopic.php?f=2&t=7020&hilit=PDC&start=75 ?

Apparently, plugin delay compensation is actually already implemented!

It seems that nobody really has time on their hands to deal with this at the moment.

I really suck at multithreading, but if someone can provide a simple design, I’ll be happy to write the code (the stupid part, which I’m good at :wink: )

+1. Sorry guys, but I’m a musician and I’m facing multithreading for the first time just now :roll:

Wonderful. I didn’t notice until yesterday. That’s a great achievement.

Have you guys seen that? http://www.boost.org/doc/libs/1_41_0/libs/graph_parallel/doc/html/index.html

It would probably be possible to build a MultithreadedAudioProcessorGraph on top of that. I suppose the dependency on Boost makes it unsuitable for inclusion in Juce, but we could add it to “Useful Tools and Components”. What do you think?

I doubt that a general graph library like this would be very helpful here. Porting a generic graph algorithm to Juce is not a problem either. The real problem is knowing exactly which portions of an audio graph make sense to be separated from each other. A collection of general graph tools cannot answer this question. Only we can, knowing all the details about the inner workings of an audio graph.

You’re absolutely right, BUT, I see it the other way round: with this library, the only remaining problem is knowing exactly which portions of an audio graph make sense to be separated from each other, which we can do, because we know all the details about the inner workings of an audio graph.

Starting from AudioProcessorGraph would make us deal with all the common multithreading issues ourselves, and we wouldn’t be sure it always works. In Boost, the work has been done and it’s probably very reliable :slight_smile: . Well, IMHO…