Multi-core support

I’m still looking through the JUCE documentation/examples, trying to decide if the framework is the way to go for my next project.
One major point for me is: how does JUCE support multiple CPU cores?
Take the AudioProcessorGraph: will it process the entire graph in one thread, or will it split it up and render parallel connections in multiple threads?
If not, what are my options for doing this myself?
I would like to host lots of plugins, many of them connected to each other, so running the whole thing on a single thread is not a great idea…

I’ve done a multiprocessor version of AudioProcessorGraph, but it’s not entirely ready yet (no PDC or MIDI so far).

Interesting! Do you plan to release the source, or integrate it with JUCE? Did you notice any improvement in terms of speed?

That sounds great! If it would be any motivation, I’m willing to pay you to complete this, given that it does what it should.
Since you probably know the graph way better than I do: can you do stuff like “route back”? An example:
Send output 1 of node A to both node B and node C, then send the output of node C back to input 8 of node A.
Same goes for MIDI…
Also, what is your scheme for dividing the work among threads? I mean, how is it split up, and how do you decide the number of threads?

Thank you for your help

[quote=“Nikolai”]Since you probably know the graph way better than I do: can you do stuff like “route back”? An example:
Send output 1 of node A to both node B and node C, then send the output of node C back to input 8 of node A.[/quote]

This makes no sense to me… at that point you will no longer have an acyclic graph. How would this even work? There would have to be a delay, or else you’d have an infinite loop.

[quote]This makes no sense to me… at that point you will no longer have an acyclic graph. How would this even work? There would have to be a delay, or else you’d have an infinite loop.[/quote]
Ah, I forgot to mention that part: yes, there will of course be a delay of one buffer.
I have done this in my own “framework”, so I need it in order to port to JUCE.
I did it pretty simply: just add (mix) the output of node C to the input of node A when C is processed.
In the next cycle/buffer, the other inputs are added to node A, so the output of C is mixed into A with a one-buffer delay.
I don’t know if that explains it?
Also, I guess your reply indicates that this is not possible in JUCE at this time.

It’s entirely possible, you just need to do it using two nodes which share the same internal object. So in your example, the output of C would go to the input of a new node D. There would be another new node E which feeds into input 8 of A.

Internally, nodes D and E point to the same implementation object, which keeps a delay buffer and applies whatever effects you want.

JUCE must be presented with an acyclic graph for the audio processor graph to work correctly, so it’s up to you to arrange your nodes in a way that preserves the acyclic property.
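A minimal sketch of that two-node idea, with `std::shared_ptr` standing in for the shared implementation object. The JUCE AudioProcessor glue is omitted and all names here are made up; the point is only that D and E are separate objects referring to one shared buffer:

```cpp
#include <cassert>
#include <memory>
#include <vector>

// The implementation object both stub nodes point to; it keeps the
// one-block delay buffer.
struct SharedDelay
{
    std::vector<float> buffer;
    explicit SharedDelay (int blockSize)
        : buffer ((size_t) blockSize, 0.0f) {}
};

// "Node D": receives the output of C and stores it in the shared buffer.
struct WriterStub
{
    std::shared_ptr<SharedDelay> shared;
    void processBlock (const std::vector<float>& in) { shared->buffer = in; }
};

// "Node E": feeds the previously stored data into input 8 of A.
struct ReaderStub
{
    std::shared_ptr<SharedDelay> shared;
    void processBlock (std::vector<float>& out) { out = shared->buffer; }
};
```

Because E reads what D wrote in the previous cycle, the graph the host sees stays acyclic while the audio still loops back with a one-buffer delay.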

Thanks TheVinn, that’s great.
I think I understood it.
It’s a bit hard to wrap my head around this when I’m so used to my old framework.
Thanks for helping me out.

I’m afraid that having several Nodes refer to the same AudioProcessor wouldn’t work, because the Node object has ownership of the underlying processor:

In class Node (juce_AudioProcessorGraph.h):

The audio processor graph deletes and creates nodes as it sees fit, so when two nodes reference the same processor, it can be deleted at any time and become a dangling pointer in the other node.

Jules, maybe it would help to replace the ScopedPointer with another, non-owning pointer (regular, ref-counted, whatever you see fit) to allow more advanced manipulations of the AudioProcessorGraph.

Of course it would work: just make an AudioProcessor subclass that holds a reference to a separately allocated implementation object. Then you can easily have 2, 3, or any number of AudioProcessor objects which refer to the same implementation.

Actually, looking over the AudioProcessor interface, there’s quite a bit to it. It might not be as simple as I thought.

Making it so that the same AudioProcessor instance can live in two or more different Node objects is not going to be particularly helpful, because in processBlock() you have no way of knowing for which node you are being called. And there’s no realistic way of adding this information, because plugins could never take advantage of it.

I think that in order to make this work it would be necessary to write a medium-sized AudioProcessor subclass that specifically addresses the use case of being hooked into multiple locations within the graph.

Okay, looking over AudioProcessor, I think it is possible to make this class:

[code]
/** Creates a delayed cycle in the AudioProcessorGraph. */
class CircularAudioProcessor : public ReferenceCountedObject
{
public:
    CircularAudioProcessor (OptionalPointer <AudioProcessor> audioProcessor,
                            int numInputs, int numOutputs);

    AudioProcessor* createInput();
    AudioProcessor* createOutput();
};
[/code]

The createInput() function will return a pointer to an internally created AudioProcessor object that can be inserted into the AudioProcessorGraph to receive output from a node. The node takes ownership of the pointer (the implementation maintains a reference to the CircularAudioProcessor).

Similarly, the createOutput() function returns a pointer to an internally created AudioProcessor object that can be inserted into the AudioProcessorGraph to send its output to the input of another node.

The constructor of CircularAudioProcessor takes the AudioProcessor object you provide, which will perform the filtering. The inputs and outputs are passed through to the underlying filter. The implementation of CircularAudioProcessor will have to create two subclassed AudioProcessor instances with stubs that collect the data from the filter and provide it on the appropriate inputs/outputs.

When it’s ready, I will open-source it.

Yes, dramatically, but that was something I expected: we have modern 4/6-core processors now, and the graph only uses one.

No, this would only be possible if we introduce one block of latency in the feedback path.
If you want a zero-delay feedback path: I remember there was a thread about this on the KVR developer forum, but that is a completely different topic…

My AudioProcessorGraph uses the same methods but has a different internal design; that’s why I requested an abstract interface for AudioProcessorGraph a while ago (I will send it to Jules when it’s ready).

Isn’t making a multi-core implementation of the existing JUCE AudioProcessorGraph a straightforward task? It seems that one could just go into buildRenderingSequence and tag each rendering op in the ordered nodes array with a single integer, which is the count of subsequent inputs which depend on the output of that node.

Once you have that information you can just change the for loop in AudioProcessorGraph::processBlock to a series of ParallelFor loops based on the counts.

The current implementation of AudioProcessorGraph already sorts the nodes by order of dependency, i.e. it already forces nodes to render before other nodes that need their output. This is more than half of the information we need to go multi-core. Or am I missing something?
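To make the “series of parallel-for loops” idea concrete, here is a hedged sketch (standard library only, no JUCE; the graph representation is made up). It assigns each node a “level”, one more than the deepest of its inputs; all nodes of the same level could then be handed to one parallel-for pass:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// edges[i] lists the nodes that consume node i's output. Nodes are assumed
// to already be indexed in dependency order (inputs before outputs), which
// is what the sorted rendering sequence gives us.
std::vector<std::vector<int>> groupByLevel (const std::vector<std::vector<int>>& edges)
{
    std::vector<int> level (edges.size(), 0);

    // A node's level is one more than the deepest node feeding it.
    for (size_t i = 0; i < edges.size(); ++i)
        for (int out : edges[i])
            level[(size_t) out] = std::max (level[(size_t) out], level[i] + 1);

    int maxLevel = edges.empty() ? -1
                                 : *std::max_element (level.begin(), level.end());

    std::vector<std::vector<int>> groups ((size_t) (maxLevel + 1));
    for (size_t i = 0; i < edges.size(); ++i)
        groups[(size_t) level[i]].push_back ((int) i);

    return groups;
}
```

Each returned group would be one parallel-for pass, with a barrier between passes. As discussed below, this wastes time when nodes in one level take very different amounts of time.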

There might be structures where a “series of parallel-for loops” will waste time. Imagine you have two chains, one with a single node and one with two nodes, both ending in one output node “+”:

  a ---------\
              +---> out
  b ---> c --/

“a” and “b” might be processed in parallel, but if you put “c” in the next parallel-for of the series, it will only start after both a and b have finished.
But what happens when processing “a” takes much more time than processing “b”? You lose time, because “c” could have been processed earlier.

So it’s also important to know how long processing a node will take, and that is something you can’t pre-calculate.

I’m using a more self-organizing (brute-force) structure that constantly checks whether a node is ready, and starts other nodes which have enough ready inputs. That might add a little organizing overhead, but on the other hand it has less unused processing time.

Hmm, not sure if my example is a good one, because the rendering sequence might be sorted differently, so that “a” and “b”+“c” end up in separate parallel-for loops; but anyway, you might get the point that the time a node’s processing takes is also important.

I think it’s safe to assume that processBlock() for each AudioProcessor executes in roughly the same amount of time.

Yes, I see what you mean now. It is not possible to group “runs” of nodes in the sorted list, because of the example that you pointed out. The order could be:

b, c, a, +

b, c, a cannot be processed simultaneously. Only b and a can be processed in parallel, which breaks the “consecutive nodes” concept.

Okay, thinking about it a little more, perhaps it could be implemented with a “ready for processing” stack. When a thread in the thread group needs work, it grabs the next node off the stack. When a thread is finished with a node, it decrements an atomic counter on each of that node’s output nodes, representing “the number of inputs remaining to be processed for that output node.” If the atomic decrement reaches zero (only one thread will see this), the output node is pushed onto the stack as “ready for processing”.

So here’s an algorithm:

  1. Initialize stack to “empty”
  2. For each Node in the sorted list:
    • set Atomic inputsRemaining = numInputs ()
    • if (inputsRemaining == 0) then push (Node) to stack
  3. Worker thread logic:
    • pop Node from stack, call processBlock()
    • for each outputNode: inputsRemaining.decrement ()
    • if (outputNode.inputsRemaining==0) then push (outputNode) to stack

The initialization runs in O(n) over the nodes, but that can be eliminated by resetting inputsRemaining after a node’s processBlock() completes.
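The steps above can be sketched with the standard library alone. This is a toy model, not audio code: the graph shape is made up, incrementing an int stands in for processBlock(), and a mutex-guarded vector stands in for a real lock-free stack. The spin-wait is deliberately naive:

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

struct Node
{
    std::vector<int> outputs;                 // nodes fed by this one
    std::atomic<int> inputsRemaining { 0 };
    int value = 0;                            // stands in for the audio buffer
};

// Very small thread-safe stack of "ready for processing" node indices.
struct ReadyStack
{
    std::mutex m;
    std::vector<int> items;

    void push (int i)
    {
        std::lock_guard<std::mutex> l (m);
        items.push_back (i);
    }

    bool pop (int& i)
    {
        std::lock_guard<std::mutex> l (m);
        if (items.empty()) return false;
        i = items.back();
        items.pop_back();
        return true;
    }
};

void processGraph (std::vector<Node>& nodes, int numThreads)
{
    ReadyStack ready;
    std::atomic<int> remaining { (int) nodes.size() };

    // Steps 1-2: count inputs, seed the stack with input-free nodes.
    std::vector<int> inputCounts (nodes.size(), 0);
    for (auto& n : nodes)
        for (int out : n.outputs)
            ++inputCounts[(size_t) out];

    for (size_t i = 0; i < nodes.size(); ++i)
    {
        nodes[i].inputsRemaining.store (inputCounts[i]);
        if (inputCounts[i] == 0)
            ready.push ((int) i);
    }

    // Step 3: worker thread logic.
    auto worker = [&]
    {
        int idx;
        while (remaining.load() > 0)
        {
            if (! ready.pop (idx))
                continue;                     // spin until work appears

            Node& n = nodes[(size_t) idx];
            ++n.value;                        // stands in for processBlock()

            // One input of each output node is now done; the thread that
            // sees the counter hit zero pushes that node as ready.
            for (int out : n.outputs)
                if (--nodes[(size_t) out].inputsRemaining == 0)
                    ready.push (out);

            --remaining;
        }
    };

    std::vector<std::thread> pool;
    for (int t = 0; t < numThreads; ++t)
        pool.emplace_back (worker);
    for (auto& th : pool)
        th.join();
}
```

In a real-time context the spin-wait and the mutex would of course have to be replaced with a wait-free scheme, but the counter logic is the same.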

I guess this is just a streamlined equivalent of what you were saying about “checking for a ready node.”

A reverb or physical-modelling plug-in may use 1000x more CPU than a delay; I think there is nothing you can assume :wink:

I am just using a number of threads (numCPUs - 1) plus the callback thread itself; they iterate through all the nodes (using an atomic++ to get the next node to process).
That might use 0.0000001% more CPU power than a pre-sorting algorithm, but… who cares :wink:
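The atomic++ claiming scheme can be sketched like this (standard library only; the readiness checks of the real implementation are left out, and incrementing an int stands in for processing a node). Each thread atomically grabs the next index, so every node is claimed by exactly one thread:

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Each worker claims the next unclaimed node index with an atomic
// post-increment, so no node is processed twice and none is skipped.
void renderAll (std::vector<int>& nodeWork, int numThreads)
{
    std::atomic<size_t> next { 0 };

    auto worker = [&]
    {
        for (size_t i = next++; i < nodeWork.size(); i = next++)
            nodeWork[i] += 1;   // stands in for processing node i
    };

    std::vector<std::thread> pool;
    for (int t = 0; t < numThreads; ++t)
        pool.emplace_back (worker);
    for (auto& th : pool)
        th.join();
}
```

The nice property of this over fixed per-thread ranges is automatic load balancing: a thread stuck on an expensive node simply claims fewer nodes overall.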

Well, what I mean is that it doesn’t matter to the algorithm whether one plugin consumes much more time than another; the order of processing is still the same.

Yes and no: the word “order” implies that there is a serial structure. So it’s true for the serial parts, but if you run several serial parts in parallel, the “order” in which nodes complete is unpredictable :wink: