Multi core support

nikolaiaudios · December 16, 2012, 6:07pm

Hi
I’m still looking through the JUCE documentation and examples, trying to decide whether the framework is the way to go for my next project.
One major point for me is how JUCE supports multiple CPU cores.
Take the AudioProcessorGraph: will it process the entire graph in one thread, or will it split it up and render parallel connections in multiple threads?
If not, what are my options for doing this?
I would like to host lots of plugins, many of them connected to each other, so running the whole thing on a single thread is not a great idea…

chkn · December 16, 2012, 6:36pm

I’ve done a multiprocessor version of AudioProcessorGraph, but it’s not entirely ready (no PDC and MIDI yet).

dinaiz · December 17, 2012, 9:46am

Interesting! Do you plan to release the source or integrate it with JUCE? Did you notice any improvement in terms of speed?

nikolaiaudios · December 19, 2012, 7:32pm

That sounds great! If it would be any motivation, I’m willing to pay you to complete this, given that it does what it should.
Since you probably know the graph way better than me, can you do stuff like “route back”? An example:
Send output 1 of node A to both node B and node C, then send the output of node C back to input 8 of node A.
The same goes for MIDI…
Also, what is your scheme for dividing the work between threads? I mean, how is it split up, and how do you decide the number of threads?

Thank you for your help
Nikolai

TheVinn · December 19, 2012, 7:35pm

[quote=“Nikolai”]Since you probably know the graph way better than me, can you do stuff like “route back”? An example:
Send output 1 of node A to both node B and node C, then send the output of node C back to input 8 of node A.[/quote]

This makes no sense to me…at that point you will no longer have an acyclic graph. How would this even work? There would have to be a delay, or else you’d have an infinite loop.

nikolaiaudios · December 19, 2012, 8:24pm

[quote=“TheVinn”]
This makes no sense to me…at that point you will no longer have an acyclic graph. How would this even work? There would have to be a delay, or else you’d have an infinite loop.[/quote]
Ah, I forgot to mention that part: yes, there will of course be a delay of one buffer.
I have done this in my own “framework”, so I need it in order to port to JUCE.
I did it pretty simply: just add (mix) the output of node C into the input of node A when C is processed.
In the next cycle/buffer, the other inputs are added to node A.
So the output of C is mixed into A with a “one-buffer delay”.
I don’t know if that explained it?
Also, I guess your reply indicates that this is not possible in JUCE at this time.

TheVinn · December 19, 2012, 8:29pm

It’s entirely possible; you just need to do it using two nodes which share the same internal object. So in your example, the output of C would go to the input of a new node D, and there would be another new node E which feeds into input 8 of A.

Internally, nodes D and E point to the same implementation object, which keeps a delay buffer and applies whatever effects you want.

JUCE must be presented with an acyclic graph for the audio processor graph to work correctly, so it’s up to you to arrange your nodes in a way that preserves the acyclic property.
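To make the D/E idea concrete, here is a minimal sketch in plain C++ that does not use JUCE at all. The names FeedbackBuffer, FeedbackSend, FeedbackReturn and renderBlock are hypothetical, and nodes A and C are reduced to a simple mix and a pass-through. It only illustrates how a sink and a source sharing one buffer give a one-block delay while the processing order stays acyclic (E, A, C, D).

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for the shared "implementation object":
// it holds one block of audio between two graph callbacks.
struct FeedbackBuffer
{
    explicit FeedbackBuffer (std::size_t blockSize) : samples (blockSize, 0.0f) {}
    std::vector<float> samples;
};

// Node D: a sink that stores the output of C for the next block.
struct FeedbackSend
{
    std::shared_ptr<FeedbackBuffer> shared;
    void process (const std::vector<float>& input) { shared->samples = input; }
};

// Node E: a source that plays back what D stored one block earlier.
struct FeedbackReturn
{
    std::shared_ptr<FeedbackBuffer> shared;
    void process (std::vector<float>& output) const { output = shared->samples; }
};

// One block of the acyclic chain E -> A -> C -> D.
// A mixes the external input with half of the feedback; C is a pass-through here.
std::vector<float> renderBlock (const FeedbackReturn& e, FeedbackSend& d,
                                const std::vector<float>& externalInput)
{
    std::vector<float> feedback;
    e.process (feedback);                       // E runs first: reads the previous block
    assert (feedback.size() == externalInput.size());

    std::vector<float> block (externalInput.size());
    for (std::size_t i = 0; i < block.size(); ++i)
        block[i] = externalInput[i] + 0.5f * feedback[i];   // node A

    d.process (block);                          // D runs last: stores this block
    return block;                               // output of C
}
```

Because E has no inputs and D has no outputs, a dependency sort always places E before A and D after C, so E reads the buffer before D overwrites it.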

nikolaiaudios · December 19, 2012, 8:37pm

Thanks TheVinn, that’s great.
I think I understood it.
It’s a bit hard to wrap my head around this when I’m so used to my old framework.
Thanks for helping me out

dinaiz · December 20, 2012, 2:53am

I’m afraid that having several Nodes refer to the same AudioProcessor wouldn’t work, because the Node object has ownership of the underlying processor:

In class Node (juce_AudioProcessorGraph.h) :

The audio processor graph deletes and creates nodes as it sees fit, so when two nodes reference the same processor, it can be deleted at any time and become a dangling pointer in the other node.

Jules, maybe it would help to replace the ScopedPointer with another, non-owning pointer (regular, ref-counted, whatever you see fit) to allow us more advanced manipulations with AudioProcessorGraph.

TheVinn · December 20, 2012, 6:09am

Of course it would work, just make an AudioProcessor subclass that holds a reference to a separately allocated implementation object. Then you can easily have 2, 3, or any number of AudioProcessor objects which refer to the same implementation.

Actually looking over the AudioProcessor interface, there’s quite a bit to the interface. It might not be as simple as I thought.

Making it so that the same AudioProcessor instance can live in two or more different Node objects is not going to be particularly helpful, because in processBlock() you have no way of knowing for which node you are being called. And there’s no realistic way of adding this information, because plugins could never take advantage of it.

I think that in order to make this work it would be necessary to write a medium size AudioProcessor subclass that specifically addresses the use-case of being able to be hooked into multiple locations within the graph.

TheVinn · December 20, 2012, 6:37am

Okay looking over AudioProcessor, I think it is possible to make this class:

/** Creates a delayed cycle in the AudioProcessorGraph.
*/
class CircularAudioProcessor : public ReferenceCountedObject
{
public:
  CircularAudioProcessor (OptionalPointer <AudioProcessor> audioProcessor, int numInputs, int numOutputs);
  AudioProcessor* createInput ();
  AudioProcessor* createOutput ();
  //...
};

The createInput() function will return a pointer to an internally created AudioProcessor object that can be inserted into the AudioProcessorGraph to receive output from a node. The node takes ownership of the pointer (the implementation maintains a reference to the CircularAudioProcessor).

Similarly, the createOutput() function returns a pointer to an internally created AudioProcessor object that can be inserted into the AudioProcessorGraph to send its output to the input of another node.

The constructor of CircularAudioProcessor takes the AudioProcessor object that you provide, which will perform the filtering. The inputs and outputs are passed to the underlying filter. The implementation of CircularAudioProcessor will have to create two instances of an AudioProcessor subclass with stubs that collect the data from the filter and provide it on the appropriate inputs/outputs.

chkn · December 20, 2012, 10:25am

When it’s ready, I will make it open.

Yes, dramatically, but that was something I expected: we have modern 4/6-core processors now, and the graph only uses one.

No, this would only be possible if we introduce one block size of latency in the feedback path.
If you want a zero-delay feedback path, I remember there was a thread about this in the KVR developer forum, but that is a completely different topic…

My AudioProcessorGraph uses the same methods but has a different internal design; that’s why I requested an abstract interface for AudioProcessorGraph a while ago (I will send it to Jules when it’s ready).

TheVinn · December 20, 2012, 2:56pm

Isn’t making a multi-core implementation of the existing JUCE AudioProcessorGraph a straightforward task? It seems that one could just go into buildRenderingSequence and tag each rendering op in the ordered nodes array with a single integer, which is the count of subsequent inputs which depend on the output of that node.

Once you have that information you can just change the for loop in AudioProcessorGraph::processBlock to a series of ParallelFor loops based on the counts.

The current implementation of AudioProcessorGraph already sorts the nodes by order of dependency, i.e. it already forces nodes to render before other nodes when their output is needed. This is more than half of the information we need to go multi-core. Or am I missing something?
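As a rough illustration of the "series of ParallelFor loops" idea, the sketch below groups an already sorted node list into stages, where every node in a stage depends only on earlier stages. Note that it tags each node with a stage index rather than the dependent count described above, and that groupIntoStages and the index-based input lists are hypothetical, not part of JUCE.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// inputs[i] lists the nodes feeding node i. The list is assumed to be sorted
// by dependency already, so every input index is smaller than i.
std::vector<std::vector<std::size_t>> groupIntoStages (
    const std::vector<std::vector<std::size_t>>& inputs)
{
    std::vector<std::size_t> stageOf (inputs.size(), 0);
    std::vector<std::vector<std::size_t>> stages;

    for (std::size_t i = 0; i < inputs.size(); ++i)
    {
        // A node belongs one stage after its latest input.
        for (auto in : inputs[i])
            stageOf[i] = std::max (stageOf[i], stageOf[in] + 1);

        if (stageOf[i] >= stages.size())
            stages.resize (stageOf[i] + 1);

        stages[stageOf[i]].push_back (i);
    }

    return stages;   // each inner vector could be handed to one parallel-for
}
```

Each stage would then run as one parallel loop, with a barrier between stages.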

chkn · December 20, 2012, 5:39pm

There might be structures where a “series of parallel-for loops” will waste time. Imagine you have two lines, one with one node and one with two nodes, both ending in one output.

___a______+___
          /
___b__c__/

“a” and “b” might be processed in parallel, but if you put “c” in the next parallel-for in the series, it will only be processed after both “a” and “b”.
But what happens when processing “a” takes much more time than processing “b”? You will lose time, because “c” could have been processed earlier.

So it’s also important to know how long processing a node will take, and this is something you can’t precalculate.

I’m using a more self-organizing (brute-force) structure that constantly checks whether a node is ready and starts other nodes which have enough ready inputs. That might add a little organizing overhead, but on the other hand it has less “unused” processing time.
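To put rough numbers on the a/b/c example, here is a small calculation. The costs are made-up time units, not measurements, and the model assumes at least two threads and ignores the final mix and any scheduling overhead. The names Costs, stagedTime and dependencyDrivenTime are hypothetical.

```cpp
#include <algorithm>

// Hypothetical per-block processing costs in arbitrary time units.
struct Costs { double a, b, c; };

// Two parallel-for stages in series: {a, b}, then {c}.
// Each stage has to wait for its slowest node.
double stagedTime (Costs k)
{
    return std::max (k.a, k.b) + k.c;
}

// Dependency-driven: c starts as soon as b has finished, alongside a.
double dependencyDrivenTime (Costs k)
{
    return std::max (k.a, k.b + k.c);
}
```

With a = 10, b = 1 and c = 1, the staged version needs 11 units while the dependency-driven one needs 10. The gap grows with the cost of whatever is stuck behind the slow node.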

chkn · December 20, 2012, 5:55pm

Mhhh, not sure if my example is a good one, because the rendering sequence might be sorted differently, so that “a” and “b”+“c” are in separate parallel-for loops. But anyway, you might get the point: the time that processing takes is also important.

TheVinn · December 20, 2012, 6:14pm

I think it’s safe to assume that processBlock for each AudioProcessor executes in roughly the same amount of time.

Yes, I see what you mean now. It is not possible to group “runs” of nodes in the sorted list, because of the example that you pointed out. The order could be:

b, c, a, +

b,c,a cannot be processed simultaneously. Only b+a can be processed in parallel, which breaks the “consecutive nodes” concept.

Okay, thinking about it a little bit more perhaps it could be implemented by having a “ready for processing” stack. When a thread in the thread group needs work, it grabs the next node off the stack. When a thread is finished with the node, it decrements an atomic counter for each of its output nodes representing “the number of inputs remaining to be processed for that output node.” If the atomic decrement reaches zero (only one thread will see this) then the output node is pushed onto the stack as “ready for processing”.

So here’s an algorithm:

Initialize stack to “empty”
For each Node in the sorted list:
- set Atomic inputsRemaining = numInputs ()
- if (inputsRemaining == 0) then push (Node) to stack
Worker thread logic:
- pop Node from stack, call processBlock()
- for each outputNode:
  - if (outputNode.inputsRemaining.decrement () == 0) then push (outputNode) to stack

The initialization runs in O(n) on nodes but that can be eliminated by resetting inputsRemaining after a Node’s processBlock() completes.
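The algorithm above could be sketched in standard C++ roughly as follows. This is not JUCE code: Node, renderGraph and numExtraThreads are hypothetical, process stands in for processBlock(), and the stack is guarded by a mutex and condition variable for brevity, where a real audio callback would more likely want a lock-free container. The graph is assumed to be acyclic.

```cpp
#include <atomic>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

struct Node
{
    std::function<void()> process;       // stands in for processBlock()
    std::vector<std::size_t> outputs;    // indices of the nodes fed by this one
};

void renderGraph (const std::vector<Node>& nodes, unsigned numExtraThreads)
{
    const std::size_t n = nodes.size();
    std::vector<std::atomic<int>> inputsRemaining (n);
    for (auto& count : inputsRemaining) count.store (0);
    for (const auto& node : nodes)
        for (auto out : node.outputs)
            inputsRemaining[out].fetch_add (1);

    std::mutex mutex;
    std::condition_variable cv;
    std::vector<std::size_t> ready;      // the "ready for processing" stack
    std::size_t finished = 0;            // guarded by mutex

    for (std::size_t i = 0; i < n; ++i)
        if (inputsRemaining[i].load() == 0)
            ready.push_back (i);

    auto worker = [&]
    {
        for (;;)
        {
            std::size_t index;
            {
                std::unique_lock<std::mutex> lock (mutex);
                cv.wait (lock, [&] { return ! ready.empty() || finished == n; });
                if (ready.empty()) return;           // every node has been processed
                index = ready.back();
                ready.pop_back();
            }

            nodes[index].process();

            // Only the thread whose decrement reaches zero sees the value 1 here.
            std::vector<std::size_t> nowReady;
            for (auto out : nodes[index].outputs)
                if (inputsRemaining[out].fetch_sub (1) == 1)
                    nowReady.push_back (out);

            {
                std::lock_guard<std::mutex> lock (mutex);
                ++finished;
                ready.insert (ready.end(), nowReady.begin(), nowReady.end());
            }
            cv.notify_all();
        }
    };

    std::vector<std::thread> threads;
    for (unsigned i = 0; i < numExtraThreads; ++i)
        threads.emplace_back (worker);
    worker();                                        // the calling thread helps too
    for (auto& t : threads) t.join();
}
```

In this sketch the counters are rebuilt on every call, which is the O(n) initialization mentioned above. A real implementation would also keep its worker threads alive between blocks instead of creating them per call.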

I guess this is just a streamlined equivalent of what you were saying about “checking for a ready node.”

chkn · December 20, 2012, 6:31pm

A reverb or physical modelling plug-in may use 1000x more CPU than a delay; I think there is nothing you can assume.

I am just using a number of threads (numCPUs-1) plus the callback thread itself. They just iterate through all nodes (using an atomic++ to get the next node to process).
That might use 0.0000001% more CPU power than a pre-sorting algorithm, but… who cares.
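One possible reading of this scheme, sketched in standard C++: each thread claims the next index of the dependency-sorted list with an atomic increment, waits until that node's inputs are done, then processes it. How the waiting is actually done in the real implementation is not stated in this thread, so the spin-and-yield below is an assumption, and SortedNode and renderSorted are hypothetical names.

```cpp
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

struct SortedNode
{
    std::function<void()> process;      // stands in for processBlock()
    std::vector<std::size_t> inputs;    // nodes that must finish first; all have smaller indices
};

void renderSorted (const std::vector<SortedNode>& nodes, unsigned numExtraThreads)
{
    const std::size_t n = nodes.size();
    std::vector<std::atomic<bool>> done (n);
    for (auto& flag : done) flag.store (false);
    std::atomic<std::size_t> next { 0 };

    auto worker = [&]
    {
        for (;;)
        {
            const std::size_t index = next.fetch_add (1);   // the "atomic++"
            if (index >= n) return;

            // Assumed waiting strategy: poll until every input has been processed.
            for (auto in : nodes[index].inputs)
                while (! done[in].load (std::memory_order_acquire))
                    std::this_thread::yield();

            nodes[index].process();
            done[index].store (true, std::memory_order_release);
        }
    };

    std::vector<std::thread> threads;
    for (unsigned i = 0; i < numExtraThreads; ++i)
        threads.emplace_back (worker);
    worker();                           // the callback thread takes part as well
    for (auto& t : threads) t.join();
}
```

Since indices are claimed in sorted order, the lowest unfinished node always has all of its inputs finished, so some thread can always make progress and the polling cannot deadlock on an acyclic graph.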

TheVinn · December 20, 2012, 6:33pm

Well, what I mean is that it doesn’t matter to the algorithm if one plugin consumes much more time than another; the order of processing is still the same.

chkn · December 20, 2012, 6:42pm

Yes and no. The word “order” implies that there is a serial structure. So it’s true for the serial part, but if you run several serial parts in parallel, the “order” of processed nodes is unpredictable.
