AudioProcessorGraphMulticore thoughts


#1

There are already some threads on this topic, e.g. http://www.rawmaterialsoftware.com/viewtopic.php?f=8&t=7214&hilit=multicore

So we have a Graph like this:

[attachment=2]mc1.png[/attachment]

We have to find the global order in which the processors can run. An algorithm finds all possible paths from “out” to “in”, counts every processor along each path, and marks each processor with that number. Higher numbers overwrite lower numbers.
I think something like this is already implemented in AudioProcessorGraph.

So we have something like this:
[attachment=1]mc2.png[/attachment]
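
The numbering step described above can be sketched roughly as follows. This is only a minimal illustration, not how AudioProcessorGraph does it: the `Inputs` map, `markDepth` and `computeDepths` are hypothetical names, and the graph is assumed to be free of feedback loops.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical graph layout: for each node, the nodes that feed into it.
using Inputs = std::map<std::string, std::vector<std::string>>;

// Walk upstream from `node`, counting processors along the path.
// A higher number overwrites a lower one; feedback loops are assumed absent.
void markDepth (const Inputs& inputs, const std::string& node, int depth,
                std::map<std::string, int>& depthOf)
{
    auto found = depthOf.find (node);
    if (found != depthOf.end() && found->second >= depth)
        return; // an equal or longer path has already reached this node

    depthOf[node] = depth;

    auto it = inputs.find (node);
    if (it == inputs.end())
        return; // nothing upstream of this node

    for (const auto& source : it->second)
        markDepth (inputs, source, depth + 1, depthOf);
}

// Returns, for every node reachable from the output, its longest distance to the output.
std::map<std::string, int> computeDepths (const Inputs& inputs, const std::string& outputNode)
{
    std::map<std::string, int> depthOf;
    markDepth (inputs, outputNode, 0, depthOf);
    return depthOf;
}
```

With the numbers counted from “out”, processors with the highest number have to run first, and processors sharing a number do not depend on each other.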

Of course, feedback loops are not permitted, for simplicity. For plugin delay compensation, the algorithm also has to insert compensation processors which add latency.

Now we can make a list of which operations each operation requires.
From that we can calculate how many requirements an operation has, and which followers it has.

[attachment=0]mc3.png[/attachment]

Whenever an operation is done, it decreases the “Required Counter” of its followers.
If the “Required Counter” of an operation goes to 0, that operation is started on a thread from a thread pool (or added to a FIFO list if all threads are busy at the moment).

Example:
Operation IN has a counter of 0, so it begins.
After IN is done, it decreases the counters of A, B and C.
The Required Counters of A, B and C go to zero, so 3 new operations begin, and so on.
Once a “Required Counter” has gone to 0 and the operation is done, the counter is automatically reset to its initial number.
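
Here is a rough sketch of the counter scheme in plain standard C++, with no JUCE classes. `Operation` and `runGraphOnce` are hypothetical names, and a mutex-protected FIFO stands in for the thread pool. A real audio engine would want something lock-free and would keep its threads alive between blocks.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical node type: one processing step plus its dependency bookkeeping.
struct Operation
{
    std::function<void()> process;  // stands in for a processor's processBlock()
    std::vector<int> followers;     // indices of operations that depend on this one
    int required = 0;               // the initial "Required Counter"
};

// Runs every operation once, respecting dependencies. The graph must have no feedback loops.
void runGraphOnce (const std::vector<Operation>& ops, unsigned numThreads)
{
    assert (numThreads > 0);

    // Working counters are rebuilt from `required` on each call,
    // which plays the role of the automatic reset.
    std::vector<int> remaining (ops.size());
    std::deque<int> ready;          // FIFO of operations whose counter reached 0
    std::mutex mutex;
    std::condition_variable wake;
    std::size_t finished = 0;

    for (std::size_t i = 0; i < ops.size(); ++i)
    {
        remaining[i] = ops[i].required;
        if (remaining[i] == 0)
            ready.push_back (static_cast<int> (i));
    }

    auto worker = [&]
    {
        for (;;)
        {
            int index = -1;
            {
                std::unique_lock<std::mutex> lock (mutex);
                wake.wait (lock, [&] { return ! ready.empty() || finished == ops.size(); });
                if (ready.empty())
                    return;         // every operation has been processed
                index = ready.front();
                ready.pop_front();
            }

            ops[static_cast<std::size_t> (index)].process();

            // Decrease the followers' counters; start those that reach 0.
            std::lock_guard<std::mutex> lock (mutex);
            for (int follower : ops[static_cast<std::size_t> (index)].followers)
                if (--remaining[static_cast<std::size_t> (follower)] == 0)
                    ready.push_back (follower);
            ++finished;
            wake.notify_all();
        }
    };

    std::vector<std::thread> threads;
    for (unsigned t = 0; t < numThreads; ++t)
        threads.emplace_back (worker);
    for (auto& thread : threads)
        thread.join();
}
```

For the example above, IN would have `required = 0` and followers A, B and C, each of which has `required = 1`.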

Yes, it will be problematic if we want to work with live audio input, so we have to either
- add one buffer of extra latency (like a UAD card), which gives the threads enough time slices between the audio callbacks
or
- calculate all paths with live audio input directly in the audio callback (but then we have no multicore support)

Feel free to comment!


#2

[quote=“chkn”]Yes, it will be problematic if we want to work with live audio input, so we have to either
- add one buffer of extra latency (like a UAD card), which gives the threads enough time slices between the audio callbacks
or
- calculate all paths with live audio input directly in the audio callback (but then we have no multicore support)[/quote]

It sounds like you might be wanting to create an engine out of a graph. I was initially looking to the graph to meet these types of needs but I now think it is the wrong approach. If you have a specific topology for your system then I’d suggest creating the structures for that and process accordingly (possibly using smaller graphs within processing pipelines if desired). This way you have full control over what you want to process and how and when you want to process it. You also get the added benefit of having a legible system to analyze and profile at run time. Flexible, sensible and simple.

If your intentions are purely to optimize the graph then it’d be interesting to see what you come up with and your uses for it. But if your intentions are to use it as the basis for all your audio processing needs then I’d suggest considering the advice above.

Just my 2 cents of course! :slight_smile:


#3

I’ve been thinking about this some more and I’m wondering if using a high level graph to partition out your processing sections wouldn’t do the trick. You would wrap each section in a processor and decide in there whether you want that data processed on another thread or not. This would allow you to easily tie in the threaded stuff with live callback stuff.

This of course is still assuming you have predetermined processing lines. If the configuration is purely up to the user then I can only gather you’d have to work the threading into the graph code itself.


#4

I might be picking your algorithm up wrong but it sounds similar to this paper: http://tim.klingt.org/publications/tim_blechmann_supernova.pdf

It’s a multi-threaded processing graph as well. I think the main thing is to do a topological sort to find the dependencies between nodes, maybe some graph colouring to find the minimum amount of buffers needed and then figure out how you’re going to manage the threads. Probably a custom memory manager for the audio buffers would be a good idea as well to improve cache usage.
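
As a rough illustration of the topological sort step only (no buffer colouring or thread management), here is a minimal sketch using Kahn's algorithm. The adjacency-list layout and the name `topologicalOrder` are assumptions for this example and are not taken from the paper or from JUCE.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// edges[n] lists the nodes fed by node n (hypothetical adjacency-list layout).
std::vector<int> topologicalOrder (const std::vector<std::vector<int>>& edges)
{
    // Count how many inputs each node is still waiting for.
    std::vector<int> pending (edges.size(), 0);
    for (const auto& targets : edges)
        for (int target : targets)
            ++pending[static_cast<std::size_t> (target)];

    // Nodes without inputs can be processed straight away.
    std::queue<int> ready;
    for (std::size_t n = 0; n < edges.size(); ++n)
        if (pending[n] == 0)
            ready.push (static_cast<int> (n));

    std::vector<int> order;
    while (! ready.empty())
    {
        const int n = ready.front();
        ready.pop();
        order.push_back (n);

        // Each processed node releases the nodes it feeds.
        for (int target : edges[static_cast<std::size_t> (n)])
            if (--pending[static_cast<std::size_t> (target)] == 0)
                ready.push (target);
    }

    // A result shorter than edges.size() means the graph contains a feedback loop.
    return order;
}
```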

It seems like latency is going to be a problem though. Also I’m sure there are lots of node configurations which will end up being slower when run in parallel. I’m interested in helping but I have a lot to learn I think. It seems like a difficult task :smiley:


#5

Yes, that’s what I want. I use AudioProcessorGraph for hosting plugins and I want to use multiple cores.

Yes, of course you always have overhead, but if you have, for example, two processors which are too heavy to both run in one audio callback (one after another), the first one can be processed directly in the audio callback and the second one in a separate thread.

Yes, but not impossible (most DAWs use multiple cores these days).


#6

Any more news for this chkn?


#7

Hm, I might as well ask in here, since I’m tackling something similar right now.

I intend to use ThreadPoolJob objects in a ThreadPool inside an outer AudioProcessor’s processBlock() method to start several worker threads for parallel execution of audio processing chains inside the processor.
Now, my question is about the behaviour of the ThreadPool class implementation, since the documentation is a bit ambiguous about it.
I’d like to keep the actual ThreadPoolJob instances around and reuse them in the following processBlock() calls, since they shouldn’t change very often. However, in one place the ThreadPool documentation says the ThreadPool will delete finished jobs, while at other places on the same page it says they’re removed.

Can someone please elaborate on the exact behaviour of the ThreadPool?
Will it delete the passed instances, or will it simply remove them from its internal job list and leave the actual job instance untouched?
I would test this myself, but unfortunately I currently lack the time, and I’d like to have this issue resolved so I can plan ahead better on the architecture of my multithreading audio experiment. So, does anyone know the answer to this? Thanks in advance!


#8

Chkn, your approach seems fine!
Maybe Greame’s idea of having “graphs of graphs” and thread pools (which, I believe, would minimize overhead) would make it simple enough to implement without having to modify too much of JUCE’s code, don’t you think?


#9

Any news here ? First, does anybody need this feature actually ? :smiley:


#10

Ok, although it seems that it’s not a popular feature request, I wanted to give it a try. Maybe I can get a few people interested ? Or even better, a couple of multithreading gurus :slight_smile:

Here we go :

Prerequisites :
[list]
[*]Create a class ParallelAudioProcessor which both inherits from and wraps an AudioProcessor. It also inherits from ThreadPoolJob.[/*]
[*]Create a thread pool with a sensible size.[/*]
[*]The AudioProcessorGraph will contain only this kind of node (possible because they inherit from AudioProcessor).[/*]
[*]For each of those, you have a way to find its parent graph nodes.[/*]
[*]You keep pointers to the references passed to processBlock; let’s call them AudioSampleBuffer* mInputAudioBufferPtr and MidiBuffer* mInputMidiBufferPtr.[/*]
[*]Check that the user’s processor has at least 2 cores, of course :slight_smile: (otherwise, fall back to the single-threaded solution).[/*][/list]

Algorithm in C++/pseudo-code

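[code]// Called by the graph, in the graph's normal processing order.
void ParallelAudioProcessor::processBlock (AudioSampleBuffer& buffer, MidiBuffer& midiMessages)
{
    // Pseudo-code: wait until every parent node has finished its job.
    for each parent :
        mThreadPoolRef.waitForJobToFinish (parent, sensibleTimeoutMs);

    // Remember the buffers so that runJob() can process them on a pool thread.
    mInputAudioBufferPtr = &buffer;
    mInputMidiBufferPtr = &midiMessages;

    // false: the pool should not delete this object when the job has finished.
    mThreadPoolRef.addJob (this, false);
}

// Called by the thread pool on one of its worker threads.
ThreadPoolJob::JobStatus ParallelAudioProcessor::runJob()
{
    mWrappedProcessor.processBlock (*mInputAudioBufferPtr, *mInputMidiBufferPtr);
    return ThreadPoolJob::jobHasFinished;
}
[/code]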
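
To try the wait-for-parents idea outside of JUCE, here is a small standalone sketch. It is an assumption-laden stand-in, not the design above: `ParallelNode` is a hypothetical class, `std::async` replaces the ThreadPool (so a thread may be created per call, which a real audio engine would avoid), and `process()` is assumed to be called for all nodes from a single thread in graph order.

```cpp
#include <cassert>
#include <functional>
#include <future>
#include <utility>
#include <vector>

// Hypothetical stand-in for the wrapper node; not a JUCE class.
class ParallelNode
{
public:
    explicit ParallelNode (std::function<void()> work) : wrapped (std::move (work)) {}

    void addParent (ParallelNode& parent) { parents.push_back (&parent); }

    // Called in graph order from one thread, like processBlock().
    void process()
    {
        for (auto* parent : parents)
            parent->wait();     // block until the upstream results exist

        // Run the wrapped work on another thread and return immediately.
        running = std::async (std::launch::async, wrapped);
    }

    // Blocks until this node's current job, if any, has finished.
    void wait()
    {
        if (running.valid())
            running.get();
    }

private:
    std::function<void()> wrapped;
    std::vector<ParallelNode*> parents;
    std::future<void> running;
};
```

Two nodes without a common parent overlap in time, while a node with parents only starts once they are done. The caller still has to `wait()` on the last node before using the output, which is where the latency question from earlier in the thread comes back.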

Now, this code is probably bogus and full of race conditions. I’m writing this late, and my point is just to show how this would work globally.
However, if it works, this approach would have many advantages:
[list]
[*]It seems it would be the simplest solution that works for people who have many-core computers (the latest Intel Core CPUs have 6 cores / 12 virtual cores!)[/*]
[*]We aren’t touching the JUCE codebase at all.[/*]
[*]I was worried about thread overhead, but it seems to be negligible, except for thread creation/deletion, which doesn’t matter when using a ThreadPool.[/*]
[*]There’s no lock on data. The only parallel processing would work on independent data.[/*][/list]

The only problem is that I have absolutely no idea if it would work. It seems too simple to be true, TBH. I’m far from being a multithreading guru…
Besides, I didn’t dive into the AudioProcessorGraph code to see if it would cope with such a design.

Next step : try it in real life :slight_smile:

As usual, all comments welcome.