GPU vs. CPU processing - what is the future for audio processing? Can we access the GPU now? If so, how?

I have programmed audio synthesisers for fun for 5-10 years now. For most synths the CPU obviously works fine; nobody is doing processing complex enough to stress a modern CPU.

However, once you enter the realm of engineering and physics, modeling real-life acoustic phenomena with the best real-world models of actual instruments, a modern CPU doesn’t actually get you very far anymore.

We are also up against the death of Moore’s Law (at least for single-core performance): a single core today is not much faster than one from five years ago. If the old trends had held we would have 10 GHz processors by now, and maybe I would be happy with that. However, that has not occurred.

I was watching this video: https://www.youtube.com/watch?v=Y8Ym7hMR100

And he shows GPU vs. CPU compute speeds:

He explains that increasing per-core CPU speed would require packing in ever more transistors, which is no longer feasible, while with GPUs we can simply add more chips.

I don’t understand how GPUs work. In layperson’s terms, what does this mean, specifically for audio? You can have a 10- or 20-core CPU, but you can only practically use one core per audio plugin, because multithreading a single plugin is basically impossible (audio is serial: each sample is typically calculated from the previous one).

Can a very advanced GPU serve as a “single core” and put all its power into one serial processing thread? i.e. would “adding more chips” to the GPU, as he suggests, benefit the serial audio plugin workflow in a way that more CPU cores don’t?

In theory, could we then add 3-6 GPUs to a system and run 3-6 expensive synths on them, somewhat like how Universal Audio used to (or still does) run plugins on its DSP cards?

I am not questioning whether this is practical or necessary for the average use case, but whether it is even theoretically possible and whether you think we will see things going that route to any extent.

I see this page talking about GPU audio and they seem to suggest what I’m saying may be the case:

Does JUCE allow this, and if so, how? Or will it? How does programming for the GPU differ from programming for the CPU? How do you designate work to run there? Is this compatible with VST? How does the DAW know what to do?

Thanks for any thoughts.

Addendum:
I found this nice article from NVIDIA that explains GPU vs. CPU processing a bit:

They explain that a GPU essentially “can do thousands of operations at once” and that is what makes it more efficient for things like media rendering.

Presumably, then, we can think of a GPU as a massively multicore CPU with good thread synchronization?

i.e. hypothetically, if you are calculating a model that evaluates a formula at 100-1000 points per timestep (one per point of the model, with the points independent of one another within a timestep, much like a graphics program rendering each pixel of the screen separately), then a GPU is ideal for this.

I believe in theory a GPU could calculate each sample by using its many cores, then re-synchronize for the next sample’s calculation (based on the previous finished result) in a way that a multicore CPU can’t.

Is this right?

GPUs have many cores but they will all need to run the exact same program. You know how CPUs have SIMD that lets you do operations on 4 floats at a time? GPUs are also SIMD but can work on hundreds or thousands of data items (such as pixels) at a time. This is useful if you have a problem that can be described in this manner: lots of elements (say particles) that all need to be processed using the same sequence of steps.

For example, I’ve been musing about emulating a plate reverb by using the GPU to calculate the 2D wave equation. Essentially you have a large grid of points and each point moves up or down based on the positions/speeds of neighboring points. On the CPU it’s prohibitively expensive to calculate this for 1000s of points for every sample timestep, but a GPU can do these 1000s of points in parallel.
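
For what it’s worth, here is a minimal sketch of what one timestep of that might look like as a CUDA kernel. The buffer names, the clamped boundary and the plain leapfrog update are my assumptions for illustration, not a full plate model:

// Minimal sketch: one timestep of the 2D wave equation on an nx-by-ny grid.
// Each thread updates a single grid point from the previous two timesteps.
// 'courant2' is (c*dt/dx)^2; the outer edge is simply held at zero here.
__global__ void waveStep2D (const float* uPrev, const float* uCurr, float* uNext,
                            int nx, int ny, float courant2)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x <= 0 || y <= 0 || x >= nx - 1 || y >= ny - 1)
        return;                                 // fixed (clamped) boundary

    int i = y * nx + x;
    float laplacian = uCurr[i - 1] + uCurr[i + 1]
                    + uCurr[i - nx] + uCurr[i + nx]
                    - 4.0f * uCurr[i];

    // leapfrog update: u(t+1) = 2*u(t) - u(t-1) + c^2 * laplacian
    uNext[i] = 2.0f * uCurr[i] - uPrev[i] + courant2 * laplacian;
}

This would be launched once per sample timestep (e.g. with 16x16 thread blocks and enough blocks to cover the grid), and the three buffers are then rotated for the next sample.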

1 Like

If you need some examples of how to use the GPU in JUCE plugins, you can look here:

This is exactly what I’m talking about: things like the 1D, 2D, or 3D wave equation.

In finite difference modeling (e.g. of the wave equation), you are basically performing this operation over and over (with more dimensions in the 2D or 3D case):

//push back the arrays of data points (positions) that are being recalculated each sample
fdmU_3 = fdmU_2;
fdmU_2 = fdmU_1;
fdmU_1 = fdmU;

//calculate the new position of each point in the simulation from the prior positions
for (size_t i = 0; i < fdmU.size(); ++i)
{
    fdmU[i] = ...; //multiply/add/divide based only on the prior samples' solved data,
                   //i.e. utilizing fdmU_1[i-1], fdmU_1[i+1], fdmU_1[i], fdmU_2[i], fdmU_3[i], etc.
}

However, you cannot split that across multiple CPU cores because each sample must be finished before the next one. There is no practical multithreading option.

CPU SIMD only goes so far. (I actually tested manual SIMD, but it was less efficient; people here theorized that the compiler was already auto-vectorizing my for loop, and my hand-written SIMD inside the loop was losing those optimizations.)

So if, as you describe, a GPU can run “mega SIMD” while we still effectively have a “single-threaded operation” (i.e. each timestep still finishes before the next begins), that is ideal.

Hypothetically, if we could do SIMD with 100 multiplications at once (for example’s sake), we might be able to solve each timestep in just a few operations.

So this does seem to be the solution I’m looking for, exactly as you describe for the plate reverb. A 2D wave equation at audio rates will crush a CPU (speaking from experience with wave equations).

But maybe it wouldn’t crush a GPU? This is exciting. I essentially quit working on my physical modeling plugins because I hit a processing-capacity wall on the CPU. If the GPU can open that up, it would be very fun.

Have you looked into, or do you know, how one might program something like this to run on the GPU? Would the core pseudo-code I wrote remain intact, with the basic for loop of multiplies/divides/adds/subtracts replaced by some “GPU mega SIMD” operation? What does that look like in code? Any basics on how to do this?

Thanks for any thoughts. We have been thinking about similar concepts it seems. :slight_smile:

So far my only experience with writing compute shaders has been with Metal (and a little bit of CUDA), and I don’t know if any cross-platform methods exist (except OpenCL, but on macOS that has been deprecated in favor of Metal).

But essentially you write a little program that takes a piece of data (whatever that is) as input, performs computations on it, and writes a piece of data as output. Then you schedule this program to run on a grid of data objects, where the size of the grid determines how many threads will be running.
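
For the finite-difference example earlier in the thread, that might look roughly like this in CUDA. This is only a sketch: the kernel body is left as a placeholder and all names here are made up.

#include <cuda_runtime.h>

// Hypothetical 1D finite-difference kernel: each thread solves one grid point,
// mirroring the body of the fdmU for-loop above.
__global__ void fdmStep1D (const float* u1, const float* u2, const float* u3,
                           float* u, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i <= 0 || i >= n - 1)
        return;

    // same multiply/add/divide as the CPU loop, using only
    // u1[i-1], u1[i+1], u1[i], u2[i], u3[i], etc. (placeholder value here)
    u[i] = 0.0f;
}

// Host side: the whole timestep is one kernel launch over the grid, and the
// history is "pushed back" by rotating device pointers -- no data is copied.
void renderOneTimestep (float*& d_u, float*& d_u1, float*& d_u2, float*& d_u3, int n)
{
    const int threadsPerBlock = 256;
    const int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;

    fdmStep1D<<<numBlocks, threadsPerBlock>>> (d_u1, d_u2, d_u3, d_u, n);

    float* oldest = d_u3;
    d_u3 = d_u2;
    d_u2 = d_u1;
    d_u1 = d_u;
    d_u  = oldest;
}

The simulation state stays resident on the GPU between calls; after enough timesteps to fill an audio buffer you copy just the listening point(s) back to the host. Note that each kernel launch has a fixed overhead, which is exactly why the scheduling and latency issues below matter.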

There is (was?) a great course on Udacity that explains the basics of this kind of massively parallel programming. Playlist on YouTube: https://www.youtube.com/playlist?list=PLAwxTw4SYaPm0z11jGTXRF7RuEEAgsIwH

The downside of using the GPU in real-time audio is that it takes time to schedule a job to run on the GPU, including copying data over if necessary, so you’ll have to run this from a background thread. The audio thread will communicate with this background thread through queues. This means it adds a certain amount of latency to the audio processing.
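
Very roughly, that audio-thread/background-thread split could be sketched like this. The FIFO below is a minimal single-producer/single-consumer ring buffer and all names are made up for illustration (a real plugin might use juce::AbstractFifo or similar instead):

#include <algorithm>
#include <atomic>
#include <chrono>
#include <thread>

struct BlockQueue
{
    static constexpr int capacity = 8, blockSize = 512;
    float blocks[capacity][blockSize] = {};
    std::atomic<int> writeIndex { 0 }, readIndex { 0 };

    bool push (const float* in)                       // called by the producer only
    {
        const int w = writeIndex.load(), next = (w + 1) % capacity;
        if (next == readIndex.load())
            return false;                             // full -- caller must cope
        std::copy (in, in + blockSize, blocks[w]);
        writeIndex.store (next);
        return true;
    }

    bool pop (float* out)                             // called by the consumer only
    {
        const int r = readIndex.load();
        if (r == writeIndex.load())
            return false;                             // nothing ready yet
        std::copy (blocks[r], blocks[r] + blockSize, out);
        readIndex.store ((r + 1) % capacity);
        return true;
    }
};

// Background thread: pull input blocks, run the GPU job, push results into a
// second queue that the audio callback drains (hence the added latency).
void gpuWorker (BlockQueue& toGpu, BlockQueue& fromGpu, std::atomic<bool>& running)
{
    float scratch[BlockQueue::blockSize];
    while (running.load())
    {
        if (toGpu.pop (scratch))
        {
            // ... upload 'scratch', dispatch the compute kernel, wait, read back ...
            fromGpu.push (scratch);
        }
        else
            std::this_thread::sleep_for (std::chrono::microseconds (100));
    }
}

The audio callback pushes its input into toGpu, pulls whatever is ready from fromGpu, and outputs silence (or the dry signal) until results start arriving, which is where the extra latency comes from.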

AFAIK the GPU can accelerate vector & matrix operations. However, if you have to calculate a[i] from a[i-1] and a[i-2] in a cascading, sample-by-sample fashion (e.g. a TF-II IIR filter), I am afraid the GPU won’t help you.

1 Like

I’d start by learning NVIDIA CUDA on a Windows PC and keep real-time audio out of it for now. Once you have some concepts working and showing benefits over the CPU, it’s not really that hard to get a real-time CUDA plug-in working with JUCE (run the audio and GPU asynchronously with some double buffering) to see if the idea has legs. You may find CPU usage is a little high at low buffer sizes, as you need to block on the GPU finishing, and getting multiple instances working well can be a challenge, since each instance may start to fight for the GPU, especially at low latencies. When you’ve done all that, if you want to go further, maybe get in touch with GPU Audio: they believe they have solved those problems, they are targeting a multi-platform solution (which will make Apple Silicon support in particular far easier), and their fundamentals aren’t a million miles from CUDA conceptually, judging by what was shown at ADC 2022, so you could probably port over to their solution without a full conceptual redesign.
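
As a sketch of the “asynchronous with some double buffering” idea: while the audio callback outputs the result of block N-1, block N is already in flight on the GPU, giving one block of added latency. All names, the fixed block size and the pass-through kernel here are placeholders of mine, not a recommended design:

#include <algorithm>
#include <cuda_runtime.h>

// Placeholder DSP kernel -- stands in for whatever the plugin actually computes.
__global__ void myKernel (const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

struct GpuDoubleBuffer
{
    static constexpr int blockSize = 512;

    float*       d_in[2]  = {};               // device input buffers
    float*       d_out[2] = {};               // device output buffers
    float        h_out[2][blockSize] = {};    // host copies (ideally pinned)
    cudaStream_t stream   = nullptr;
    int          current  = 0;

    void prepare()
    {
        cudaStreamCreate (&stream);
        for (int i = 0; i < 2; ++i)
        {
            cudaMalloc (&d_in[i],  blockSize * sizeof (float));
            cudaMalloc (&d_out[i], blockSize * sizeof (float));
        }
    }

    // Called once per audio block: outputs the *previous* block's result.
    void process (const float* input, float* output)
    {
        const int previous = 1 - current;

        // block until the previously launched work has finished -- this is the
        // "blocking on the GPU" cost mentioned above
        cudaStreamSynchronize (stream);
        std::copy (h_out[previous], h_out[previous] + blockSize, output);

        // queue up the current block asynchronously
        cudaMemcpyAsync (d_in[current], input, blockSize * sizeof (float),
                         cudaMemcpyHostToDevice, stream);
        myKernel<<<1, blockSize, 0, stream>>> (d_in[current], d_out[current], blockSize);
        cudaMemcpyAsync (h_out[current], d_out[current], blockSize * sizeof (float),
                         cudaMemcpyDeviceToHost, stream);

        current = previous;
    }
};

A real plugin would also report the extra blockSize samples of latency to the host (e.g. via AudioProcessor::setLatencySamples) and would have to handle host buffer sizes that don’t match.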

1 Like

I’ve found a way to use the PyTorch front-end interface to apply a filter and a reverb in the process callback. I need to do some latency and process time experiments with Perfetto, but I’ll keep you updated.

I’ve been maintaining a C++ library for differentiable audio processing using PyTorch C++, with loads of audio effect implementations that are faithful to our internal stock effect range. I’m currently working on a convolution reverb as well, which I think would be a great applied use case for the library.

Thanks. I have done a TINY bit of work with shaders through some game development in Unity, but I can’t casually write one at this point. Since this is for my own use, I technically don’t need cross-platform support, just Windows.

So I would presumably need to write my synthesiser/plugin math in a shader language such as GLSL (for OpenGL) or HLSL (for DirectX).

I am trying to think this through. I am not sure how threading works in C++ because I have never needed it.

So practically speaking, I can either:

  1. Send each individual timestep to the GPU and wait for it to come back (likely not going to work).
  2. Let the shader (if it can hold some “internal state”) store my data, i.e. the arrays fdmU, fdmU_1, fdmU_2 from the example above, and send and receive buffers of samples to be filled by the synthesis/processing. The true “state” would then live on the GPU.

It sounds like #2 is how it would have to work?

I see some basic talk about how to do this with 1D textures here: opengl - Passing a list of values to fragment shader - Stack Overflow

There is a texelFetch function that I apparently need to learn more about:

https://docs.gl/sl4/texelFetch

So in JUCE, would it be practical to just override the rendering function in your SynthesiserVoice-derived class, like this:

void renderNextBlock (AudioBuffer<float>& outputBuffer, int startSample, int numSamples) override
{
    //send one buffer at a time to the shader and copy the returned buffer into outputBuffer
}

In C# I could do this part with await Task.Run(GPUFunction), but I am not sure what the equivalent is in C++ (std::async? a worker thread?). Any thoughts on how to handle that? Perhaps that is putting the cart before the horse since I don’t know how to write the shaders yet, but any thoughts or pseudocode to give me a concept of how to do it would still be appreciated.
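
C++ does have comparable facilities in the standard library: std::async and std::future (since C++11), or a plain std::thread. Here is a rough analogue of the C# snippet, with GPUFunction as a hypothetical stand-in; note that std::async may allocate and future::get() blocks, so this is a proof-of-concept pattern rather than something to call directly on the audio thread (for that, see the background-thread-plus-queue approach above):

#include <future>
#include <utility>
#include <vector>

// Hypothetical stand-in for the routine that would dispatch work to the GPU
// and return the processed samples.
std::vector<float> GPUFunction (std::vector<float> input)
{
    // ... upload, run the shader/kernel, read back ...
    return input;                                    // placeholder: pass-through
}

void processOneBlock (std::vector<float> input)
{
    // Rough C++ analogue of 'await Task.Run(GPUFunction)'.
    auto pending = std::async (std::launch::async, GPUFunction, std::move (input));

    // ... do other work while the GPU job runs ...

    std::vector<float> result = pending.get();       // blocks until finished
}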

IIR filters can be represented in matrix form, which allows for the computation of multiple samples per channel using SIMD instructions. While parallelizing recursive algorithms can be difficult, it’s definitely possible.

GPUs, for instance, can process numerous IIR filters simultaneously. If you have 1000s of second order sections, instead of processing them in series, you could process them in parallel, preserving both the magnitude and phase response. Generally, converting a series SOS form to a parallel SOS form is done through partial fraction expansion. However, this approach, when applied to many sections, can introduce numerical errors and decrease dynamic range.

A paper I came across describes a method to convert series SOS to a delayed parallel SOS form using a least squares fit. This technique combines an FIR part with a delayed IIR component to reduce numerical errors and improve dynamic range. Theoretically, you could apply this method to process 1000s of second order sections in parallel on a GPU. The paper also includes benchmarks showing that this GPU-based approach can significantly outperform traditional CPU-based implementations.
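
Just to make the layout concrete (this is not the paper’s delayed-parallel least-squares method, only a bare sketch of the parallel-SOS idea; the coefficient arrays and names are my assumptions): each GPU thread runs one second-order section over the block, and a separate reduction then sums the per-section outputs into the final signal.

// One thread per second-order section: each thread filters the whole input
// block through its own biquad (transposed direct form II) and writes its own
// output row; a reduction (not shown) then sums the rows into the final output.
__global__ void parallelSOS (const float* input, float* sectionOut,
                             const float* b0, const float* b1, const float* b2,
                             const float* a1, const float* a2,
                             float* z1, float* z2,
                             int numSections, int numSamples)
{
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s >= numSections)
        return;

    float s1 = z1[s], s2 = z2[s];              // per-section filter state
    for (int n = 0; n < numSamples; ++n)
    {
        float x = input[n];
        float y = b0[s] * x + s1;
        s1 = b1[s] * x - a1[s] * y + s2;
        s2 = b2[s] * x - a2[s] * y;
        sectionOut[s * numSamples + n] = y;    // this section's contribution
    }
    z1[s] = s1;  z2[s] = s2;                   // carry state into the next block
}

Each thread’s inner loop is still serial, because the recursion within one section can’t be avoided, but thousands of sections advance side by side, which is exactly what the parallel form buys you.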

Here is the paper I’m referring to:

5 Likes

Evan gave an ADC talk last year about Anukari. Note the use of an Explicit Euler integrator, which is trivial to parallelize.
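
To see why that parallelizes so well, here is a minimal sketch (not Anukari’s actual code; the 1D mass-spring chain, the parameter names and the fixed endpoints are assumptions of mine): with explicit Euler, every element’s update reads only the previous state and writes the next one, so each mass can be integrated by its own thread with no dependencies inside the step.

// One explicit Euler step for a 1D chain of masses and springs.
// All threads read the old state (x, v) and write the new state (xNext, vNext).
__global__ void eulerStep (const float* x, const float* v,
                           float* xNext, float* vNext,
                           int n, float k, float damping, float mass, float dt)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i <= 0 || i >= n - 1)
        return;                                   // endpoints held fixed

    float force = k * (x[i - 1] - x[i])
                + k * (x[i + 1] - x[i])
                - damping * v[i];

    xNext[i] = x[i] + dt * v[i];                  // position from old velocity
    vNext[i] = v[i] + dt * force / mass;          // velocity from old positions
}

The host (or a trivial follow-up kernel) then swaps the old and new buffers before the next step.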

3 Likes

Are there some numbers on complexity, latency and operation time?

When do you really choose one over the other? I’d assume that some plots would help us find the “sweet spot.”

When performing computations on a GPU, some latency is expected. In the paper, the data-transfer times are factored into the results plotted in Figure 4, and even with just 128 second-order sections the GPU maintains a performance advantage over the CPU.

The paper doesn’t explicitly identify a ‘sweet spot,’ as this may vary with different hardware configurations. Instead, it focuses on the maximum number of filters that can be processed simultaneously in real time at a specific buffer size on the system they used for testing.

1 Like

Based on the linked thread here: GPU processing with opengl and compute shaders - #10 by oli1

I am looking into the basics of getting started with OpenCL and CUDA programming. Here are the three best references I have found so far:

The first two pertain to the OpenCL Wrapper project, which makes OpenCL much easier to use. The third is about using the raw OpenCL API.

Does anyone know how one could use the OpenCL Wrapper in a JUCE project?

On the OpenCL Wrapper GitHub page, examples are given showing how much the project can simplify OpenCL code, which looks nice. But they only provide instructions for “compiling” the project itself. I have no idea what that produces or how to then add it to a JUCE project.

i.e. they say for Windows:

Download and install Visual Studio Community. In Visual Studio Installer, add:
    Desktop development with C++
    MSVC v142
    Windows 10 SDK
Open OpenCL-Wrapper.sln in Visual Studio Community.
Compile and run by clicking the ► Local Windows Debugger button.

Okay, but then what?

Let’s say you have created a synthesiser project with the Projucer in the usual way. In very simple, step-by-step terms, how do we add the OpenCL Wrapper project, or its build output (whatever that is), to it?

Any ideas? If we can work out a basic workflow, perhaps we could all experiment with this; it may yield interesting results and bring the technology forward. :slight_smile:

Thanks for any thoughts.

I might be wrong, but as far as I understand: in addition to thread synchronization, unless you have a decent abstraction layer such as what GPU Audio is aiming for, each GPU is different.

Meaning, they differ in memory, cores and APIs.

From my very basic knowledge and review: even assuming you have an abstraction over the GPU driver (so you don’t need to handle each GPU vendor separately), it seems that you would still need to query the device (GPU) limits and make sure you’re not exceeding them.

By the way, another abstraction layer you can review is WebGPU.

Maybe the best thing is to program it from scratch using Vulkan, to have absolute control over everything?

Vulkan gives us increased control over the render process to maximize the use of both CPU and GPU resources by running many tasks in parallel. Whereas previous generation APIs were presented as if operations ran sequentially, Vulkan is explicitly parallel and built for multithreading.

For example, the GPU and CPU can run various fragment and vertex operations of the current frame and the next frame all independently of each other. By being specific about which operations need to wait on one another and which operations do not need to wait, Vulkan can render scenes with maximum efficiency and minimal wait time.

By putting CPU and GPU cores to work in tandem with the correct coordinated timing we can keep resources from idling for longer than they need to, squeezing the most performance out of the user’s system. The key is making sure that any parallel tasks wait only when they need to, and only for as long as necessary.

Be aware that OpenCL is deprecated on macOS (just like OpenGL) in favor of Metal.

Yeah, I don’t buy that macOS or iOS will truly abandon OpenGL any time soon, but either way, Windows is good enough for my purposes.

They deprecated it in 2018 and have done nothing about it since, because if they removed OpenGL they would kill off a massive amount of the current mobile and desktop games available for Mac/iOS.

Apple is an arrogant company in my opinion but not quite that stupid.

To me it seems the best current options are OpenCL (via the OpenCL Wrapper) for cross-platform work, or CUDA if you are willing to be locked to NVIDIA. I will try adding the OpenCL Wrapper when I have a minute.

Update

I have made some progress adding the OpenCL Wrapper project to a JUCE project, but I am getting linker errors, as I am not sure what to do with the .lib file it includes.

Any ideas or solutions? I summarized my method and results so far in this new thread. Thanks.

Got the OpenCL Wrapper for JUCE working. :slight_smile:

I posted my full workflow here:

It should take anyone 30-60 minutes to copy my steps and get a similar test running.

3 Likes