Hope this is the right forum to ask:
Did you guys ever use OpenCL in order to speed up audio algorithms? (Yes, OpenCL, not OpenGL)
I mean: modern graphics cards are very good at processing a lot of data in parallel. And you can program them directly, using OpenCL. I guess that should be beneficial for certain audio algorithms (all algorithms which can be parallelized).
I am asking, because I am currently looking into OpenCL myself.
I did a lot of research on this in the past and came to the conclusion that the reason nobody’s using OpenCL/CUDA in audio plugins (or audio software in general) is that the round trip of transferring the buffer to the GPU, processing it, then sending it back often takes longer than just throwing it at a SIMD-accelerated CPU operation.
Compare this to areas where you see GPU DSP a lot, mostly image processing in video editors. There, DSP operations on the CPU for a multi-million-pixel image with four 10-bit color channels are going to be much slower than transferring the chunk of data to the GPU (slow), processing it (crazy fast) and sending it back (slow again). In this case, the transfer latency is acceptable because the processing time for such a huge data set is far shorter on the GPU than on the CPU.
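To make that trade-off a bit more concrete, here's a rough back-of-the-envelope model in Python. It just assumes the GPU path pays a fixed round-trip overhead plus a small per-sample cost, while the CPU path pays only a (larger) per-sample cost. All function names and all the numbers in the usage note are made up for illustration, they are not measurements.

```python
def cpu_time_ns(n_samples, cpu_ns_per_sample):
    """CPU cost model: no fixed overhead, just per-sample work."""
    return n_samples * cpu_ns_per_sample


def gpu_time_ns(n_samples, overhead_ns, gpu_ns_per_sample):
    """GPU cost model: fixed round-trip overhead (copy to the GPU,
    kernel launch, copy back) plus per-sample work."""
    return overhead_ns + n_samples * gpu_ns_per_sample


def break_even_samples(overhead_ns, cpu_ns_per_sample, gpu_ns_per_sample):
    """Data size above which the GPU model becomes faster than the CPU model.
    Returns None if the GPU is not faster per sample, so it never wins."""
    if gpu_ns_per_sample >= cpu_ns_per_sample:
        return None
    return overhead_ns / (cpu_ns_per_sample - gpu_ns_per_sample)
```

With hypothetical values of 100000 ns overhead, 10 ns per sample on the CPU and 1 ns per sample on the GPU, a 64-sample audio buffer costs 640 ns on the CPU versus 100064 ns on the GPU, while a 4000000-sample image costs 40 ms on the CPU versus 4.1 ms on the GPU. The break-even point in this toy model is at roughly 11000 samples.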
Another reason is the belief that for whatever reason music software has to support ancient OSes and hardware - Windows XP doesn’t support OpenCL, nor do pre-2008 Macs.
thanks for your input.
Doesn’t sound too encouraging
Most (audio) DSP operations are also inherently serial, not parallel.
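A small Python illustration of what "inherently serial" means here, using a one-pole lowpass as a typical recursive example (my choice of example, and the function names are just for illustration). Each output sample depends on the previous output, so in this straightforward form the loop can't be split across thousands of GPU threads. A plain gain, by contrast, treats every sample independently.

```python
def one_pole_lowpass(x, a):
    """Recursive filter: y[n] = a * x[n] + (1 - a) * y[n - 1].
    Every output needs the previous output, so the loop is serial."""
    y = []
    prev = 0.0
    for sample in x:
        prev = a * sample + (1.0 - a) * prev
        y.append(prev)
    return y


def apply_gain(x, gain):
    """Per-sample operation: no sample depends on any other,
    so this would map trivially onto parallel hardware."""
    return [gain * sample for sample in x]
```

For example, `one_pole_lowpass([1.0, 0.0, 0.0], 0.5)` has to compute 0.5 before it can compute 0.25, and 0.25 before 0.125.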
I absolutely agree with what jonathonracz already wrote: while the actual calculation on the GPU can be very much faster, the overhead of copying the data to the GPU, invoking your kernel, and finally copying the data back from the GPU can be very expensive.
Furthermore, as you need to do this in your audio callback, there’s quite some danger of drop-outs:
Communication with the GPU involves some locking (e.g. inside your API like OpenCL or CUDA, but also in the graphics board driver, etc.), and locking in the audio callback is bad.
On almost all currently available consumer graphics boards, only one kernel can run on the GPU at a time, i.e. your kernel might be queued for an uncertain amount of time until it gets executed.
In the last couple of years I have tried quite a lot in the field of GPU computing, and also had some failures for the reasons above:
I’ve tried to implement a realtime convolution algorithm in OpenCL, but it didn’t work for me, and my already existing CPU implementation (partitioned convolution with non-uniform block sizes) was much faster and much more stable regarding realtime safety at low audio buffer sizes like 64 samples (however, it seems that there are also successful projects like this, but I’m not sure whether this really works at 64 samples…).
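For anyone not familiar with partitioned convolution, here's a minimal Python sketch of the idea. Note that this is NOT the implementation described above: it uses uniform partitions (all the same size as the audio block) and plain time-domain convolution per partition, whereas a real implementation would use FFTs per partition and, in the non-uniform case, growing partition sizes. The class and function names are made up for this sketch, and it assumes a non-empty impulse response.

```python
from collections import deque


def direct_convolve(a, b):
    """Plain O(N*M) time-domain convolution, used as a building block."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, av in enumerate(a):
        for j, bv in enumerate(b):
            out[i + j] += av * bv
    return out


class PartitionedConvolver:
    """Uniform partitioned convolution with overlap-add (illustrative only)."""

    def __init__(self, ir, block_size):
        self.block_size = block_size
        # Zero-pad the impulse response to a multiple of block_size
        # and cut it into partitions of block_size samples each.
        padded = list(ir) + [0.0] * (-len(ir) % block_size)
        self.partitions = [padded[i:i + block_size]
                           for i in range(0, len(padded), block_size)]
        # Past input blocks, newest first; one slot per partition.
        self.history = deque(
            ([0.0] * block_size for _ in self.partitions),
            maxlen=len(self.partitions))
        # Tail of the previous result that spills into the next block.
        self.overlap = [0.0] * (block_size - 1)

    def process(self, block):
        """Consume one input block, return one output block of equal length."""
        assert len(block) == self.block_size
        self.history.appendleft(list(block))
        acc = [0.0] * (2 * self.block_size - 1)
        # Partition p is convolved with the input block from p blocks ago;
        # all of these contributions land at the current output position.
        for past_block, partition in zip(self.history, self.partitions):
            for i, value in enumerate(direct_convolve(past_block, partition)):
                acc[i] += value
        out = acc[:self.block_size]
        for i, tail in enumerate(self.overlap):
            out[i] += tail
        self.overlap = acc[self.block_size:]
        return out
```

Feeding a signal through `process()` block by block gives the same samples as convolving the whole signal with the impulse response at once, but each call only ever needs the current block plus the stored history, which is what makes small buffer sizes possible.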
For dose calculation in some radiotherapy software, we needed to perform a lot of 2D FFTs of sizes like 128x128. We tried cuFFT and C++ AMP FFT, but using something like Ooura or Intel MKL was much faster because of the expensive CPU-GPU/GPU-CPU transfers (however, the FFTs were only a part of this algorithm, and the rest was up to 100x faster on the GPU).
It would be interesting to put some numbers on the latency, as audio problems suitable for the GPU involve many calls per second, each with a small amount of data.
So it is really bound by the call latency rather than bandwidth.
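Some quick arithmetic in Python to show why it's the per-call latency that matters. The helper names are made up, and the 0.1 ms round-trip figure in the usage note is purely a placeholder assumption, not a measured value.

```python
def buffer_period_ms(buffer_size, sample_rate):
    """Time available to process one audio buffer before the next is due."""
    return 1000.0 * buffer_size / sample_rate


def callbacks_per_second(buffer_size, sample_rate):
    """How many GPU round trips per second a per-buffer offload would need."""
    return sample_rate / buffer_size


def bytes_per_second(sample_rate, channels, bytes_per_sample=4):
    """Raw one-way data rate, assuming 32-bit float samples by default."""
    return sample_rate * channels * bytes_per_sample


def budget_fraction(round_trip_ms, buffer_size, sample_rate):
    """Share of the buffer period eaten by a fixed per-call round trip."""
    return round_trip_ms / buffer_period_ms(buffer_size, sample_rate)
```

At 64 samples and 48 kHz that is about 1.33 ms per buffer and 750 calls per second, while one channel of float audio is only 192000 bytes per second, which is orders of magnitude below what a PCIe link can move. A hypothetical fixed round trip of 0.1 ms would already use about 7.5% of the budget on every call, before any actual processing and before any queueing or locking delays.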
I’d also be interested to know to what extent this is because of driver overhead and how much is just the limitations of the bus architecture and such (e.g. would Vulkan compute, as opposed to OpenCL, change the story at all?). Also, even if it might be problematic for a single plugin, what about an entire application where you’re mixing, resampling and post-processing dozens of tracks at once (and can therefore get some benefit from the massive parallelization)?
I’ve at least heard of a convolution reverb that uses GPU acceleration, but that’s probably a fairly special case, and I wouldn’t be surprised if it showed little advantage over a relatively powerful CPU, whatever the GPU.
I will put some numbers on the latency.
In a world where plugins use GPUs, the wild west of them all going to the GPU independently would be interesting for both stability and performance.