GPU processing with OpenGL and compute shaders

This is not intended for realtime audio, but it can be useful for computations in the editor or for precomputing large amounts of data quickly, relieving the CPU of all that work.

Compute shaders are not very different from normal shaders; they let you work with buffers, either for reading input data or for writing results. For example, to compute the DFT:

    const char computeShader[] =
    R"COMPUTE_SHADER(
        #version 430 core
        layout (local_size_x = 256) in;

        // Input: the samples to transform
        layout(std430, binding = 0) readonly buffer IN {
            float samples[];
        };

        // GLSL does not allow a struct definition inside an interface block,
        // so it is declared first and then used in the output buffer
        struct DFTResult {
            float real, imag;
        };

        // Output: one complex bin per invocation
        layout(std430, binding = 1) writeonly buffer OUT {
            DFTResult dft[];
        };

        uniform int size;

        void main() {
            uint gID = gl_GlobalInvocationID.x;
            float sumReal = 0.0, sumImag = 0.0;

            // X[k] = sum_n x[n] * (cos(2*pi*k*n/size) - i*sin(2*pi*k*n/size))
            float angle = 0.0;
            float incAngle = 6.28318530718 / size * gID;

            for (int n = 0; n < size; ++n) {
                float s = samples[n];
                sumReal += s * cos(angle);
                sumImag -= s * sin(angle);
                angle += incAngle;
            }

            dft[gID].real = sumReal;
            dft[gID].imag = sumImag;
        }
    )COMPUTE_SHADER";

First it indicates that version 4.3 of OpenGL is required. Then the local work group size is set, in this case 256 invocations per group, the buffers used are declared, and the number of samples is passed in a uniform. Each invocation computes one DFT bin and stores it at the index of the output buffer corresponding to that invocation.

The shader is created as usual, but specifying that it is a compute shader:

    computeProgram.addShader(computeShader, GL_COMPUTE_SHADER);
    computeProgram.link();

Buffers are created and used in the usual way, but indicating that they are shader storage buffers. The only requirement is that the data structure matches the one declared in the shader:

        // Matches the DFTResult struct in the shader's output buffer
        struct DFTResult {
            float real, imag;
        };

        // Allocate the output SSBO (written by the GPU, read back by the CPU)
        glBindBuffer(GL_SHADER_STORAGE_BUFFER, outSSBO);
        glBufferData(GL_SHADER_STORAGE_BUFFER, sizeof(DFTResult) * dftSize, nullptr, GL_DYNAMIC_COPY);
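The input buffer holding the samples is handled the same way. A minimal sketch, assuming the buffer names (inSSBO, outSSBO) were generated earlier with glGenBuffers and that samples points to dftSize floats:

        // Input SSBO: uploaded once with the samples to transform
        glBindBuffer(GL_SHADER_STORAGE_BUFFER, inSSBO);
        glBufferData(GL_SHADER_STORAGE_BUFFER, sizeof(float) * dftSize, samples, GL_STATIC_DRAW);
        glBindBuffer(GL_SHADER_STORAGE_BUFFER, 0);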

After updating the input buffer, activating the shader program, binding the buffers and setting the uniform, you can dispatch the computation in as many work groups as necessary; the larger the local work group size, the fewer groups are needed and the faster it goes:

            glDispatchCompute(dftSize / 256, 1, 1);         // one work group per 256 bins (dftSize must be a multiple of 256)
            glMemoryBarrier(GL_BUFFER_UPDATE_BARRIER_BIT);  // makes the writes visible to the glGetBufferSubData readback below

At this point the CPU simply sends the dispatch command and the program continues to run normally. When the buffer is read back, for example with glGetBufferSubData, the call blocks until the computation has finished, and the glMemoryBarrier issued after the dispatch ensures the shader writes are visible to that read.
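For completeness, a sketch of the host-side steps not shown above: binding the buffers to the binding points declared in the shader, setting the uniform, and reading the results back after the dispatch. It assumes computeProgram is the juce::OpenGLShaderProgram created earlier and reuses the inSSBO/outSSBO/dftSize names:

        computeProgram.use();
        computeProgram.setUniform("size", (GLint) dftSize);

        glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, inSSBO);   // binding = 0 in the shader
        glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 1, outSSBO);  // binding = 1 in the shader

        // ... glDispatchCompute() and glMemoryBarrier() as shown above ...

        // Read the results back; this call blocks until the data is ready
        std::vector<DFTResult> results(dftSize);
        glBindBuffer(GL_SHADER_STORAGE_BUFFER, outSSBO);
        glGetBufferSubData(GL_SHADER_STORAGE_BUFFER, 0, sizeof(DFTResult) * dftSize, results.data());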

To know how many processing units the system has available there is a query call that I do not remember right now. In this case, with a local group size of 256 and a buffer of 8192 samples, 8192 × 8192 = 67,108,864 operations involving sin and cos are computed in a few milliseconds.
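(For reference, those limits can be queried with the standard GL 4.3 calls:)

        GLint maxInvocations = 0, maxGroupCountX = 0, maxGroupSizeX = 0;
        glGetIntegerv(GL_MAX_COMPUTE_WORK_GROUP_INVOCATIONS, &maxInvocations); // max invocations per work group
        glGetIntegeri_v(GL_MAX_COMPUTE_WORK_GROUP_COUNT, 0, &maxGroupCountX);  // max groups on the x axis
        glGetIntegeri_v(GL_MAX_COMPUTE_WORK_GROUP_SIZE, 0, &maxGroupSizeX);    // max local size on the x axis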

More precise information on compute shaders

3 Likes

Fascinating… Is this DFT compute shader safe to call from the realtime thread?

I don’t think it is possible in any way with OpenGL, as it is not designed for that and does not give you enough control. But I have found that if you compute operations in advance, in large blocks, on the OpenGL thread, the data can then be used in real time. So the problem is really latency: all the time needed to move the data into video memory, prepare the system, execute the calls, process the computation, dispatch it, and receive the results back into system memory adds up to a minimum that can exceed the time available in real time.

1 Like

Personally I think the power of modern GPUs is way underutilized in JUCE. Every device and even the lower class ones have so much potential for massive processing. So it’s exciting to see this here.

One thing I want to mention: if anyone is planning to use this on macOS, it doesn’t work, since the latest OpenGL version supported there is 4.1 and compute shaders were introduced in 4.3. And there is no extension, unfortunately. BUT you can actually do the exact same thing in multiple other ways with almost the same performance, sometimes even faster, depending on cache use.

  1. On old GL versions, use a 1D GL_RG32F texture as the sample data (imaginary, real) and the usual fragment shader pipeline. Any kind of texture interpolation can be skipped by using texelFetch() instead of texture() (see the sketch after this list).

  2. Using uniform buffer objects as sample data and glBindBufferRange to select the processing range. Then (in shader) read from uniform data instead of texture sampling, but write the results as fragment shader output to a texture.

  3. Use vertex buffer objects with one vertex per sample, and transform feedback (and optional geometry shader), which skips most of the fragment/framebuffer pipeline part.
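A rough sketch of option 1, assuming the samples have been uploaded to a 1D GL_RG32F texture and a quad is drawn so that each output fragment corresponds to one DFT bin (rendered into a float texture). The names sampleTex and numSamples are made up for the example, and #version 410 is the macOS ceiling:

    const char dftFragmentShader[] =
    R"FRAGMENT_SHADER(
        #version 410 core

        uniform sampler1D sampleTex;   // GL_RG32F: (imaginary, real) per sample
        uniform int numSamples;

        out vec4 result;               // (real, imag) of the bin for this fragment

        void main() {
            int k = int(gl_FragCoord.x);        // one DFT bin per output pixel
            float sumReal = 0.0, sumImag = 0.0;

            for (int n = 0; n < numSamples; ++n) {
                // texelFetch avoids any filtering/interpolation
                vec2 s = texelFetch(sampleTex, n, 0).rg;   // s.x = imag, s.y = real

                float angle = -6.28318530718 * float(k) * float(n) / float(numSamples);
                float c = cos(angle);
                float si = sin(angle);

                // complex multiply: (real + i*imag) * (cos + i*sin)
                sumReal += s.y * c - s.x * si;
                sumImag += s.y * si + s.x * c;
            }

            result = vec4(sumReal, sumImag, 0.0, 1.0);
        }
    )FRAGMENT_SHADER";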

As you mentioned, the biggest problem for realtime is GL’s async nature plus the PCIe bus and readback latency. There are apparently ways to optimize this by using the correct buffer hints and choosing device-local vs client memory. The texture methods could also use pixel buffer objects for non-stalling copy/readback of the result.
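A sketch of that pixel buffer object readback idea, assuming the result has been rendered into a framebuffer-attached float texture of width dftSize and height 1 (pbo is an already-generated buffer name; all names are illustrative). glReadPixels returns quickly because it copies into the bound pack buffer rather than into client memory, and the later map only stalls if the transfer has not finished yet:

    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
    glBufferData(GL_PIXEL_PACK_BUFFER, dftSize * 4 * sizeof(float), nullptr, GL_STREAM_READ);

    // With a pack buffer bound, the last argument is an offset into the PBO, not a pointer
    glReadPixels(0, 0, dftSize, 1, GL_RGBA, GL_FLOAT, nullptr);

    // ... do other work while the copy happens ...

    auto* data = (const float*) glMapBufferRange(GL_PIXEL_PACK_BUFFER, 0,
                                                 dftSize * 4 * sizeof(float), GL_MAP_READ_BIT);
    // ... read the results ...
    glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);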

One observation I made is that, for very basic naive compute implementations, fragment shaders often outperform compute shaders, and it heavily depends on choosing the right local group and dispatch sizes to get compute shaders to be faster. From what I know they really shine on clustered 3D data, while a DFT on a single buffer doesn’t profit all that much.

One suggestion I have: in your example the old-school glBufferData is used. Have you considered glBufferStorage, introduced in 4.4? If you specify persistent mapping, you can then call glMapBufferRange and map the buffer persistently. That could improve or simplify the readback, since you get a persistent pointer to the data instead of relying on all the possibly implicit copies the GL driver does. Not sure if this does anything for latency though; some drivers do some fancy fast DMA copy stuff if you specify the correct bits.
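Roughly what that persistent mapping would look like (a sketch only; requires GL 4.4, reuses the outSSBO/DFTResult/dftSize names from the earlier example, and the flags shown are just one reasonable choice):

    const GLbitfield flags = GL_MAP_READ_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT;

    // Immutable storage: the buffer can stay mapped while the GPU writes to it
    glBindBuffer(GL_SHADER_STORAGE_BUFFER, outSSBO);
    glBufferStorage(GL_SHADER_STORAGE_BUFFER, sizeof(DFTResult) * dftSize, nullptr, flags);

    // Map once and keep the pointer for the lifetime of the buffer
    auto* mapped = (const DFTResult*) glMapBufferRange(GL_SHADER_STORAGE_BUFFER, 0,
                                                       sizeof(DFTResult) * dftSize, flags);

    // After a dispatch, a glMemoryBarrier(GL_CLIENT_MAPPED_BUFFER_BARRIER_BIT) plus a fence
    // (glFenceSync / glClientWaitSync) is still needed before reading `mapped`,
    // but no glGetBufferSubData copy is involved.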

On a side note, I always wondered if it was feasible to use a compute/vertex shader on juce::Path data. Instead of scanline-rasterizing on the CPU and then sending the vertices, it could be possible to just send the path points and perform compute dispatches to rasterize them and do the necessary anti-aliasing. It could at least reduce the vertex data copy latency on big HD-and-above images.

2 Likes

I remember using the texture method on occasion and it not being faster, although maybe my implementation wasn’t very good. What does seem to be faster is using uniform buffers instead of storage buffers, although they are limited in size (typically 64 KB).

In any case, the latency problem prevents it from being used in real time. Maybe with Vulkan, where you have more control? But it seems that this is an area still under development and not yet fully solved.

Yes, it has a performance advantage: it is able to process the DFT of 8192 samples in two channels in 0.3 ms.

1 Like

OpenGL is a very poor API for audio compute. The driver latency alone makes it essentially useless for ‘real-time’ audio; there can be 3 threads involved between you and the actual hardware.

If you truly want to try and use Compute in your audio, you should target CUDA and/or Metal.

Another big reason we don’t see DSP split across multiple threads (let alone GPUs) is that the code itself is very ‘serial’ in nature and really hard to parallelise at that granularity: it usually operates at the sample level in an incremental fashion, relying on other samples in the same frame (which is VERY expensive or impossible to look up on the GPU).

The outlier of course is something like the FFT, which can see absolutely insane speed-ups if the data sets are large enough.

I’ve done some playing around with using the GPU in JUCE/Audio, I have a compute module that supports OpenCL, Metal, and CUDA but it’s probably not worth the light of day! It’s certainly fun to play with though :slight_smile:

2 Likes

I’m interested in learning more about CUDA; from what I understand it seems promising for GPU-accelerated computing in a similar manner to SIMD. From the brief look I had at the API, it seems you need to allocate device memory before using it.

My initial thought was to create an API similar to juce::FloatVectorOperations, just a bunch of array operations that are SIMD/GPU accelerated on the backend, and I was wondering if it would be possible to create a CUDA (or even OpenGL) backend. But the requirement for device memory allocation seems to make CUDA unsuitable for implementing a bunch of free functions meant for realtime use; you’d probably need some kind of GPUContext object that handles the device memory allocation, so you can allocate in prepareToPlay and then pass that context object to all the FloatVectorOperations-style calls (something like the sketch below).
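Purely as a hypothetical sketch of that idea (none of these names exist in JUCE or CUDA; the point is only that all device allocation happens once in prepare, so the per-block calls can stay allocation-free):

    // Hypothetical API shape, not a real library
    class GPUContext
    {
    public:
        virtual ~GPUContext() = default;

        // Called from prepareToPlay: allocate device buffers for the worst-case block size
        virtual bool prepare(int maxBlockSize, int numChannels) = 0;

        // Called from releaseResources: free everything
        virtual void release() = 0;
    };

    namespace GPUVectorOperations
    {
        // Each call may only use memory the context allocated in prepare()
        void multiply(GPUContext& context, float* dest, const float* src, int numSamples);
        void add(GPUContext& context, float* dest, const float* src, int numSamples);
    }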

Below is the repo with 3 example plugins written in JUCE using OpenCL and AMD True Audio Next. The group who wrote it published an article this month describing some nuances of GPU audio programming - interesting, but in a printed magazine and in Polish :slight_smile:

CUDA has a few different memory management APIs (managed, mapped, manual, etc.). You can pre-allocate all the memory you need but you need to be careful about hidden penalties when accessing memory.
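For illustration, pre-allocating with two of those APIs using plain CUDA runtime host calls (names are illustrative): manual device memory needs explicit cudaMemcpy transfers, while managed memory gives one pointer that the driver migrates on demand, which is where the hidden penalties tend to appear:

    #include <cuda_runtime.h>

    void prepare(size_t maxNumSamples)
    {
        const size_t bytes = maxNumSamples * sizeof(float);

        // Manual: device-only memory, filled via explicit cudaMemcpy calls
        float* deviceBuffer = nullptr;
        cudaMalloc((void**) &deviceBuffer, bytes);

        // Managed: a single pointer valid on host and device; pages migrate on access
        float* managedBuffer = nullptr;
        cudaMallocManaged((void**) &managedBuffer, bytes, cudaMemAttachGlobal);

        // ... use the buffers ...

        cudaFree(deviceBuffer);
        cudaFree(managedBuffer);
    }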

The API for CUDA and OpenCL is quite simple, with a couple of C methods to query the available devices and create a context.

NVidia has its own compiler frontend that allows you to implement the kernels in the same source file as your C++ code; it also handles the kernel parameters automatically and makes ‘invoking’ them quite a bit simpler.

It’s conceptually quite simple and implementing a naive version is trivial, but you probably won’t see much performance gain unless your per-job data size is large enough to overcome the overhead of shuffling the data around.

1 Like