GPU processing with OpenGL and compute shaders

CUDA has a few different memory management APIs (managed, mapped, manual, etc.). You can pre-allocate all the memory you need, but you have to be careful about hidden penalties when accessing it.
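A minimal sketch of those three allocation styles using the CUDA runtime API (error checking omitted; the comments note where the hidden penalties tend to come from):

```cuda
#include <cuda_runtime.h>

int main() {
    const size_t n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Manual: explicit device allocation; data must be moved with cudaMemcpy.
    float* d_manual = nullptr;
    cudaMalloc(&d_manual, bytes);

    // Managed: one pointer usable from host and device; pages migrate on
    // demand, so first-touch accesses can trigger hidden fault/migration costs.
    float* managed = nullptr;
    cudaMallocManaged(&managed, bytes);

    // Mapped (zero-copy): pinned host memory mapped into the device address
    // space; every device-side access crosses the bus.
    float* h_mapped = nullptr;
    cudaHostAlloc(&h_mapped, bytes, cudaHostAllocMapped);

    cudaFree(d_manual);
    cudaFree(managed);
    cudaFreeHost(h_mapped);
    return 0;
}
```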

The APIs for CUDA and OpenCL are quite simple, with a couple of C functions to query the available devices and create a context.
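On the CUDA side that enumeration looks roughly like this (runtime API; the context itself is created lazily on first use, so there is no explicit create-context call):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("device %d: %s (%d SMs)\n", i, prop.name, prop.multiProcessorCount);
    }
    // Pick a device; the runtime sets up its context implicitly.
    cudaSetDevice(0);
    return 0;
}
```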

NVIDIA has its own compiler frontend (nvcc) that allows you to implement the kernels in the same source file as your C++ code. It also parses the parameters automatically and makes 'invoking' them quite a bit simpler.
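For example, a kernel and its launch can sit side by side in one `.cu` file, and the triple-chevron `<<<blocks, threads>>>` syntax handles the argument marshalling. A sketch (`saxpy` is just an illustrative kernel, and the pointers are assumed to be device allocations):

```cuda
#include <cuda_runtime.h>

// Device code lives next to host code; nvcc separates the two.
__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

void run(int n, float a, const float* d_x, float* d_y) {
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    // Launch: nvcc generates the argument-passing boilerplate for you.
    saxpy<<<blocks, threads>>>(n, a, d_x, d_y);
    cudaDeviceSynchronize();
}
```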

It's conceptually quite simple, and implementing a naive version is trivial, but you probably won't see much of a performance gain unless your per-job data size is large enough to overcome the overhead of shuffling the data between host and device.
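One way to check whether the transfer overhead dominates is to time the copy and the kernel separately with CUDA events. A sketch (the `work` kernel is a hypothetical stand-in for the real per-job computation):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical kernel standing in for the actual per-job work.
__global__ void work(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 2.0f + 1.0f;
}

int main() {
    const int n = 1 << 24;
    float* h = new float[n]();
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));

    cudaEvent_t t0, t1, t2;
    cudaEventCreate(&t0); cudaEventCreate(&t1); cudaEventCreate(&t2);

    cudaEventRecord(t0);
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);  // transfer
    cudaEventRecord(t1);
    work<<<(n + 255) / 256, 256>>>(d, n);                         // compute
    cudaEventRecord(t2);
    cudaEventSynchronize(t2);

    float copyMs = 0.0f, kernelMs = 0.0f;
    cudaEventElapsedTime(&copyMs, t0, t1);
    cudaEventElapsedTime(&kernelMs, t1, t2);
    // If the copy time dominates, the offload isn't paying for itself.
    printf("copy: %.2f ms, kernel: %.2f ms\n", copyMs, kernelMs);

    cudaFree(d);
    delete[] h;
    return 0;
}
```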
