GPGPU in Juce?


#1

Hello. I was somewhat confused by the fact that it is not even mentioned here at all. Are there really no plans and no reason to implement GPGPU support (e.g. OpenCL wrapper) in Juce?


#2

It’s not something I’ve ever needed myself, and nobody else has asked for it yet!


#3

Thank you for the reply. It seems strange to me that the audio industry still isn’t very interested in using GPUs for DSP.


#4

Interesting

Salvator


#5

I think there are a few reasons for this:

  • it adds extra latency (RAM > GPU > RAM)
  • for most (not all) computations, modern CPUs are simply enough
  • stability and driver support / no standard interface
  • support costs

#6

[quote]I think there are a few reasons for this:

  • it adds extra latency (RAM > GPU > RAM)
  • for most (not all) computations, modern CPUs are simply enough
  • stability and driver support / no standard interface
  • support costs[/quote]

There are other limitations that would be far more important to note:

  • The learning curve for OpenCL or the like.
  • The lack of full general GPU support; that is, the lack of support on all cards by all manufacturers.

All that aside, as cool as it would be to at least have it as an option, it could certainly be used for non-real-time computing (e.g. a final mix bounce without playback: process all DSP on the GPU, or similar).


#7

[quote]I think there are a few reasons for this:

  • it adds extra latency (RAM > GPU > RAM)
  • for most (not all) computations, modern CPUs are simply enough
  • stability and driver support / no standard interface
  • support costs[/quote]

Sorry for resurrecting this old thread, but I really feel there are a few myths about GPU compute that are in desperate need of busting. IMO the whole “GPUs are no good for audio processing” idea is a scam, created by an industry that sells overpriced DSP modules that are absolutely no competition even for mediocre GPUs.

Point 1: it adds extra latency
True, but do people actually have any idea of how much? A 16-lane PCI Express 3.0 interface provides 16 GB/sec of bandwidth. Even a 2.x x16 link provides 8 GB/sec, which is more than what a high-end DAW system could provide 5-6 years ago. The actual latency from system to GPU memory is in the range of microseconds, so latency mostly depends on what kind of buffering you choose, and naturally on cache misses.
And once data is on the GPU, the internal bandwidth exceeds 100 GB/sec. An HOUR of audio at 96 kHz is about 1350 MB. You don’t even have to do the math… Worst case, you have 5-6 GB/sec in and out of the GPU and internal bandwidth in excess of 100 GB/sec. Also don’t forget that GPUs are being integrated into processors on a regular basis, the amount of integration only increases, and in that case both the CPU and the GPU use the same memory.
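As a rough sanity check of those figures (assuming mono audio stored as 32-bit floats, since the post doesn't specify a sample format):

```python
# Rough bandwidth arithmetic for the figures above.
# Assumption: mono audio as 32-bit floats (4 bytes per sample).
SAMPLE_RATE = 96_000        # Hz
BYTES_PER_SAMPLE = 4        # 32-bit float
SECONDS_PER_HOUR = 3_600

hour_bytes = SAMPLE_RATE * BYTES_PER_SAMPLE * SECONDS_PER_HOUR
hour_mb = hour_bytes / 1_000_000          # decimal megabytes

# Worst-case PCIe bandwidth from the post: 5 GB/sec.
transfer_seconds = hour_bytes / 5e9

print(f"one hour of 96 kHz mono float audio: {hour_mb:.0f} MB")
print(f"time to push it over a 5 GB/s link: {transfer_seconds * 1000:.0f} ms")
```

That lands near the post's ~1350 MB figure, and shows that even an entire hour of audio crosses the bus in well under a second.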

Point 2: for most (not all) computations modern cpus are just enough
This might be true; however, offloading the processing of already-recorded audio from the CPU will make it possible to have even more LIVE tracks at even lower latency. And when it comes to number-crunching performance, a modern GPU can easily offer 20x the performance of a high-end CPU. That means you will never have to freeze a track again to save on processing. That means you can throw tremendously more DSP at your tracks without choking your system. Not to mention there are motherboards with 3 or 4 GPU slots, which are significantly cheaper than a multi-socket workstation with Xeons. But let’s not forget mobile devices, which have much weaker CPUs, where GPU compute could make a tremendous difference. GPU compute could enable ultra-portable, fully-fledged mobile DAW systems in a handheld form factor. Who wouldn’t like that? (Besides an industry that is happy to overcharge consumers for “specialized” hardware.) What about a 20x faster final mixdown?

Point 3: stability and driver support/no standard interface / The lack of full general GPU support; as in, the lack of support on all cards by all manufacturers.
That was not true at the time of the previous posts, and it is not true today. CUDA has been mature for quite a while, but it is vendor-limited. OpenCL, however, is not: it is a mature, standardized technology with many implementations and many more to come, and it has been stable since November 2011. Today it is supported on every major desktop platform, and it will soon hit mobile devices. It is also not strictly GPU-oriented: OpenCL kernels will execute on whatever supported hardware you throw them at in the most efficient manner possible, using AVX, SSE or even NEON instruction sets to run in a highly parallel and vectorized manner. That means you can use the same kernels and simply switch the device to a CPU if latency becomes an issue, and only for the tracks that need it, e.g. those being monitored and processed live.

What is more, OpenCL kernels can be compiled on the fly from plain string literals, so your DSP is no longer limited by your compiled C++; you can change kernels on the go without having to recompile entire plugins and without sacrificing performance.

As for driver stability: it is significantly easier to implement just the API without the functionality behind it. Video drivers are buggy mostly because of the driver-side implementations of graphics features, but with GPU compute it is the programmer who implements the actual features, so expect far fewer problems with OpenCL drivers than with OpenGL, for example.
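To illustrate the "kernels from string literals" point: an OpenCL kernel is just OpenCL C (C99-based) source text that the host hands to clCreateProgramWithSource and clBuildProgram at runtime. A minimal, hypothetical gain-stage kernel might look like this (a sketch; the kernel and parameter names are made up for illustration):

```c
// Hypothetical OpenCL C kernel: scale a block of samples by a gain factor.
// One work-item processes one sample.
__kernel void apply_gain(__global float* samples,
                         const float gain,
                         const uint numSamples)
{
    uint i = get_global_id(0);
    if (i < numSamples)        // guard: global size may exceed numSamples
        samples[i] *= gain;
}
```

Since this is plain source text, a host could rebuild or even rewrite it at runtime without recompiling any C++.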

Point 4: support-costs
I don’t see a reason for support costs to be higher because of the utilization of GPU compute.

Point 5: The learning curve for OpenCL or the like.
The learning curve of OpenCL itself is not steep at all. It is a very small API, and after you’ve done it a few times it becomes rudimentary to create, compile and execute kernels on the available OpenCL devices. The parallel programming paradigm of GPU compute is a little steeper, but it certainly offers ample returns on the investment.

And it is not just audio that can benefit from all that performance: there are already GPU compute implementations of OpenVG renderers that are significantly faster than anything a CPU can offer, while keeping the CPU free for more important tasks. Image blending, image effects, immense amounts of complex transformations: all workloads capable of choking even the fastest CPU, and all can be offloaded to the GPU with full programmability and without the limitations of graphics APIs. Using GPU compute you can implement advanced features that have been locked away as exclusives of Quadro/FireGL professional-grade CAD graphics for ages. And so much more…

A few decades back people hardly thought CPUs needed a floating point unit, and it was regular practice to build CPUs without one. Later on, FP units became external co-processors; when bandwidth became an issue they were integrated, and now there is hardly anyone who can imagine a CPU without an FP unit. In the case of GPU compute, history is merely repeating itself: GPUs are already being integrated into the vast majority of processors manufactured today, gradually transforming from useless toys suitable only for games into the next iteration of immensely powerful number-crunching co-processors, literally opening the door to a whole new realm of unprecedented computational power.


#8

I think you really mis- or over-interpret me here. What I wrote were just conclusions I reached in a real-world situation. Of course there are application areas and algorithms in audio where heavy parallelization could be very useful.
If latency, support costs etc. are no longer a problem, fine, I really appreciate it.


#9

I just quoted your post as a starting point; it was no interpretation of your post in particular. I have heard those same arguments countless times, and no instance of them was ever substantiated. It really reads like a dogmatic response: latency, vendor dependence, lack of standardization, immaturity, lack of support…

On my system I have a fairly old midrange GPU, a GeForce GTX 460. OpenCL benchmarks reveal the following performance figures (not theoretical but real-world):

Internal memory bandwidth - 90 GB/sec
System to Device bandwidth - 6.2 GB/sec
Device to system - 6.2 GB/sec
Time to copy capacity (1 GB) - 11 milliseconds
Time to read capacity - 162 milliseconds
Time to write capacity - 161 milliseconds
Peak processing power - 907 GFLOPS (compared to 67 GFLOPS for my cutting-edge i7-3770K)
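Those copy times follow directly from capacity divided by bandwidth, and they work out to milliseconds:

```python
# Sanity check: time = capacity / bandwidth for the benchmark figures above.
GIGABYTE = 1e9       # decimal GB, matching the GB/sec figures

internal_bw = 90e9   # bytes/sec, internal memory bandwidth
pcie_bw = 6.2e9      # bytes/sec, system <-> device bandwidth

copy_ms = GIGABYTE / internal_bw * 1000   # on-device copy of 1 GB
read_ms = GIGABYTE / pcie_bw * 1000       # 1 GB device -> system

print(f"on-device copy of 1 GB: {copy_ms:.1f} ms")
print(f"1 GB over PCIe at 6.2 GB/s: {read_ms:.1f} ms")
```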

Those latency and bandwidth numbers significantly exceed the performance figures of a 6-7 year old CPU, which IIRC was more than capable of providing low latency audio processing. And for modern high-end GPUs the numbers are much, much better.

Keep in mind vector instructions (SSE, AVX) also add latency while being scheduled and executed, and yet there is absolutely no DAW application that doesn’t use them; in fact at least SSE2 is often mandatory. In reality all audio processing algorithms run vectorized and in parallel: if you don’t implement it explicitly, the compiler does its best to do it automatically for you. The only difference is that with OpenCL it won’t happen automatically, and you have to invest some more time in design considerations.

Also, once you get audio data onto the GPU, the internal bandwidth is comparable to on-die L3 CPU cache, far faster than the system RAM a CPU will use in a typical workload. A well-crafted GPU compute engine can avoid excessively moving data between system and GPU memory on every pass: limit the traffic to sending raw audio input and receiving fully processed output, and keep all the intermediate data in GPU memory. Even 1 GB is more than enough for a DAW buffer, not to mention there are 3+ GB GPUs on the market.

I really don’t think the GPU itself will be the cause of a latency bottleneck, considering that a USB 2 based audio interface can stream over a dozen channels of audio over a very low-bandwidth and highly latent serial interface and still provide what is considered “satisfactory” performance. Compared to that, PCI Express and the GPU itself are far less latent, with enormous bandwidth, and highly parallel.


#10

Wouldn’t GPGPU be limited to the desktop? MMX and SSE (and SSE’s later versions) are Intel… desktop only too?

Also, what guarantee is there that any GPGPU code will work out on… not your platform with not so similar hardware? Try your code out on a shitty laptop that has an ATI graphics card… an Intel graphics card… I bet you OpenCL 1.0 won’t be working generically for all the forms of these manufacturers’ hardware. Hell, shaders don’t even work the same for all these graphics cards! What makes you think GPGPU will?

From Jules’ perspective, wouldn’t that mean JUCE would have to have a system set up per platform, per type of hardware, to take full advantage of such APIs? And who’s going to have all the hardware to test all these cases out? That would be a learning curve… more so a pain in the ass, really.


#11

Well, NEON on ARM devices is basically a SIMD implementation for RISC processors. The first few iterations of ARM processors did not feature a vector/SIMD unit, but now pretty much every recent model has one, and slowly but steadily 128-bit wide SIMD units are becoming a common thing.
So while SSE itself is Intel’s flavor of SIMD, SIMD instructions are not limited to desktop machines. In fact, Intel’s latest Medfield processor, which is strictly mobile and intended for use in phones and tablets, does support SSE, as do all previous Atom processors.

And OpenCL is indeed coming to mobile devices: the Mali-T600 GPU from ARM already supports it. PowerVR announced an OpenCL-supporting device back in 2009, way before OpenCL was a stable standard. Mobile x86 CPUs from AMD support it too. The thing is that implementing OpenCL is not that big a deal; as I said, it is much smaller and simpler than creating an OpenGL implementation for the same hardware, and the sooner OpenCL becomes widely utilized, the sooner the rest of the chip makers will step forward with implementations of their own. PGI have already brought out an OpenCL compiler for multicore ARM devices.

What guarantees that GPGPU code will work across different platforms? Well, OpenCL for one; that is the whole point of establishing a standard, so that different implementations have a common interface to conform to. A complete implementation grants a device OpenCL compliance, and while implementations and their usage from C/C++ do vary, kernels written in OpenCL should work the same way, just like different GPUs with different implementations of OpenGL can execute the same OpenGL routines, with the difference that OpenCL is much smaller and much closer to the hardware than OpenGL. The reason you get different results or completely broken features on some OpenGL devices is that the implementation is huge and high-level, whereas OpenCL is basically a subset of C, and it boils down to copying data and performing arithmetic operations on it; that’s it. I have tested OpenCL on my laptop, which does feature a shitty Radeon GPU, and it works exactly the same way as on my desktop PC, only much slower.

As far as portability is concerned, it will be far less of an issue than, for example, audio. Just look at how JUCE uses different interfaces for audio on each platform: for Android, OpenSL and native audio; for iOS, CoreAudio; for Linux, ALSA and JACK; for Windows, ASIO and DirectSound, each demanding its own implementation. With OpenCL it should be much more straightforward, consistent and easy, since on every platform you will have the same standard-conforming interface. The whole point of OpenCL is to be hardware-independent and to work on every processor, be that a CPU or a GPU, that complies with the standard.

The real pain in the ass would be getting yourself to think and program in a paradigm that can take full advantage of the power OpenCL offers. One of the main reasons CUDA managed to establish itself despite being vendor-limited is that its sole vendor went ahead and created a fairly big library of functionality for programmers to take for granted, but I imagine that as OpenCL picks up, libraries for it will start popping up, which should make things much easier for everyone.

I feel I should clarify: I don’t mean that GPU compute is a substitute for CPU compute. GPUs need a much bigger workload to really shine; for small and short tasks, the I/O overhead completely overshadows the raw performance of the GPU. A GPU will not be able to handle a real-time audio plugin anytime soon, but it can easily offload all processing of already-recorded data from the CPU. And in time, as GPUs become more integrated with CPUs, they will get more efficient at smaller tasks too, because all the I/O overhead will be eliminated, and the CPU part itself will be pretty much dedicated to scheduling work for the GPU part, increasing computational throughput significantly at the cost of negligible latencies, comparable to those of SIMD units. In fact I think that in time SIMD units will be replaced by MIMD units, pretty much putting the GPU in the place of the current vector units.


#12

Even with all that, overall support doesn’t look that good in terms of covering all the platforms that JUCE covers as-is. From what I understand, JUCE is WinXP-compatible, and should be compatible with all desktop-oriented hardware at the very least.

Looking at ATI’s APP supported OSs: Vista® SP2 (32-bit/64-bit) minimum.

Looking at Intel OpenCL: Support seems to be limited by CPU?

It’s not clear to me exactly, but reading about the OpenCL specification itself; it was developed in 2008… so, does that mean you can’t run OpenCL code on video cards manufactured prior to 2008?

CUDA has a fairly narrow scope for supported cards too.

To me, dealing with all these edge cases doesn’t seem worthwhile… I wouldn’t make use of GPGPU unless my software targeted a specific OS and a specific family of graphics cards. The app would have to know at run-time what the OS and CPU are (which JUCE has SystemStats for), but also the graphics card(s) (via OpenGL’s GL_VENDOR?)… and then it would have to set up a massive back-end system to deal with all of that. (Or is that not too farfetched? To me it is…)


#13

Intel have yet to produce a decent GPU; that is the main reason their implementation of OpenCL is CPU-based. I am sure Xeon Phi (which was supposed to be a mainstream Larrabee GPU) will support it, since it is essentially a parallel vector co-processor. Hopefully we will see OpenCL support for Intel’s IGP too, even though it is fairly mediocre in performance.

You seem too concerned with support for aging software and hardware. The number of Windows 98 users will only drop over time, as will the number of users who own old GPUs. Those components are not being manufactured or supported anymore; those are platforms whose days are pretty much numbered. With prices of RAM so low, who will go forward and use an old 32-bit OS anyway? Spare components for those platforms are not manufactured today, and the little outdated stock that is left sells at such inflated prices that it is often better to just get a new system than to upgrade or replace components of an old one. I am all for platform longevity, but concerns of the past should not impair the future: go for the new stuff and resort to fallbacks for the older platforms instead of making compromises with much more beneficial technologies. CPU manufacturers have long abandoned the race for single-threaded performance; the future is parallel, and GPU compute is as parallel as it gets.

As I said, OpenCL support is still picking up; the stable release is less than a year old. I expect that by this time next year there won’t be many devices that don’t support it.

Besides, a C++ OpenCL wrapper is a minor endeavor compared to some other stuff I’ve read of Jules investing time and effort into.


#14

Thank you for those numbers. It’s most enlightening.


#15

[quote=“jrlanglois”]Wouldn’t GPGPU be limited to the desktop? MMX and SSE (and SSE’s later versions) are Intel… desktop only too?

Also, what guarantee is there that any GPGPU code will work out on… not your platform with not so similar hardware? Try your code out on a shitty laptop that has an ATI graphics card… an Intel graphics card… I bet you OpenCL 1.0 won’t be working generically for all the forms of these manufacturers’ hardware. Hell, shaders don’t even work the same for all these graphics cards! What makes you think GPGPU will?

From Jules’ perspective, wouldn’t that mean JUCE would have to have a system set up per platform, per type of hardware, to take full advantage of such APIs? And who’s going to have all the hardware to test all these cases out? That would be a learning curve… more so a pain in the ass, really.[/quote]
I agree. But GPU usage would normally be optional, with code duplicated for the CPU as well; at least that’s how I see it. Most apps that use GPUs test the system and decide whether it’s worth it in the first place.


#16

My opinion on the whole GPU-for-audio thing:

I think it could be very powerful and GPUs would be great for crunching the numbers… BUT I don’t think it’s possible to write a plugin using today’s plugin formats that uses the GPU effectively, because of latency issues. (I’m ignoring RTAS, of course, which has done DSP stuff for years…)

Really, the only way to make it work well would be to build a host whose entire audio pipeline was GPU-based, and to define a completely new plugin format, where the plugins provide GPU code for the host to compile (like GL shader code). The host could compile these programs together into a very efficient rendering path.

Of course this’d be enormously difficult to achieve, both technically and from a business perspective, because you’d not only need to build an entire host, but would need to persuade all the existing plugin devs to duplicate all their existing algorithms in a vastly more complicated and restrictive way. Definitely not easy to do!


#17

I totally agree with chordofdiscord. Okay, CPU-to-GPU memory transfer is much slower than GPU-to-GPU transfer, but it’s still 8 GB per second (versus ~150 GB); I personally need 300 kB/sec for a stereo signal :wink:
Latency… yes and no: a machine that plays a first-person shooter at 80 fps (drawing millions of polygons on a 1280x1024 = 1,310,720 pixel display) should be able to render a 512-frame buffer 88 times a second….
If only I had time to experiment with this…
Note: the language you are talking about is OpenCL, Jules, and compiling at runtime depending on the GPU brand is exactly what it does; that is why it sounds so interesting. OpenCL is based on C99, and I do not think it would be that complicated to build a kind of internal GPU rendering graph, which would obviously be managed by clean C++. Or maybe even write only the kernels in C99 and the function calls in C++ (keeping the memory on the graphics card until the very end of the rendering).
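As a quick check of that rendering-rate arithmetic (assuming 44.1 kHz playback, which the post doesn't state explicitly):

```python
# How many 512-sample buffers per second does 44.1 kHz audio need?
SAMPLE_RATE = 44_100
BUFFER_SIZE = 512

buffers_per_second = SAMPLE_RATE / BUFFER_SIZE
print(f"{buffers_per_second:.1f} buffers/sec")
```

That comes out at roughly 86 buffers per second, close to the 88 quoted.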


#18

Irrelevant. It’s latency that matters, not bandwidth. And GPUs are built for high bandwidth, but not low latency. In fact, when drivers batch together operations, it improves bandwidth at the expense of latency.

If your audio buffer is 64 samples, that gives you 1ms to do the processing. So within 1ms you’d need to do a minimum of two complete bus transfer operations (to and from the GPU) with absolutely no glitches, and also leave some time for actually doing the processing in between. That’d be the equivalent of a game running at more than 1000fps, and being absolutely rock-solid at that speed, regardless of what the rest of the system is doing.
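The deadline arithmetic is easy to check (assuming a 48 kHz sample rate, which the post doesn't specify):

```python
# Processing deadline for one audio buffer: buffer_size / sample_rate.
SAMPLE_RATE = 48_000
BUFFER_SIZE = 64

deadline_ms = BUFFER_SIZE / SAMPLE_RATE * 1000
equivalent_fps = 1000 / deadline_ms   # how often the "frame" must complete

print(f"deadline per 64-sample buffer: {deadline_ms:.2f} ms")
print(f"equivalent frame rate: {equivalent_fps:.0f} fps")
```

So at 48 kHz the buffer must be turned around roughly every 1.3 ms, and both bus transfers plus the processing itself all have to fit inside that window, every time, without fail.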

But by far the biggest problem is that for it to work, ALL your plugins would need to run on the GPU, because if some operations are still running on the CPU, there’s a massive problem in routing the data between them. So you’d need to persuade ALL plugin writers to rewrite their code in GPU style. It’s a chicken-and-egg problem.


#19

OpenCL as a language is very simple; however, the gap between learning the language and using it well is pretty wide. It is not the language but the parallel processing paradigm that is challenging.

I have stumbled upon a PDF from Khronos that talks about real-time audio auralization: modelling sound after actual complex 3D geometry in a dynamic scene, pretty much a very sophisticated reverb. The PDF mentions using a 4096-sample buffer, resulting in about 100 msec of latency for 44.1 kHz audio. I do realize latency may not scale linearly in practice, but if a buffer of 4096 samples = 100 msec, then a buffer of 256 = 6.25 msec and a buffer of 128 = 3.125 msec. Let’s not forget a GPU can do 20-100x the work of a high-end CPU in the same amount of time, and that GPU memory capacity is in the realm of gigabytes and is MUCH faster than system memory, with bandwidth comparable to on-die CPU cache. There is certainly lots of potential.
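That extrapolation is straightforward (with the caveat, as noted, that real latency rarely scales perfectly linearly):

```python
# Scale the reported 100 ms @ 4096 samples down to smaller buffers,
# assuming latency is proportional to buffer size.
REPORTED_LATENCY_MS = 100.0
REPORTED_BUFFER = 4096

def scaled_latency_ms(buffer_size: int) -> float:
    return REPORTED_LATENCY_MS * buffer_size / REPORTED_BUFFER

for size in (4096, 256, 128):
    print(f"{size:5d} samples -> {scaled_latency_ms(size):.3f} ms")
```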

On the subject of how applicable OpenCL and GPU compute are, one simply has to ask: HOW MANY of your tracks run LIVE?

99% of the time I only need 1 or 2 channels running DSP live, plus many, many more that are already recorded and can be buffered more deeply, precomputed in advance and just synced to the live signal. GPUs have plenty of CPU-to-GPU bandwidth, enough to buffer an entire multitrack project in an instant, and the internal bandwidth is plain and simple brutal, so the trick to ultra-efficient GPU processing is to be smart and keep data in the GPU between the different DSPs to avoid shifting it around unnecessarily.

As for the actual latency of GPUs: even though a GPU does add latency, I don’t think it is enough to choke something as lightweight as audio. I think this misconception originates from the fact that GPUs are massively parallel and people usually throw massive amounts of data at them, which takes time to copy; the actual hardware latency is in the realm of microseconds, and operation latency depends on how much data is involved. Transferring tiny audio buffers takes nowhere near the time a typical GPU compute workload takes to transfer. GPUs are not inherently latent; that is a misconception and a myth, perpetuated by an industry that makes good money on DSPs that have a fraction of the performance of a GPU at several times the price. The latency of a GPU is no higher than the latency of a dedicated PCI-E DSP card; it is the latency of the interface.

But then again, even if real-time DSP is not doable on a GPU, there is huge potential for improvement in implementing GPU compute for pre-recorded tracks, bouncing, freezing tracks and so on… On my TonePort, Line 6 have added a separate buffer for their own processing, independent of the ASIO buffer length, so I can have low-latency monitoring for my live signal and a more stable, longer-buffered DAW. But it is all still processed on the CPU; a similar scheme could be implemented to offload the longer buffer to the GPU and allow even shorter buffers and even lower latency for the live processing.

1000 FPS in a game is entirely possible; I don’t play many games, but I have Call of Duty 2 and it has no problem sustaining 1400 FPS on lower settings on my aging midrange GPU. 64 samples is overkill though; many audio interfaces don’t even support such small buffers. And modern games that run at ~100 FPS use so much processing that it would choke a CPU to like… 0.1 FPS.

I think it is about time to address GPU compute, mostly because CPU makers have pretty much given up on boosting CPU performance. Just take a look at the last few generations of Intel and AMD processors: most of the chips they sell have integrated GPUs, and there is very little focus on improving CPU performance with each new generation, while GPU performance gets a significantly bigger boost. For example, Sandy Bridge to Ivy Bridge offers a 5-7% improvement in CPU performance and about a 60% improvement in GPU performance.

With GPU compute, applications will be able to utilize more of the current and future generations of CPUs, which are headed toward completely fusing the GPU and CPU, morphing into MIMDs in addition to SIMDs, using the main system memory directly, with no need to send extra copies of the data for computing on the GPU. And OpenCL can fall back to processing on any compatible processor in an optimized, parallel way: you can target different OpenCL devices, run the live tracks’ OpenCL kernels on the CPU, and the rest of the tracks can use the very same kernels, just run on the GPU.


#20

At the risk of appearing like a troll: this market is way too small for ambitious endeavours like that.