Cuda phase vocoder

It would be fun to try to build a standalone JUCE audio application that did real-time pitch detection and pitch shifting on the GPU.

That might be on my wish list of fantasy JUCE features.
I keep wondering about which audio effects could be improved by parallelising stuff … the new i9 made me think about it more. Clock speed is the same, just more and more cores… sean


I am not sure how much phase vocoder performance could be improved by doing it in parallel. Maybe process stereo or surround signals using multiple CPU cores, but GPU…? Is there really a need to run dozens of these processes at the same time? Multicore CPU power might be enough anyway. It’s also a process that needs access to previously calculated data, so it’s not a strong contender for a GPU-based implementation.

Still, including some phase vocoder processing classes (even just CPU based) in JUCE would be welcome, because with those it’s easier to do actually interesting spectral processing. Just having some FFT classes available doesn’t enable doing much…


You are probably right, Xenakios. It’s a very complicated algorithm with lots of interleaved parts that are difficult to separate, that’s for sure.
Taking this one as an example …

Inside this loop, for example:

/* main processing loop */
for (i = 0; i < numSampsToProcess; i++) {

Is it possible to divide up blocks of samples and send them out to separate processes?
In theory they could be reassembled in place in the master process.
Perhaps this would allow larger or smaller frames or buffer sizes to be processed and improve performance.
Easy to say, but it probably doesn’t help in practice and would be a bugger to figure out…
Someone smarter than me might be able to parallelize some of those calculations without creating static.
That would certainly be on my wish list: a simple JUCE pitch shift or pitch detection built on FFT classes that I could use…

I don’t know about GPU programming, but the phase vocoder itself isn’t a very complicated algorithm. The baseline phase vocoder from the 1970s takes less than 50 lines of Matlab to implement, even from scratch. If you do it in Python with libraries that already provide the short-time Fourier transform, it’s just a for loop over all STFT frames that messes a bit with the instantaneous frequencies. Even some newer versions with improved quality would take less than 10 lines more than the baseline.
They really boil down to STFT → processing → ISTFT, where the performance bottleneck is vastly dominated by the STFT/ISTFT. In the non-realtime case, parallelising the STFT is pretty straightforward: you just divide the input into frames and sweep the kernel through. But for realtime implementations, latency is a concern. Even if GPU latency is low, you can’t really process N frames at a time, because that adds (N * frame size * overlap ratio) samples of delay. On the other hand, if N = 1, there’s no point in doing it on a GPU if it doesn’t scale. It could help, though, if you want to share the kernel across M channels of audio.
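To make the “mess a bit with the instantaneous frequencies” part concrete, here is a minimal sketch in C++ of the per-bin phase bookkeeping at the heart of a baseline phase-vocoder time-stretcher. All names (PhaseVocoderBin, etc.) are made up for illustration, not from any library: for each STFT bin we unwrap the measured phase advance against the expected advance for that bin, derive the instantaneous frequency, and accumulate a synthesis phase scaled by the stretch ratio.

```cpp
#include <cassert>
#include <cmath>

constexpr double kPi = 3.14159265358979323846;

// Wrap a phase value into (-pi, pi].
inline double wrapPhase (double p)
{
    while (p >   kPi) p -= 2.0 * kPi;
    while (p <= -kPi) p += 2.0 * kPi;
    return p;
}

struct PhaseVocoderBin
{
    double lastAnalysisPhase = 0.0;
    double synthesisPhase    = 0.0;

    // newPhase:        measured phase of this bin in the current frame
    // expectedAdvance: bin centre frequency * analysis hop (radians per hop)
    // stretchRatio:    synthesis hop / analysis hop
    double process (double newPhase, double expectedAdvance, double stretchRatio)
    {
        // Deviation of the measured phase advance from the bin centre:
        const double deviation = wrapPhase (newPhase - lastAnalysisPhase - expectedAdvance);
        // True (instantaneous) phase advance per analysis hop:
        const double instFreq  = expectedAdvance + deviation;
        lastAnalysisPhase = newPhase;
        // Accumulate the synthesis phase, scaled by the stretch ratio:
        synthesisPhase = wrapPhase (synthesisPhase + instFreq * stretchRatio);
        return synthesisPhase;
    }
};
```

The real algorithm runs this update for every bin of every frame, which is exactly the “for loop over all STFT frames” mentioned above.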
Jean Laroche and Mark Dolson had a pretty solid paper on several improved phase vocoders. I highly recommend giving them a try.

J. Laroche and M. Dolson, “Improved Phase Vocoder Time-Scale Modification of Audio,” IEEE Trans. on Speech and Audio Processing, Vol. 7, No. 3, May 1999.


There are very rare cases where parallel processing for DSP inside a plugin is required. Most plugins use far less than one CPU core, and in my opinion parallelising makes no sense if you don’t need more than one core. Executing DSP in parallel also adds overhead and is complicated.

It’s up to the host to process the loaded plugins on different CPU cores. So we already get some kind of parallel processing.


Worth noting about that code:

  1. It’s not even compatible with multithreaded processing to begin with, because of the static variables used. (That is, if you put that code inside a VST or similar plugin, multiple instances of the plugin would not work in hosts that do multithreaded audio processing.) Likewise, the code wouldn’t work even within the same plugin instance for processing stereo or more channels. These problems are not too complicated to fix: just make a C++ class out of the code and move the static local variables into class members.

  2. It uses a simple one-function FFT implementation. You would definitely want to change the code to use a more advanced (read: optimized) FFT implementation.

While it’s hard to actually parallelize spectral processing, there is one use case for multithreading in a real-time context that should be mentioned. If the FFT and hop sizes are (much) larger than the audio buffer size, you could offload the actual processing to a worker thread. Just imagine working with an FFT size of 4096 samples, a hop size of 1024 and a DAW buffer size of 32. If you did it in a single-threaded fashion, that’d mean you would have to execute a 4096-point FFT once every 32 callback invocations (in other words, a CPU spike every 32 callbacks). By feeding the audio data to a background thread, this thread can distribute the processing load over several callback periods. The price of course is some additional latency, but FFT processing comes with inherent latency anyway.
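A hedged sketch of that scheme, with all names made up: the audio callback only pushes small buffers into a FIFO, and a worker thread wakes whenever a full FFT-sized block has accumulated and does the heavy work there. For brevity this uses a mutex on the push path; a production version would use a lock-free FIFO on the audio thread.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

class SpectralWorker
{
public:
    explicit SpectralWorker (size_t fftSizeIn) : fftSize (fftSizeIn)
    {
        worker = std::thread ([this] { run(); });
    }

    ~SpectralWorker()
    {
        { std::lock_guard<std::mutex> lk (m); quit = true; }
        cv.notify_one();
        worker.join();
    }

    // Called from the audio callback with small buffers (e.g. 32 samples).
    void push (const float* data, size_t n)
    {
        std::lock_guard<std::mutex> lk (m);
        fifo.insert (fifo.end(), data, data + n);
        if (fifo.size() >= fftSize)
            cv.notify_one(); // a full FFT block is ready
    }

    int blocksProcessed() const { return processed.load(); }

private:
    void run()
    {
        std::unique_lock<std::mutex> lk (m);
        for (;;)
        {
            cv.wait (lk, [this] { return quit || fifo.size() >= fftSize; });
            if (quit)
                return;
            // Take one FFT-sized block out of the FIFO:
            std::vector<float> block (fifo.begin(),
                                      fifo.begin() + (std::ptrdiff_t) fftSize);
            fifo.erase (fifo.begin(), fifo.begin() + (std::ptrdiff_t) fftSize);
            lk.unlock();
            // ...the heavy FFT / spectral processing would happen here,
            // safely off the audio thread...
            processed.fetch_add (1);
            lk.lock();
        }
    }

    size_t fftSize;
    std::vector<float> fifo;
    std::mutex m;
    std::condition_variable cv;
    std::thread worker;
    bool quit = false;
    std::atomic<int> processed { 0 };
};
```

With the numbers from the example (FFT size 4096, buffer size 32), 128 callbacks feed the FIFO before the worker has one block to chew on, and the cost of that block is spread over the following callback periods instead of spiking a single one.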

Thanks Xenakios!
Inadvertently you helped me solve another issue I was having trying to implement that routine …

It works pretty well just processing one channel at a time standalone at this point, which is really all I need. I will have to think a bit about how to implement the whole routine as an object. I assume there is already some sort of JUCE object instantiated that it might need to fit inside or inherit from in some way?

I don’t suppose it’s worth the trouble to inherit from some JUCE class for it. Just write a new class that doesn’t inherit from anything. (JUCE’s AudioProcessor and AudioSource have lots of methods to implement that are not really relevant here.) You really have to get rid of the static variables in the phase vocoder processing function, even if it looks like it is working now! So you need to make it work as a class.

Thanks Xenakios that sounds like something I can figure out!
I assume the pointers to the buffers are private members of the class and setter functions can do the processing… I am showing my C++ ignorance here!

I’ve previously converted the smbPitchShift code into a C++ class but I’ve probably lost the code, so I’d need to rewrite it…

It will be good practice for me to try. Thanks Xenakios!!

I started working on the class based implementation. It’s now working more or less.

The current version is here, but I will make some additional clean-ups to the code… (For example, I would change the statically sized float arrays into std::vectors):

One thing to note is that the algorithm seems to want buffers sized as powers of 2, which is a pain. For example, my audio interface in Windows WASAPI mode has a buffer size of 441 samples, so the algorithm isn’t directly compatible when running as a standalone app using WASAPI… It does work as a VST plugin in Reaper with ASIO and a buffer size of 512 samples.
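One possible workaround, sketched here with hypothetical names (this is not the code linked above): keep a small FIFO inside the processor so it always works on fixed power-of-2 frames internally, no matter what block size the host delivers. The cost is an extra frameSize - 1 samples of latency while the first frame fills up. The per-frame work is a placeholder here.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

class BlockAdapter
{
public:
    explicit BlockAdapter (size_t frameSizeIn) : frameSize (frameSizeIn) {}

    // Accepts any host block size (441, 512, 32, ...).
    void process (const float* in, float* out, size_t n)
    {
        for (size_t i = 0; i < n; ++i)
        {
            inFifo.push_back (in[i]);
            if (inFifo.size() == frameSize)
            {
                processFrame (inFifo); // fixed power-of-2 work happens here
                outFifo.insert (outFifo.end(), inFifo.begin(), inFifo.end());
                inFifo.clear();
            }
            if (outFifo.empty())
            {
                out[i] = 0.0f;         // priming: frameSize - 1 samples of latency
            }
            else
            {
                out[i] = outFifo.front();
                outFifo.pop_front();
            }
        }
    }

private:
    void processFrame (std::vector<float>& frame)
    {
        // Placeholder: the real phase-vocoder frame processing goes here.
        (void) frame;
    }

    size_t frameSize;
    std::vector<float> inFifo;
    std::deque<float> outFifo;
};
```

Because the accumulation is per-sample, the host block size can even vary from call to call; only the internal frame size has to be a power of 2.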

So nice, thanks!! I will play around with it!!!