Comparing FFT engines

juce::dsp::FFT helpfully provides wrappers to several FFT engines (which are enabled conditionally according to preprocessor definitions), but how does one know which FFT engine to use?

It would have been nice if juce::dsp::FFT let us choose the engine not only at compile time but also at run time, so that one could write a simple benchmark that iterates over all of them and compares the results.
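JUCE itself only picks an engine at compile time, but the run-time selection wished for here boils down to keeping a small registry of engines behind one common signature. A minimal sketch outside JUCE, with a naive DFT standing in for the real backends (all names here are invented for illustration, not JUCE API):

```cpp
#include <cmath>
#include <complex>
#include <functional>
#include <string>
#include <vector>

// One entry per engine: a name plus a callable that performs a forward
// complex transform in place. A run-time-selectable build could register
// vDSP, IPP, MKL, FFTW, etc. behind this one signature.
struct FftEngine
{
    std::string name;
    std::function<void(std::vector<std::complex<float>>&)> forward;
};

// Naive O(N^2) DFT standing in for a real backend.
inline void naiveDft(std::vector<std::complex<float>>& data)
{
    const std::size_t n = data.size();
    std::vector<std::complex<float>> out(n);
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t t = 0; t < n; ++t)
        {
            const float angle = -2.0f * 3.14159265f * float(k * t) / float(n);
            out[k] += data[t] * std::complex<float>(std::cos(angle), std::sin(angle));
        }
    data = out;
}

// A benchmark can iterate over this list and time each engine in turn.
inline std::vector<FftEngine> availableEngines()
{
    return { { "naive-dft", naiveDft } };  // a real build would append the compiled-in backends
}
```

With this shape, the benchmark loop is just `for (auto& e : availableEngines()) { time e.forward(...); }`, which is exactly what a cross-engine comparison needs.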

I’ve made proof-of-concept modifications to it that allow choosing the FFT engine, and incorporated a small benchmark into the SimpleFFTDemo.

I then compared the different engines; here are the benchmark results on my M1 Mac:

Engine                    Order=10    Order=15
Intel IPP (Rosetta)       23 μs       3 ms
Intel MKL (Rosetta)       23 μs       2.9 ms
Apple vDSP (Native)       20 μs       2.5 ms
Apple vDSP (Rosetta)      27 μs       3.5 ms
FFTW 3.3.9 (Native)       18 μs       2.6 ms
FFTW 3.3.9 (Rosetta)      29 μs       3.1 ms
JUCE Fallback (Native)    127 μs      26 ms
JUCE Fallback (Rosetta)   194 μs      35 ms
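For anyone wanting to reproduce numbers like these, the measurement itself is just a steady-clock average around the transform call. A minimal harness sketch (the harness is generic; the `perform` call in the comment refers to juce::dsp::FFT's API and is not part of the compiled snippet):

```cpp
#include <chrono>

// Times `iterations` calls of a workload and returns the average
// per-call duration in microseconds.
template <typename Fn>
double averageMicroseconds(Fn&& workload, int iterations)
{
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int i = 0; i < iterations; ++i)
        workload();
    const auto elapsed = clock::now() - start;
    return std::chrono::duration<double, std::micro>(elapsed).count() / iterations;
}

// Usage with a real engine would look something like:
//   averageMicroseconds([&] { fft.perform(input, output, false); }, 100);
```

Averaging over many iterations smooths out scheduler noise, which matters when a single order-10 transform takes only tens of microseconds.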

Some notes:

  • YMMV. This benchmark ran only on my own device, and it isn’t super precise (I just eyeballed the values from the app). Feel free to run the benchmark on your own devices and add your results.
  • JUCE requires small modifications to support IPP on platforms other than Windows (included in the change). Also, JUCE’s compilation breaks when both MKL and FFTW are enabled (fix included in the change).
  • To use FFTW, one needs (besides installing it) to set the run-time dynamic-library search path so it can be found (LD_LIBRARY_PATH=/opt/homebrew/Cellar/fftw/3.3.9_1/lib:/usr/local/homebrew/Cellar/fftw/3.3.9_1/lib in my case).

Here are the results from a release build, and apparently there’s a huge difference.
It seems that in debug builds, most of the time spent in the benchmark was actually in the JUCE code surrounding the optimized FFT backends.

This significantly changes the results!

  • Intel IPP is faster than Apple’s vDSP, even with Apple’s implementation running in a native process and Intel’s under Rosetta
  • And IPP is indeed faster than MKL
  • JUCE’s fallback FFT is only about 4x slower than Intel’s implementation

Engine          ARM 2^10    Rosetta 2^10    ARM 2^15    Rosetta 2^15
Intel IPP       -           8 μs            -           0.9 ms
Intel MKL       -           8 μs            -           1.1 ms
Apple vDSP      9 μs        12 μs           0.9 ms      1.6 ms
PFFFT           5 μs        9 μs            0.5 ms      1.0 ms
JUCE Fallback   13 μs       20 μs           3.6 ms      4.2 ms

It is definitely interesting to benchmark FFT libraries, but in my experience the FFT is by far not the bottleneck when writing algorithms for spectral processing. For example, a fast way to calculate atan2 would bring more gain. Just my two cents :wink:
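To illustrate the atan2 point: phase extraction in a vocoder-style loop calls atan2 once per bin per hop, and a polynomial approximation is often good enough there. A sketch of one common approximation (this is a generic textbook-style formula with ~0.004 rad worst-case error, not code from any library mentioned in this thread):

```cpp
#include <algorithm>
#include <cmath>

// Polynomial approximation of atan on [0, 1]:
//   atan(a) ≈ (pi/4)*a + 0.273*a*(1 - a), max error ~0.004 rad.
// fastAtan2 reduces any (y, x) to that range, then fixes up the octant
// and quadrant so the result matches std::atan2's conventions.
inline float fastAtan2(float y, float x)
{
    if (x == 0.0f && y == 0.0f)
        return 0.0f;
    const float ax = std::fabs(x), ay = std::fabs(y);
    const float a  = std::min(ax, ay) / std::max(ax, ay);  // a in [0, 1]
    float r = a * (0.78539816f + 0.273f * (1.0f - a));     // ≈ atan(a)
    if (ay > ax)  r = 1.57079633f - r;   // reflect past 45°
    if (x < 0.0f) r = 3.14159265f - r;   // left half-plane
    return (y < 0.0f) ? -r : r;          // lower half-plane
}
```

Whether ~0.004 rad of phase error is acceptable depends on the algorithm; for rough phase tracking it usually is, while precise frequency estimation may need a higher-order polynomial.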

Well, I know of at least one accomplished plugin where the FFT takes up a significant enough share of the CPU usage to make the native ARM version slower than the Rosetta one. The Rosetta version also slows down significantly when using vDSP instead of IPP.

To make matters worse, the gaps seem even bigger when doing multi-core processing: IPP benefits from it, while vDSP doesn’t really.


Apple has a new API for FFT, now claiming:

Where possible, use discrete Fourier transforms (DFTs) instead of fast Fourier transforms (FFTs). DFTs provide a convenient API that offers greater flexibility over the number of elements the routines transform. vDSP’s DFT routines switch to FFT wherever possible.

Using that one indeed seems a bit faster (it’s available for the benchmarked transform in our fft-bench branch).
Unfortunately, it only seems to support transforms up to order 13 :frowning:
On native ARM it performs an order-13 transform in 0.14 ms, rather than the 0.24 ms of the implementation currently in mainline JUCE.


Interesting finding. But where did you find the information that the maximum supported order is 13?

From trying to use it and seeing that it returned nullptr instead of creating the context.

I’ve now added support for PFFFT in our fft-bench branch. To use it, one needs to set JUCE_USE_PFFFT=1 and add the include path and the pffft sources to their project (I didn’t vendor it).

Also updated the benchmarks, and it appears to be very fast on the M1 native ARM!


pffft is awesome. I’ve been using it for years on all platforms. But be aware that there are multiple source-code repositories with different versions of it: older versions didn’t come with double support or ARM intrinsics, and there are now forks that extend it to use AVX.

This seems a nice repo, but maybe someone here knows an even better version?

I expect the Accelerate.framework to become faster than pffft again with some updates by Apple; something seems off with its FFT speed. There is a strong chance that future Apple Silicon chips will get larger SIMD types, and using Accelerate would then automatically speed up FFTs, just like on an Intel Mac with AVX.


It would be nice to include KFRlib in the benchmarks as well.

We’re using it in Nexus for FFT and convolution, so that we could have a well-performing Apple Silicon version ready for the launch of Apple Silicon. Before that, we used IPP.


Just worth noting: FFTW and KFR require licensing for commercial use.

Yeah, but at least in the case of KFR it’s a viable option. We’ve licensed KFR; we would never license FFTW with their ridiculous pricing.


muFFT might be another contender (MIT license). It seems very fast, and it might be possible to use sse2neon.h with it to get ARM support. It can only do power-of-two (2^n) sizes and 32-bit float, but for most audio use cases that’s all we need.


We are using this repository: Bitbucket

IIUC this is the original repo. The fork you linked to added features that were missing from the original in the past, but by now the original author has integrated the missing pieces, mostly ARM/NEON support.

As for KFR and muFFT, it’s not difficult to add either one to the JUCE FFT wrappers on top of our fft-bench branch.
You can then run the JUCE unit tests (from the DemoRunner) to validate your wrapper, and then easily compare it to the other engines. We’re already quite satisfied with pffft on ARM and IPP on Intel, so comparing more engines isn’t an urgent priority for us at the moment.
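The validation idea behind those unit tests can be sketched without JUCE: compare the engine under test against a naive DFT, which is slow but obviously correct. Here a recursive radix-2 FFT stands in for whichever wrapped engine you are checking (all names are illustrative, not JUCE's):

```cpp
#include <cmath>
#include <complex>
#include <vector>

using cvec = std::vector<std::complex<double>>;
constexpr double kPi = 3.141592653589793;

// Reference O(N^2) DFT: slow but obviously correct.
inline cvec naiveDft(const cvec& in)
{
    const std::size_t n = in.size();
    cvec out(n);
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t t = 0; t < n; ++t)
            out[k] += in[t] * std::polar(1.0, -2.0 * kPi * double(k * t) / double(n));
    return out;
}

// Recursive radix-2 FFT standing in for the wrapped engine under test.
inline cvec radix2Fft(const cvec& in)
{
    const std::size_t n = in.size();
    if (n == 1)
        return in;
    cvec even(n / 2), odd(n / 2);
    for (std::size_t i = 0; i < n / 2; ++i)
    {
        even[i] = in[2 * i];
        odd[i]  = in[2 * i + 1];
    }
    const cvec fe = radix2Fft(even), fo = radix2Fft(odd);
    cvec out(n);
    for (std::size_t k = 0; k < n / 2; ++k)
    {
        const auto tw  = std::polar(1.0, -2.0 * kPi * double(k) / double(n)) * fo[k];
        out[k]         = fe[k] + tw;
        out[k + n / 2] = fe[k] - tw;
    }
    return out;
}

// True when the engine's output matches the reference within `tol`.
inline bool matchesReference(const cvec& in, double tol = 1e-9)
{
    const cvec a = radix2Fft(in), b = naiveDft(in);
    for (std::size_t i = 0; i < in.size(); ++i)
        if (std::abs(a[i] - b[i]) > tol)
            return false;
    return true;
}
```

Running this check over a handful of sizes and input patterns catches the usual wrapper bugs (wrong scaling, conjugated twiddles, reversed bin order) before you trust any benchmark numbers.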

I assume this is about the vDSP_DFT_* family of functions.
Weirdly, I haven’t seen this order-13 limitation in my own development, which is quite confusing. Order-14 transforms appear to work just fine for me on M1, even though the Apple docs state the maximum order is 12.

By the way, these functions are not new at all; they were added in OS X 10.7. Is anyone using them for higher-order FFTs?

Indeed, I linked to it.

“New” is a relative term. They’re newer than the ones that JUCE uses.

Just FYI: in my quest for the fastest free FFT for audio processing on M1 Macs, I ended up using the Ne10 FFT routines. Ne10 uses the BSD-3 license, was written for smartphones, and uses some assembly that can’t be compiled on macOS, but there are also intrinsics versions of the important routines.
It took some work to get things running in my projects, but for those wondering, I found the effort well worth the performance gain. In my measurements (2^10 to 2^14 real-to-complex FFT and IFFT), the routines are about 20% faster than pffft.


Have you benched against vDSP on your system?

Ne10 only supports float samples, though.