Comparing FFT engines

juce::dsp::FFT helpfully provides wrappers to several FFT engines (which are enabled conditionally according to preprocessor definitions), but how does one know which FFT engine to use?

It would have been nice if juce::dsp::FFT let us choose the engine not only at compile time but also at run time, so that one could write a simple benchmark that iterates over all of them and compares the results.
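JUCE itself only picks an engine at compile time, but the run-time selection wished for here boils down to keeping a small registry of engines behind one common signature. A minimal sketch outside JUCE, with a naive DFT standing in for the real backends (all names here are invented for illustration, not JUCE API):

```cpp
#include <cmath>
#include <complex>
#include <functional>
#include <string>
#include <vector>

// One entry per engine: a name plus a callable that performs a forward
// complex transform in place. A run-time-selectable build could register
// vDSP, IPP, MKL, FFTW, etc. behind this one signature.
struct FftEngine
{
    std::string name;
    std::function<void(std::vector<std::complex<float>>&)> forward;
};

// Naive O(N^2) DFT standing in for a real backend.
inline void naiveDft(std::vector<std::complex<float>>& data)
{
    const std::size_t n = data.size();
    std::vector<std::complex<float>> out(n);
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t t = 0; t < n; ++t)
        {
            const float angle = -2.0f * 3.14159265f * float(k * t) / float(n);
            out[k] += data[t] * std::complex<float>(std::cos(angle), std::sin(angle));
        }
    data = out;
}

// A benchmark can iterate over this list and time each engine in turn.
inline std::vector<FftEngine> availableEngines()
{
    return { { "naive-dft", naiveDft } };  // a real build would append the compiled-in backends
}
```

With this shape, the benchmark loop is just `for (auto& e : availableEngines()) { time e.forward(...); }`, which is exactly what a cross-engine comparison needs.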

I’ve made proof-of-concept modifications to it that allow choosing the FFT engine, and incorporated a small benchmark into the SimpleFFTDemo.

I then compared the different engines; here are the benchmark results on my M1 Mac:

Engine                    Order=10    Order=15
Intel IPP (Rosetta)       23 μs       3 ms
Intel MKL (Rosetta)       23 μs       2.9 ms
Apple vDSP (Native)       20 μs       2.5 ms
Apple vDSP (Rosetta)      27 μs       3.5 ms
FFTW 3.3.9 (Native)       18 μs       2.6 ms
FFTW 3.3.9 (Rosetta)      29 μs       3.1 ms
JUCE Fallback (Native)    127 μs      26 ms
JUCE Fallback (Rosetta)   194 μs      35 ms
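For anyone wanting to reproduce numbers like these, the measurement itself is just a steady-clock average around the transform call. A minimal harness sketch (the harness is generic; the `perform` call in the comment refers to juce::dsp::FFT's API and is not part of the compiled snippet):

```cpp
#include <chrono>

// Times `iterations` calls of a workload and returns the average
// per-call duration in microseconds.
template <typename Fn>
double averageMicroseconds(Fn&& workload, int iterations)
{
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int i = 0; i < iterations; ++i)
        workload();
    const auto elapsed = clock::now() - start;
    return std::chrono::duration<double, std::micro>(elapsed).count() / iterations;
}

// Usage with a real engine would look something like:
//   averageMicroseconds([&] { fft.perform(input, output, false); }, 100);
```

Averaging over many iterations smooths out scheduler noise, which matters when a single order-10 transform takes only tens of microseconds.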

Some notes:

  • YMMV. This benchmark ran only on my own device, and it isn’t super precise (I just eyeballed the values from the app). Feel free to run the benchmark on your own devices and add your results.
  • JUCE requires small modifications to support IPP on platforms other than Windows (included in the change). Also, JUCE’s compilation breaks when both MKL and FFTW are enabled (fix included in the change).
  • To use FFTW, one needs (besides installing it) to set the run-time dynamic-library search path so it can be found (LD_LIBRARY_PATH=/opt/homebrew/Cellar/fftw/3.3.9_1/lib:/usr/local/homebrew/Cellar/fftw/3.3.9_1/lib in my case).

Here are the results from a release build, and apparently there’s a huge difference.
It seems that in debug builds, most of the time spent in the benchmark was actually in the JUCE code surrounding the optimized FFT backends.

This significantly changes the results!

  • Intel IPP is faster than Apple’s vDSP, even with Apple’s implementation running in a native process and Intel’s under Rosetta
  • And IPP is indeed faster than MKL
  • JUCE’s fallback FFT is only about 4x slower than Intel’s implementation

Engine          ARM 2^10    Rosetta 2^10    ARM 2^15    Rosetta 2^15
Intel IPP       -           8 μs            -           0.9 ms
Intel MKL       -           8 μs            -           1.1 ms
Apple vDSP      9 μs        12 μs           0.9 ms      1.6 ms
PFFFT           5 μs        9 μs            0.5 ms      1.0 ms
JUCE Fallback   13 μs       20 μs           3.6 ms      4.2 ms

It is definitely interesting to benchmark FFT libraries, but in my experience the FFT is by far not the bottleneck when writing algorithms for spectral processing. For example, a fast way to calculate atan2 would bring more gain. Just my two cents :wink:
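To illustrate the atan2 point: phase extraction in a vocoder-style loop calls atan2 once per bin per hop, and a polynomial approximation is often good enough there. A sketch of one common approximation (this is a generic textbook-style formula with ~0.004 rad worst-case error, not code from any library mentioned in this thread):

```cpp
#include <algorithm>
#include <cmath>

// Polynomial approximation of atan on [0, 1]:
//   atan(a) ≈ (pi/4)*a + 0.273*a*(1 - a), max error ~0.004 rad.
// fastAtan2 reduces any (y, x) to that range, then fixes up the octant
// and quadrant so the result matches std::atan2's conventions.
inline float fastAtan2(float y, float x)
{
    if (x == 0.0f && y == 0.0f)
        return 0.0f;
    const float ax = std::fabs(x), ay = std::fabs(y);
    const float a  = std::min(ax, ay) / std::max(ax, ay);  // a in [0, 1]
    float r = a * (0.78539816f + 0.273f * (1.0f - a));     // ≈ atan(a)
    if (ay > ax)  r = 1.57079633f - r;   // reflect past 45°
    if (x < 0.0f) r = 3.14159265f - r;   // left half-plane
    return (y < 0.0f) ? -r : r;          // lower half-plane
}
```

Whether ~0.004 rad of phase error is acceptable depends on the algorithm; for rough phase tracking it usually is, while precise frequency estimation may need a higher-order polynomial.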

Well, I know of at least one accomplished plugin where the FFT takes up a significant enough share of the CPU usage to make the native ARM version slower than the Rosetta one. The Rosetta version also slows down significantly when using vDSP instead of IPP.

To make matters worse, the gaps seem even bigger when doing multi-core processing: IPP benefits from it, while vDSP doesn’t really.


Apple has a new API for FFT, now claiming:

Where possible, use discrete Fourier transforms (DFTs) instead of fast Fourier transforms (FFTs). DFTs provide a convenient API that offers greater flexibility over the number of elements the routines transform. vDSP’s DFT routines switch to FFT wherever possible.

Using that one indeed seems a bit faster (it’s available for the benchmarked transform in our fft-bench branch).
Unfortunately, it only seems to support transforms up to order 13 :frowning:
On native ARM it performs an order-13 transform in 0.14 ms, rather than the 0.24 ms of the implementation currently in mainline JUCE.


Interesting finding. But where did you find the information that the maximum supported order is 13?

From trying to use it and seeing that it returned nullptr instead of creating the context.

I’ve now added support for PFFFT in our fft-bench branch. To use it, one needs to set JUCE_USE_PFFFT=1 and add the include path and the pffft sources to their project (I didn’t vendor it).

Also updated the benchmarks, and it appears to be very fast on the M1 native ARM!


pffft is awesome. I’ve been using it for years on all platforms. But be aware that there are multiple source-code repositories with different versions of it: older versions didn’t come with double support or ARM intrinsics, and there are now forks that extend it to use AVX.

This seems a nice repo, but maybe someone here knows an even better version?

I expect the Accelerate.framework to become faster than pffft again with some updates by Apple; something seems off with its FFT speed. There is a strong chance that future Apple Silicon chips will get larger SIMD types, and using Accelerate would then automatically speed up FFTs, just like on an Intel Mac with AVX.


It would be nice to include KFRlib in the benchmarks as well.

We’re using it in Nexus for FFT and convolution, so that we could have a well-performing Apple Silicon version ready for the launch of Apple Silicon. Before that, we used IPP.


Just worth noting: FFTW and KFR require licensing for commercial use.

Yeah, but at least in the case of KFR it’s a viable option. We’ve licensed KFR; we would never license FFTW with their ridiculous pricing.


muFFT might be another contender (MIT license). It seems very fast, and it might be possible to use sse2neon.h with it to get ARM support. It can only do power-of-two (2^n) sizes and 32-bit float, but for most audio use cases that’s all we need.


We are using this repository: Bitbucket

IIUC this is the original repo. The fork you linked to added features that were missing from the original in the past, but by now the original author has integrated the missing pieces, mostly ARM/NEON support.

As for KFR and muFFT, it’s not difficult to add either one to the JUCE FFT wrappers on top of our fft-bench branch.
You can then run the JUCE unit tests (from the DemoRunner) to validate your wrapper, and then easily compare it to the other engines. We’re already quite satisfied with pffft on ARM and IPP on Intel, so comparing more engines isn’t an urgent priority for us at the moment.
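The validation idea behind those unit tests can be sketched without JUCE: compare the engine under test against a naive DFT, which is slow but obviously correct. Here a recursive radix-2 FFT stands in for whichever wrapped engine you are checking (all names are illustrative, not JUCE's):

```cpp
#include <cmath>
#include <complex>
#include <vector>

using cvec = std::vector<std::complex<double>>;
constexpr double kPi = 3.141592653589793;

// Reference O(N^2) DFT: slow but obviously correct.
inline cvec naiveDft(const cvec& in)
{
    const std::size_t n = in.size();
    cvec out(n);
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t t = 0; t < n; ++t)
            out[k] += in[t] * std::polar(1.0, -2.0 * kPi * double(k * t) / double(n));
    return out;
}

// Recursive radix-2 FFT standing in for the wrapped engine under test.
inline cvec radix2Fft(const cvec& in)
{
    const std::size_t n = in.size();
    if (n == 1)
        return in;
    cvec even(n / 2), odd(n / 2);
    for (std::size_t i = 0; i < n / 2; ++i)
    {
        even[i] = in[2 * i];
        odd[i]  = in[2 * i + 1];
    }
    const cvec fe = radix2Fft(even), fo = radix2Fft(odd);
    cvec out(n);
    for (std::size_t k = 0; k < n / 2; ++k)
    {
        const auto tw  = std::polar(1.0, -2.0 * kPi * double(k) / double(n)) * fo[k];
        out[k]         = fe[k] + tw;
        out[k + n / 2] = fe[k] - tw;
    }
    return out;
}

// True when the engine's output matches the reference within `tol`.
inline bool matchesReference(const cvec& in, double tol = 1e-9)
{
    const cvec a = radix2Fft(in), b = naiveDft(in);
    for (std::size_t i = 0; i < in.size(); ++i)
        if (std::abs(a[i] - b[i]) > tol)
            return false;
    return true;
}
```

Running this check over a handful of sizes and input patterns catches the usual wrapper bugs (wrong scaling, conjugated twiddles, reversed bin order) before you trust any benchmark numbers.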

I assume this is about the vDSP_DFT_* family of functions.
Weirdly, I haven’t seen this order-13 limitation in my own development, which is quite confusing. Order-14 transforms appear to work just fine for me on M1, even though the Apple docs state the maximum order is 12.

By the way, these functions are not new at all; they were added in OS X 10.7. Is anyone using them for higher-order FFTs?

Indeed, I linked to it.

“New” is a relative term. They’re newer than the ones that JUCE uses.

Just FYI: in my quest for the fastest free FFT for audio processing on M1 Macs, I ended up using the Ne10 FFT routines. Ne10 uses the BSD-3 license, was written for smartphones, and uses some assembly that can’t be compiled on macOS, but there are also intrinsics versions of the important routines.
It took some work to get things running in my projects, but for those wondering, I found the effort well worth the performance gain. In my measurements (2^10 to 2^14 real-to-complex FFT and IFFT), the routines are about 20% faster than pffft.


Have you benched against vDSP on your system?

Ne10 only supports float samples, though.