FFTS -- Fastest FFT implementation, and Free/BSD License


#1

I was just chatting with moala on the IRC channel (#juce on irc.freenode.net) and he linked http://stackoverflow.com/questions/463181/c-c-fft-library-with-non-gpl-license

Notice the answer that says:

Check out FFTS. It has a permissive BSD license, and it is faster than FFTW, IPP, KissFFT, Apple vDSP etc on ARM and x86.

http://github.com/anthonix/ffts

That answer is written by the author of FFTS. If he is correct about the benchmarking, then this would make a super little Juce module...


#2

Nice find. I found a link to their IEEE paper which shows some benchmark charts: http://www.cs.waikato.ac.nz/~ihw/papers/13-AMB-IHW-MJC-FastFourier.pdf

Some more info but unfortunately no doco here: http://anthonix.com/ffts/


#3

Yep, that's me [moala, the one who posted that link; just to let you know, that's all].


#4

Unfortunately the current version doesn't support MSVC, and the author(s) told me they had no current plans for this.

There is a fork using CMake (https://github.com/henrygouk/ffts) that builds on OSX, but I couldn't manage to make it work on Windows.


#5

+1, just to say it would be really nice to have a good FFT class inside JUCE!!


#6

As far as I can see, the article does not mention the IPP version (currently at 8.1x). I have two licenses (Win and OSX) so I am troubled by cognitive dissonance and will stick to it ;-)


#7

Lorcan seems to have struck FFTS off the list by pointing out that it is not portable across Juce's target platforms.

My bad, I didn't even consider that this might be the case.

So I guess the ideal candidate would sacrifice some peak efficiency for platform independence (and be reasonably future-proof).

In my own work I've done something like this in the past: a robust but suboptimal pure ANSI-C FFT, with #define checks that replace it with a wrapper around the Accelerate framework's FFT functions when the platform is OSX/iOS.

http://stackoverflow.com/questions/3398753/using-the-apple-fft-and-accelerate-framework/
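Roughly along these lines, as a sketch (portableFFT_forward() here is just a hypothetical placeholder for whatever ANSI-C routine you ship, and real code would cache the FFTSetup rather than rebuilding it per call):

    // Sketch of the compile-time dispatch described above: on Apple platforms the
    // FFT call wraps vDSP from the Accelerate framework, elsewhere it falls back
    // to a portable routine. portableFFT_forward() is a hypothetical placeholder
    // for the plain ANSI-C implementation.
    #include <cstddef>
    #include <vector>

    #if defined(__APPLE__)
     #include <Accelerate/Accelerate.h>

    // In-place forward real FFT of 2^log2n samples via vDSP's split-complex API.
    // (Note: vDSP's real FFT output is scaled by 2 relative to the textbook DFT.)
    inline void forwardFFT (float* data, std::size_t log2n)
    {
        const std::size_t n = std::size_t (1) << log2n;
        FFTSetup setup = vDSP_create_fftsetup (log2n, kFFTRadix2);   // cache this in real code

        std::vector<float> re (n / 2), im (n / 2);
        DSPSplitComplex split { re.data(), im.data() };

        vDSP_ctoz (reinterpret_cast<const DSPComplex*> (data), 2, &split, 1, n / 2);
        vDSP_fft_zrip (setup, &split, 1, log2n, FFT_FORWARD);
        vDSP_ztoc (&split, 1, reinterpret_cast<DSPComplex*> (data), 2, n / 2);

        vDSP_destroy_fftsetup (setup);
    }

    #else
    extern "C" void portableFFT_forward (float* data, std::size_t n); // hypothetical ANSI-C fallback

    inline void forwardFFT (float* data, std::size_t log2n)
    {
        portableFFT_forward (data, std::size_t (1) << log2n);
    }
    #endif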


#8

You might want to consider the Ooura FFT: http://www.kurims.kyoto-u.ac.jp/~ooura/fft.html

It has a very permissive license and great performance, although the source code needs a little tweaking to use in a C++ wrapper.
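The tweaking really is minor; a minimal wrapper can look something like this sketch, assuming Ooura's rdft() from fft4g.c / fftsg.c is compiled as C alongside it (double precision, power-of-two sizes):

    // Sketch of a minimal C++ wrapper around Ooura's rdft() (the real FFT from
    // fft4g.c / fftsg.c), double precision, power-of-two sizes. Assumes the C
    // source is compiled alongside and declared with C linkage.
    #include <cmath>
    #include <cstddef>
    #include <vector>

    extern "C" void rdft (int n, int isgn, double* a, int* ip, double* w);

    class OouraFFT
    {
    public:
        explicit OouraFFT (int size)
            : n (size),
              ip ((std::size_t) (2 + std::ceil (std::sqrt (size / 2.0))), 0), // ip[0]=0 makes rdft build its tables on first use
              w ((std::size_t) size / 2)
        {}

        // Forward real FFT, in place. Output is half-complex: a[2k], a[2k+1] hold
        // Re/Im of bin k, and the Nyquist real part is stored in a[1].
        void forward (double* a)  { rdft (n, +1, a, ip.data(), w.data()); }

        // Inverse FFT, in place. Ooura leaves a factor of 2/n to the caller.
        void inverse (double* a)
        {
            rdft (n, -1, a, ip.data(), w.data());
            for (int i = 0; i < n; ++i)
                a[i] *= 2.0 / n;
        }

    private:
        int n;
        std::vector<int> ip;
        std::vector<double> w;
    };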

I've made a little comparison of various FFT packages below ('diff' is the relative error in dB)

  • Laurent DeSoras FFTReal 2.11 http://ldesoras.free.fr/prod.html#src_audio
  • KissFFT 1.30 http://sourceforge.net/projects/kissfft
  • Julien Pommier's PFFFT https://bitbucket.org/jpommier/pffft (float only)
  • Ooura (double only, can be adapted to float)
  • Apple vDSP https://developer.apple.com/library/mac/documentation/Performance/Conceptual/vDSP_Programming_Guide/Introduction/Introduction.html

As you can see, Apple vDSP is very well optimized (~2x throughput for doubles), but Ooura FFT is the best performer among the cross-platform, liberally-licensed packages.

Also 64-bit compilation seems to give better performance.

MSVC 2013 x86, core i7 2600

 FFT[ 4096] fwd:   36.6us (3358 mflops), inv:   33.9us, type: DeSoras<float>, diff:  -135.01dB, cycles: 60/smp
 FFT[ 4096] fwd:   37.2us (3304 mflops), inv:   37.7us, type:    Kiss<float>, diff:  -135.68dB, cycles: 64/smp
 FFT[ 4096] fwd:   23.7us (5193 mflops), inv:   26.7us, type: Pommier<float>, diff:  -135.29dB, cycles: 43/smp
 FFT[ 4096] fwd:   30.3us (4057 mflops), inv:   23.1us, type:   Ooura<float>, diff:  -135.88dB, cycles: 45/smp
 FFT[ 4096] fwd:   41.7us (2949 mflops), inv:   41.6us, type: DeSoras<double>, diff:  -310.71dB, cycles: 71/smp
 FFT[ 4096] fwd:   47.5us (2585 mflops), inv:   41.7us, type:   Kiss<double>, diff:  -310.60dB, cycles: 76/smp
 FFT[ 4096] fwd:   26.5us (4638 mflops), inv:   25.0us, type:  Ooura<double>, diff:  -311.12dB, cycles: 44/smp

MSVC 2013 x64, core i7 2600

 FFT[ 4096] fwd:   27.3us (4493 mflops), inv:   24.0us, type: DeSoras<float>, diff:  -135.01dB, cycles: 43/smp
 FFT[ 4096] fwd:   32.0us (3838 mflops), inv:   32.8us, type:    Kiss<float>, diff:  -135.68dB, cycles: 55/smp
 FFT[ 4096] fwd:   17.7us (6950 mflops), inv:   17.9us, type: Pommier<float>, diff:  -135.29dB, cycles: 30/smp
 FFT[ 4096] fwd:   20.9us (5872 mflops), inv:   20.6us, type:   Ooura<float>, diff:  -135.88dB, cycles: 35/smp
 FFT[ 4096] fwd:   31.1us (3945 mflops), inv:   33.0us, type: DeSoras<double>, diff:  -310.71dB, cycles: 54/smp
 FFT[ 4096] fwd:   32.1us (3827 mflops), inv:   33.4us, type:   Kiss<double>, diff:  -310.85dB, cycles: 56/smp
 FFT[ 4096] fwd:   22.4us (5497 mflops), inv:   22.1us, type:  Ooura<double>, diff:  -311.15dB, cycles: 38/smp

XCode 5, x64, core i7 4790s

 FFT[ 4096] fwd:   41.3us (2978 mflops), inv:   42.3us, type: DeSoras<float>, diff:  -135.18dB, cycles: 65/smp
 FFT[ 4096] fwd:   33.4us (3681 mflops), inv:   37.5us, type:    Kiss<float>, diff:  -134.95dB, cycles: 55/smp
 FFT[ 4096] fwd:   20.3us (6045 mflops), inv:   26.5us, type: Pommier<float>, diff:  -135.35dB, cycles: 36/smp
 FFT[ 4096] fwd:   24.7us (4983 mflops), inv:   25.7us, type:   Ooura<float>, diff:  -135.22dB, cycles: 39/smp
 FFT[ 4096] fwd:    4.2us (28949 mflops), inv:    4.6us, type:    vDSP<float>, diff:  -135.13dB, cycles: 6/smp
 FFT[ 4096] fwd:   37.5us (3279 mflops), inv:   44.7us, type: DeSoras<double>, diff:  -308.56dB, cycles: 64/smp
 FFT[ 4096] fwd:   42.2us (2909 mflops), inv:   46.3us, type:   Kiss<double>, diff:  -308.96dB, cycles: 69/smp
 FFT[ 4096] fwd:   27.9us (4397 mflops), inv:   26.1us, type:  Ooura<double>, diff:  -309.61dB, cycles: 42/smp
 FFT[ 4096] fwd:   14.8us (8298 mflops), inv:   14.5us, type:   vDSP<double>, diff:  -308.47dB, cycles: 22/smp

#9

Btw, Ooura is included in the ICST DSP Library. I haven't gone back and checked, but I think that already wraps it in C++ and in floats.

 

EDIT: went back and checked, and it does.


#10

Hey! Thanks Jules for the recent juce_FFT addition! :)
Have any of you given it a try yet?

Thanks for the additions to FloatVectorOperations too, btw.


#11

You may have forgotten to turn on some optimization or SSE flags; PFFFT should be well above 6 GFLOPS on a Core i7 4790.


#12

I also wonder whether the FFTReal test was done with the fixed-size (compile-time) algorithm or the dynamic one.


#13

I have just run more or less the same test as lorcan, with FFTReal 2.11 (fixed size and dynamic size), Ooura FFT, the JUCE FFT (lol), FFTW3, Intel MKL FFT, the vDSP / Accelerate FFT, and Julien Pommier's PFFFT.

FFTW3 / Intel / vDSP are clearly on another planet from the other ones, outperforming them by a large percentage. However, I was also surprised by PFFFT's performance and by how easy it is to use. As a result, I'm going to use it as my no-brainer FFT library in the future, instead of FFTReal or Ooura. You can also do a SIMD convolution with it in only one line of code (see the sketch below). The only downside is that it is currently float only.
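To give an idea of the convolution part, here is a rough single-block sketch using the calls from pffft.h (the impulse response is assumed to be already transformed into PFFFT's internal layout, and a real engine would allocate the buffers once, not per block):

    // Sketch of a single-block frequency-domain convolution with PFFFT, using
    // the API from pffft.h. 'ir_dft' must already be the forward transform of
    // the impulse response, in PFFFT's internal (unordered) layout. N should be
    // a size PFFFT supports (a power of two is the safe choice).
    #include "pffft.h"
    #include <cstring>

    void convolveBlock (PFFFT_Setup* setup, int N,
                        const float* signal, const float* ir_dft, float* out)
    {
        float* sigDft = (float*) pffft_aligned_malloc (N * sizeof (float));
        float* acc    = (float*) pffft_aligned_malloc (N * sizeof (float));
        float* work   = (float*) pffft_aligned_malloc (N * sizeof (float));

        std::memset (acc, 0, N * sizeof (float));

        // The unordered transform is enough here: zconvolve works on that layout.
        pffft_transform (setup, signal, sigDft, work, PFFFT_FORWARD);
        pffft_zconvolve_accumulate (setup, sigDft, ir_dft, acc, 1.0f / N);
        pffft_transform (setup, acc, out, work, PFFFT_BACKWARD);

        pffft_aligned_free (sigDft);
        pffft_aligned_free (acc);
        pffft_aligned_free (work);
    }

The setup itself comes from pffft_new_setup (N, PFFFT_REAL).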


#14

What does a 'large percentage' look like? :)


#15

I developed a minimal plug-in with JUCE which loads impulse response files and does the convolution (overlap-save + zero latency, no partitioning) with a fixed impulse response size of 1024 samples, and an FFT algorithm you can choose from a combobox.

I have included FFTReal 2.11, JUCE's FFT, FFTW3, Intel MKL FFT, the implementation of Ooura FFT from Lofi-Hifi for JUCE, and the Pretty Fast FFT from Julien Pommier + FFTPackv4 NETLIB.

With 128 samples of audio buffer latency at 48 kHz, on my Intel i5, the algorithm takes 6% CPU for the JUCE FFT (lol), 1.0% for FFTReal 2 (fixed sizes), 1.2% for FFTReal 2 (dynamic sizes), 0.4-0.5% for FFTW3, 0.4-0.5% for Intel, 0.6-0.7% for PFFFT and 1.1% for the Ooura FFT. All done in Reaper 64, Windows 8.1.

That's a really raw test, but it should give you an idea. I might do a more accurate test later. But that's enough for me to choose PFFFT in my future projects.


#16

I've updated my code since then, and here are the results I get with an i7 2600K (Win 8.1, 64-bit):

  • 32bit, size 65536 = 2^16
fwd:   1.3ms   2.0 GFLOPS inv:   1.3ms   DeSoras<float> snr:  -81.93dB 133 c/s
fwd:   1.1ms   2.4 GFLOPS inv:   1.1ms      Kiss<float> snr: -134.74dB 111 c/s
fwd: 324.4us   8.1 GFLOPS inv: 360.4us     PFFFT<float> snr: -134.84dB  35 c/s
fwd: 263.1us  10.0 GFLOPS inv: 358.6us      FFTS<float> snr: -134.78dB  32 c/s
fwd: 690.7us   3.8 GFLOPS inv: 681.7us     Ooura<float> snr: -134.66dB  71 c/s
fwd: 207.3us  12.6 GFLOPS inv: 232.1us Intel IPP<float> snr: -134.77dB  22 c/s
  • 64bit
fwd: 993.0us  2.6 GFLOPS inv:   1.0ms   DeSoras<float> snr:  -81.93dB 105 c/s
fwd: 904.7us  2.9 GFLOPS inv: 901.1us      Kiss<float> snr: -134.74dB  93 c/s
fwd: 266.4us  9.8 GFLOPS inv: 262.6us     PFFFT<float> snr: -134.87dB  27 c/s
fwd: 138.8us 18.9 GFLOPS inv: 153.5us      FFTS<float> snr: -134.78dB  15 c/s
fwd: 557.6us  4.7 GFLOPS inv: 544.9us     Ooura<float> snr: -134.69dB  57 c/s
fwd: 153.8us 17.0 GFLOPS inv: 162.9us Intel IPP<float> snr: -134.38dB  16 c/s

FFTS really shines on 64-bit, thanks to its dynamic code generator (currently not available under x86).

Anthony Blake, the author of this library, really did an amazing job here.

So unless you need double support, I definitely see little reason not to use FFTS.


#17

So it's finally possible to use FFTS on Windows? Then I need to check this out, thanks for the report!

Edit: apparently, this is the starting point for using FFTS with VS (via CMake):

https://github.com/linkotec/ffts

http://www.cmake.org/


#18

Wouldn’t it be nice if somebody turned this into a JUCE module (hint, hint)!


#19

So I have finally been able to use FFTS with VS2013. It was a little painful because of the lack of documentation... Some information is available in the header ffts.h and in the test C files, but I had to work out for myself that the complex data is in interleaved format (the real and imaginary parts of each complex value are adjacent), and I don't know how I missed the first time that -1 must be passed as the sign parameter for the forward FFT and +1 for the inverse (which didn't seem logical at all). I also had to change a few settings in the library project generated with CMake (static linkage, optimization stuff).
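For anyone else starting out, this is roughly the call sequence as I understand it from ffts.h (buffers are simplified here; FFTS also wants them SIMD-aligned, which real code should take care of):

    // Rough sketch of basic FFTS usage, as far as I understand ffts.h:
    // complex data is interleaved (re, im, re, im, ...), sign -1 = forward,
    // sign +1 = inverse. Real code should use 16/32-byte aligned buffers.
    #include "ffts.h"
    #include <vector>

    void fftsRoundTrip (size_t N)                  // N = number of complex points
    {
        std::vector<float> in (2 * N), freq (2 * N), out (2 * N);

        ffts_plan_t* fwd = ffts_init_1d (N, -1);   // -1 => forward transform
        ffts_plan_t* inv = ffts_init_1d (N, +1);   // +1 => inverse transform

        ffts_execute (fwd, in.data(),   freq.data());
        ffts_execute (inv, freq.data(), out.data());  // gives N * in (unnormalised)

        ffts_free (fwd);
        ffts_free (inv);
    }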

Right now, my convolution engine works with FFTS, and indeed it is faster than the one using PFFFT. The only downside is that I don't understand why the complex data isn't in split format instead (like in FFTReal, for example), which is so much better for SIMD convolution, and which is also the opinion of the FFTS author in his PhD thesis about FFTS!
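To illustrate what I mean about split format, here is a little sketch of my own (nothing taken from FFTS) of a complex multiply on four bins with SSE intrinsics; with split real/imaginary arrays it is just element-wise arithmetic, with no lane shuffling:

    // (ar + i*ai) * (br + i*bi) for 4 consecutive bins, split format.
    #include <xmmintrin.h>

    void complexMultiply4 (const float* ar, const float* ai,
                           const float* br, const float* bi,
                           float* outRe, float* outIm)
    {
        __m128 xr = _mm_loadu_ps (ar), xi = _mm_loadu_ps (ai);
        __m128 yr = _mm_loadu_ps (br), yi = _mm_loadu_ps (bi);

        // re = ar*br - ai*bi, im = ar*bi + ai*br
        _mm_storeu_ps (outRe, _mm_sub_ps (_mm_mul_ps (xr, yr), _mm_mul_ps (xi, yi)));
        _mm_storeu_ps (outIm, _mm_add_ps (_mm_mul_ps (xr, yi), _mm_mul_ps (xi, yr)));
    }

With interleaved data you need shuffles or special instructions to do the same thing.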


#20

Hi All,

I know this is an old thread but this seems the best place!

Many of you seem to have tried out PFFFT, and I'm only just playing with it for the first time. I'm getting some weird ordering (not the usual packing of Nyquist/DC, I know about that). With the SIMD code in particular I'm seeing blocks of four samples shuffled in weird ways (for example, imaginary components 1, 2, 3 are at indices 5, 6, 7).

Anyone else seen that?