You might want to consider Ooura FFT http://www.kurims.kyoto-u.ac.jp/~ooura/fft.html
It has a very permissive license and great performance, although the source code needs a little tweaking to use in a C++ wrapper.
I've made a small comparison of several FFT packages (results below; 'diff' is the relative error in dB):
- Laurent de Soras' FFTReal 2.11 http://ldesoras.free.fr/prod.html#src_audio
- KissFFT 1.30 http://sourceforge.net/projects/kissfft
- Julien Pommier's PFFFT https://bitbucket.org/jpommier/pffft (float only)
- Ooura (double only, can be adapted to float)
- Apple vDSP https://developer.apple.com/library/mac/documentation/Performance/Conceptual/vDSP_Programming_Guide/Introduction/Introduction.html
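As you can see, Apple vDSP is very well optimized (~2x the throughput of Ooura for doubles, and roughly 5x that of PFFFT for floats), but it is Apple-only. Ooura FFT is the best all-round performer among the cross-platform, liberally licensed packages: it is the fastest for doubles and second only to the float-only PFFFT for forward float transforms.
Also, the 64-bit (x64) builds are consistently faster than the 32-bit (x86) builds on the same machine.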
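For reading the 'diff' column: I only state that it is the relative error in dB, so the exact formula is not given here. A minimal sketch of one common way to get such a number, assuming it is the RMS error of a forward + inverse round trip relative to the RMS of the input (the helper name `relative_error_db` is made up for illustration):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper, not taken from any of the packages above.
// Assumes both vectors have the same length and the reference is not all zeros.
// Returns the error energy relative to the reference energy, in dB.
template <typename T>
double relative_error_db(const std::vector<T>& reference,
                         const std::vector<T>& roundtrip)
{
    double err_energy = 0.0;
    double ref_energy = 0.0;
    for (std::size_t i = 0; i < reference.size(); ++i) {
        const double r = static_cast<double>(reference[i]);
        const double d = static_cast<double>(roundtrip[i]) - r;
        err_energy += d * d;   // accumulate squared error
        ref_energy += r * r;   // accumulate squared reference
    }
    // 10*log10 of an energy ratio equals 20*log10 of the RMS amplitude ratio.
    return 10.0 * std::log10(err_energy / ref_energy);
}
```

Under that assumption, values around -135 dB for float and -310 dB for double are in the range you would expect from 24-bit and 53-bit mantissas. A bit-exact round trip would give minus infinity.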
MSVC 2013 x86, Core i7 2600
FFT[ 4096] fwd: 36.6us (3358 mflops), inv: 33.9us, type: DeSoras<float>, diff: -135.01dB, cycles: 60/smp
FFT[ 4096] fwd: 37.2us (3304 mflops), inv: 37.7us, type: Kiss<float>, diff: -135.68dB, cycles: 64/smp
FFT[ 4096] fwd: 23.7us (5193 mflops), inv: 26.7us, type: Pommier<float>, diff: -135.29dB, cycles: 43/smp
FFT[ 4096] fwd: 30.3us (4057 mflops), inv: 23.1us, type: Ooura<float>, diff: -135.88dB, cycles: 45/smp
FFT[ 4096] fwd: 41.7us (2949 mflops), inv: 41.6us, type: DeSoras<double>, diff: -310.71dB, cycles: 71/smp
FFT[ 4096] fwd: 47.5us (2585 mflops), inv: 41.7us, type: Kiss<double>, diff: -310.60dB, cycles: 76/smp
FFT[ 4096] fwd: 26.5us (4638 mflops), inv: 25.0us, type: Ooura<double>, diff: -311.12dB, cycles: 44/smp
MSVC 2013 x64, Core i7 2600
FFT[ 4096] fwd: 27.3us (4493 mflops), inv: 24.0us, type: DeSoras<float>, diff: -135.01dB, cycles: 43/smp
FFT[ 4096] fwd: 32.0us (3838 mflops), inv: 32.8us, type: Kiss<float>, diff: -135.68dB, cycles: 55/smp
FFT[ 4096] fwd: 17.7us (6950 mflops), inv: 17.9us, type: Pommier<float>, diff: -135.29dB, cycles: 30/smp
FFT[ 4096] fwd: 20.9us (5872 mflops), inv: 20.6us, type: Ooura<float>, diff: -135.88dB, cycles: 35/smp
FFT[ 4096] fwd: 31.1us (3945 mflops), inv: 33.0us, type: DeSoras<double>, diff: -310.71dB, cycles: 54/smp
FFT[ 4096] fwd: 32.1us (3827 mflops), inv: 33.4us, type: Kiss<double>, diff: -310.85dB, cycles: 56/smp
FFT[ 4096] fwd: 22.4us (5497 mflops), inv: 22.1us, type: Ooura<double>, diff: -311.15dB, cycles: 38/smp
Xcode 5, x64, Core i7 4790S
FFT[ 4096] fwd: 41.3us (2978 mflops), inv: 42.3us, type: DeSoras<float>, diff: -135.18dB, cycles: 65/smp
FFT[ 4096] fwd: 33.4us (3681 mflops), inv: 37.5us, type: Kiss<float>, diff: -134.95dB, cycles: 55/smp
FFT[ 4096] fwd: 20.3us (6045 mflops), inv: 26.5us, type: Pommier<float>, diff: -135.35dB, cycles: 36/smp
FFT[ 4096] fwd: 24.7us (4983 mflops), inv: 25.7us, type: Ooura<float>, diff: -135.22dB, cycles: 39/smp
FFT[ 4096] fwd: 4.2us (28949 mflops), inv: 4.6us, type: vDSP<float>, diff: -135.13dB, cycles: 6/smp
FFT[ 4096] fwd: 37.5us (3279 mflops), inv: 44.7us, type: DeSoras<double>, diff: -308.56dB, cycles: 64/smp
FFT[ 4096] fwd: 42.2us (2909 mflops), inv: 46.3us, type: Kiss<double>, diff: -308.96dB, cycles: 69/smp
FFT[ 4096] fwd: 27.9us (4397 mflops), inv: 26.1us, type: Ooura<double>, diff: -309.61dB, cycles: 42/smp
FFT[ 4096] fwd: 14.8us (8298 mflops), inv: 14.5us, type: vDSP<double>, diff: -308.47dB, cycles: 22/smp
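The mflops figures above appear consistent with the usual convention of counting 2.5 * N * log2(N) floating point operations for a real FFT of size N. That is an inference from the numbers, not something stated by the benchmark output. A rough sketch of how such figures can be produced, with made-up helper names:

```cpp
#include <chrono>
#include <cmath>

// Assumed flop count for a real-input FFT of size n: 2.5 * n * log2(n).
// Flops per microsecond is numerically the same as Mflops.
inline double real_fft_mflops(int n, double microseconds)
{
    const double flops = 2.5 * n * std::log2(static_cast<double>(n));
    return flops / microseconds;
}

// Hypothetical timing helper: runs the callable `transform` `repetitions`
// times and returns the average wall-clock time per call in microseconds.
template <typename F>
double average_microseconds(F&& transform, int repetitions)
{
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int i = 0; i < repetitions; ++i) {
        transform();
    }
    const std::chrono::duration<double, std::micro> elapsed = clock::now() - start;
    return elapsed.count() / repetitions;
}
```

For example, `real_fft_mflops(4096, 36.6)` gives about 3357, close to the 3358 listed for DeSoras<float> on x86. The small difference is presumably because the printed time is rounded.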