No performance improvement with FloatVectorOperations

I rewrote some code to use FloatVectorOperations. I’d been counting on a big boost from this since the start of the project, so I was worried when there was no performance improvement. The good news is that when I replaced FloatVectorOperations with Accelerate Framework calls on a Mac, total CPU use dropped to about 25% of what it was previously. But I’d rather use something totally cross-platform. I’ve tried throwing every compiler option I could think of at the problem, but nothing seems to have an effect. My vectors are all doubles and I’m just doing basic multiply+add and multiply+sub operations.
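
For reference, the operations are roughly of this shape (a simplified sketch with placeholder names, not the actual project code):

#include <JuceHeader.h>   // the usual Projucer-generated umbrella header

// Simplified sketch of the kind of per-block work involved. The buffer names
// and the scalar 'gain' are placeholders, not my real variables.
void processBlock (double* dest, const double* src, double gain, int numValues)
{
    // multiply+add: dest[i] += src[i] * gain
    juce::FloatVectorOperations::addWithMultiply (dest, src, gain, numValues);

    // multiply+sub: dest[i] -= src[i] * gain (shown as a plain loop here)
    for (int i = 0; i < numValues; ++i)
        dest[i] -= src[i] * gain;
}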

Is there some flag I need to set somewhere to enable using SIMD to accelerate these calls?

BTW, one fairly common operation (I think) that’s missing from FloatVectorOperations is summing a vector.

1 Like

What’s wrong with FloatVectorOperations::add (float * dest, const float * src, int numValues)?
Or did I miss something?

About the improvement: what are you comparing against? Note that AudioBuffer::addFrom already uses FloatVectorOperations…

And surely you are measuring in release mode…?

What’s wrong with FloatVectorOperations::add (float * dest, const float * src, int numValues)?

I want to sum a vector of floats to a single float, not do an element-wise addition of two vectors.

About the improvement: what are you comparing against?

Unoptimized for loops. The Accelerate Framework functions yielded a 4x performance improvement overall (so probably more like 5-6x in this particular code). If FloatVectorOperations were actually using SSE, it should have made a very noticeable difference.

And surely you are measuring in release mode…?

I’m fairly sure that I tested both release and debug builds. Are vector operations disabled in debug builds? (I’m absolutely certain that I tested like against like, not a release build against a debug build.)

I see. I am not 100% sure, but I think that is not a use case for SIMD, since SIMD is meant to apply one instruction to multiple data elements at once. But I could be mistaken…

Just a guess, but I saw some Apple documents which stated that Accelerate uses AVX if possible. Maybe you’re experiencing the speedup because FloatVectorOperations only uses SSE? And maybe your compiler already vectorized some of your code automatically, so there was no speedup when applying manual vectorization?

I want to sum a vector of floats to a single float, not do an element-wise addition of two vectors.

I see. I am not 100% sure, but I think that is not a use case for SIMD, since SIMD is meant to apply one instruction to multiple data elements at once. But I could be mistaken…

That’s quite possible! Accelerate has a function for it, but I did not dig into the implementation and I’m not in any way an expert on the actual instructions available.
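
For what it’s worth, the Accelerate function I had in mind is vDSP_sve (vDSP_sveD for doubles), used roughly like this (a minimal sketch, not my actual code):

#include <Accelerate/Accelerate.h>

// Sum a vector of doubles to a single value. 'data' and 'n' are placeholders.
double sumVector (const double* data, vDSP_Length n)
{
    double result = 0.0;
    vDSP_sveD (data, 1, &result, n);   // result = data[0] + data[1] + ... + data[n-1]
    return result;
}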

As I said, I’m not an expert on the low-level stuff either, so let’s hope the sum function gets added… :slight_smile:

@bobwalker this x1000. AVX will make a huge difference - up to 2x over SSE, since it operates on twice as much data in the same number of cycles.

Additionally, there’s probably some extreme low-level craziness going on in Accelerate to squeeze every possible cycle out of the hardware. I’m guessing Apple tunes its algorithms for each processor family, ordering instructions based on instruction latency and whatnot.

1 Like

And maybe your compiler already vectorized some of your code automatically, so there was no speedup when applying manual vectorization?

I did consider that that might be the case when I first saw no improvement. I rebuilt with -fno-vectorize and it didn’t make a significant difference. My original code was full of heavily interleaved operations, since I planned to convert it completely later. Even LLVM would have had a really hard time untangling all of it.

I went ahead and re-tested all three approaches, making sure everything was built for release. The numbers below are average CPU load over an extended steady state.

  • No explicit SIMD, LLVM auto-vectorization disabled: 10.9%
  • No explicit SIMD, LLVM auto-vectorization enabled: 10%
  • FloatVectorOperations: 12.2%
  • Accelerate Framework: 6%

You can see that auto-vectorization made only a very small difference. In debug builds, Accelerate is 4x faster, but in release builds it’s only about 2x faster. The rest of the performance gain is eaten up by -O3 vs -O0. No surprise.

What is surprising is that FloatVectorOperations is slightly slower than doing no vectorization at all. I’m still assuming that I’ve just somehow got my build configured incorrectly and it’s running in some sort of fallback mode, but I’m not sure what’s wrong.

I’m guessing Apple tunes its algorithms for each processor family, ordering instructions based on instruction latency and whatnot.

Possibly. I’m not going to be disappointed if it’s a little slower than Accelerate. I just want it to run faster than no vectorization at all. :slight_smile:

Interesting. Any chance you could whip up a test app that does similar operations and produces similar performance results? I’m really into SIMD stuff, so I’d be curious to take a look at your code and see if there’s anything that could be causing performance issues.

Or perhaps you could run a time profile via Instruments and see which FloatVectorOperations functions are taking the longest?

Half a year ago I was developing my own extended set of float vector operations, as I’m working on a project that makes heavy use of complex-valued vectors, which need somewhat different calculations than the usual real-valued vectors when multiplying them or computing absolute values. To get the best performance, I ran some benchmarks and found something quite interesting: the performance gain was also not that great when just exchanging regular for-loops for FloatVectorOperations calls. But slicing my data into sub-vectors that exactly matched the CPU cache line size, iterating over those sub-vectors in a for loop, and doing the actual computation on each sub-vector with FloatVectorOperations gave a huge performance boost.
I thought I had documented these benchmarks better back then, but I can’t find it anywhere anymore. Anyway, I came to the conclusion that applying FloatVectorOperations to bigger vectors runs into a memory bottleneck, while slicing the vectors into cache-line-sized sub-vectors somehow allowed the compiler (and CPU) to better optimize the code and fetch the next cache line in the background while still working on the current sub-vector with SIMD operations.

I don’t know if this also applies to your case, but I could imagine that Accelerate does something similar under the hood, aside from just using AVX instead of SSE.
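
Roughly what I mean, as an illustrative sketch (the chunk size and function name here are made up, and a real version would also handle a trailing partial chunk):

// Slice a big buffer into cache-line-sized chunks and run the
// FloatVectorOperations call on each chunk. Purely illustrative.
constexpr int chunkSize = int (64 / sizeof (float));   // one 64-byte cache line = 16 floats

void multiplyInChunks (float* dest, const float* src, int numValues)
{
    for (int start = 0; start + chunkSize <= numValues; start += chunkSize)
        juce::FloatVectorOperations::multiply (dest + start, src + start, chunkSize);
}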

4 Likes

When you ran these tests, did you disable JUCE_USE_VDSP_FRAMEWORK? Many of the FloatVectorOperations map to vDSP calls internally, so there shouldn’t be any difference for those as long as JUCE_USE_VDSP_FRAMEWORK is 1. Did you make sure your buffers are aligned? That certainly makes a huge difference for the ones done by JUCE itself. And lastly: what buffer sizes did you use?

2 Likes

When you ran these tests, did you disable JUCE_USE_VDSP_FRAMEWORK? Many of the FloatVectorOperations map to vDSP calls internally, so there shouldn’t be any difference for those as long as JUCE_USE_VDSP_FRAMEWORK is 1.

Nope. That’s the sort of flag I was looking for, but it looks like it’s on by default. I tried explicitly setting it to 1 and it didn’t make any difference. But setting it to zero caused CPU use to jump from about 12% to 14.8% (without -O3 it jumps to 90% and starts glitching :sweat_smile:). So FloatVectorOperations is definitely invoking the vDSP functions. I checked the source briefly and it looks like it’s invoking the functions with the same parameters I am… except with enough overhead that there’s no benefit vs. not using vector operations at all. Maybe LLVM is just not inlining the functions properly or something.

Sounds like it might be the compiler’s fault; I don’t see any issues with JUCE, and I can at least confirm now that my build is set up correctly to use it. Accelerate gives me access to some FMA instructions I can make good use of (e.g., vDSP_vmsbD), so maybe I’ll just stick with invoking it directly.
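
For reference, the direct calls look something like this (simplified, with placeholder names, not my actual processing code):

#include <Accelerate/Accelerate.h>

// out[i] = a[i] * b[i] - c[i]   (vector multiply and subtract, double precision)
void multiplySubtract (const double* a, const double* b, const double* c,
                       double* out, vDSP_Length n)
{
    vDSP_vmsbD (a, 1, b, 1, c, 1, out, 1, n);
}

// out[i] = a[i] * scalar + c[i]   (vector-scalar multiply and add)
void multiplyAdd (const double* a, double scalar, const double* c,
                  double* out, vDSP_Length n)
{
    vDSP_vsmaD (a, 1, &scalar, c, 1, out, 1, n);
}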

Did you make sure your buffers are aligned? That certainly makes a huge difference for the ones done by Juce itself. And lastly - what buffer sizes did you use?

The vectors are 256 doubles long, although not all of the values are always used. (In this test, n=256 in all of the vector calls.) They’re currently aligned to 64 bytes, although I haven’t seen any real difference between that and 16.
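
In case it matters, the buffers are declared roughly like this (simplified; the names are placeholders):

// 256 doubles per vector, aligned to a 64-byte boundary.
alignas (64) double srcBuffer  [256];
alignas (64) double destBuffer [256];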

1 Like

Ok… that’s a weird result then. Btw, if you want truly cross-platform SIMD without extreme pain that also compiles to AVX if needed, you could look into libsimdpp. It is partially similar to that new SIMD type in the JUCE dsp module, but much more complete and supports many types of SIMD instructions. For my stuff I find it much faster than FloatVectorOperations, because I can bundle multiple steps in a medium-sized loop and therefore far fewer loads and stores are needed.

3 Likes

I take back my recommendation for simdpp. I just spent hours debugging a problem with my code on Windows, and when I finally looked at the generated assembly I found that a load instruction had been reduced to nothing during compilation (in release mode with full optimization, and only on Visual Studio 2017). It’s such a nested template jungle that I have no idea how to even begin to debug this problem, but it could also be a compiler bug… in any case, it seems unreliable on Visual Studio 2017 for now.

2 Likes

Did you turn on “Link-Time Optimisation”?
(In the Projucer under Xcode/Release or in Xcode as “Build Settings”/“Code Generation”/“Link-Time Optimization”=Monolithic)

The “cleverness” of libsimdpp’s design is supposed to be that everything is a template which gets reduced to the equivalent native SIMD instructions at compile time (or at runtime with function-level granularity if you use the dynamic dispatch mechanism). It’s a nightmare to debug at both the compile-time and run-time levels. IMO the “encapsulating object” approach with aggressive inlining in JUCE’s dsp::SIMD classes is the best way to handle cross-architecture SIMD.
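
For example, a multiply+add in that style looks roughly like this (a sketch, assuming the buffers are aligned for SIMDRegister and the length is a multiple of the register width; a real version would handle the remainder):

#include <JuceHeader.h>   // assumes the juce_dsp module is in the project

// y[i] += x[i] * gain using the "encapsulating object" style.
void addWithMultiplySIMD (float* y, const float* x, float gain, int numSamples)
{
    using Reg = juce::dsp::SIMDRegister<float>;
    const auto g = Reg::expand (gain);

    for (int i = 0; i < numSamples; i += (int) Reg::SIMDNumElements)
    {
        auto acc = Reg::fromRawArray (y + i);   // requires aligned pointers
        acc += Reg::fromRawArray (x + i) * g;
        acc.copyToRawArray (y + i);
    }
}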

I agree, but as I wrote in the other thread, the JUCE SIMDRegister performs quite badly in debug builds because every single intrinsic is called as a static method. Compared to that, simdpp produced debug code for me that was the same as I’d get from using the intrinsics directly.

I have a similar question. Here is some basic code.

Tested on my old Apple A12X Bionic. I deliberately increased the number of operations just for the test; these functions are all inline, and vectorization is also enabled.

CPU Usage: 75%

for (int i = 0; i < Setup.PhaseLength; ++i) {
   currentSample[i] = (sampleA[i] * alpha) + (sampleB[i] * beta);
}

CPU Usage: 89%

FloatVectorOperations::copyWithMultiply(currentSample, sampleA, alpha, Setup.PhaseLength);
FloatVectorOperations::addWithMultiply(currentSample, sampleB, beta, Setup.PhaseLength);

I expected an improvement, or at least the same performance as the for loop. I assume the loop just got auto-vectorized. But is an auto-vectorized loop really faster than Apple’s AVX operations?

Even std::copy() is faster than FloatVectorOperations::copy for me.

First of all, “CPU Usage” is a relatively broad metric. If you want real insight into execution speed, I’d suggest writing benchmarks or implementing other timing helpers to find out how long a certain implementation actually takes.
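
Something along these lines, for example (just a rough sketch; a real benchmark would repeat the measurement, warm up the caches, and stop the compiler from optimising the work away; the two run* functions are placeholders for the variants above):

#include <chrono>

// Measure how long a callable takes to run, in seconds.
template <typename Fn>
double measureSeconds (Fn&& fn)
{
    const auto start = std::chrono::steady_clock::now();
    fn();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double> (end - start).count();
}

// Usage (placeholder names for the two variants from the post above):
// double tLoop = measureSeconds ([&] { runForLoopVariant(); });
// double tFVO  = measureSeconds ([&] { runFloatVectorOperationsVariant(); });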

Then, there are multiple factors that come into play. The first is the cache: when the buffer size gets big, you hit the point where your data no longer fits in the CPU cache. Optimising memory access patterns is often the most important optimisation. Note that in the FloatVectorOperations case, the currentSample buffer is traversed twice. If it’s too big for the cache, you will produce more cache misses in the second variant than in the first one, where it’s only traversed once.

Besides that, compilers are pretty good at optimising loops nowadays. In the for-loop case, the compiler sees the entire operation, maybe even knows Setup.PhaseLength as a compile-time constant, and can see the bigger picture. In the FloatVectorOperations case, you call a function defined in a different translation unit – this could be optimised by link-time optimisation, but won’t be inlined by the compiler – which in turn calls functions from a dynamic library, which cannot be inlined at all.

Last but not least, you mention Apple’s AVX operations and the A12X Bionic CPU. The A12X is an ARM CPU, which has the NEON SIMD instruction set, while AVX is an x64 instruction set – however, besides NEON having only 128-bit-wide registers compared to AVX’s 256-bit registers, this is more or less only a naming thing; the general technical approach stays the same :wink:

TL;DR: Optimisation is a super complex topic, and you need a good understanding of how CPUs and compilers work in order to make informed decisions. Even then, benchmarking is still the most reliable way to find out which variant is best under the given circumstances. There is no one-size-fits-all answer to what the fastest approach is in general.

2 Likes