As someone who did performance-optimization consulting for 10+ years, and who has spent the last few months extensively benchmarking IPP/vDSP functions (I was literally doing it while reading this thread), I was ready to fly into this thread guns blazing, 100% sure that mistakes had been made. vDSP will always be faster!
So let's test it with a real benchmark library: cache warming, guards against the compiler optimizing away the result, a decent number of iterations, and so on. And let's actually use the correct vDSP call for this use case!
So I tried it out with 512 samples in a plain old std::vector.
Uh oh…
benchmark name                            samples   iterations    estimated
                                          mean      low mean      high mean
                                          std dev   low std dev   high std dev
-------------------------------------------------------------------------------
loop                                          100          212     2.2684 ms
                                       106.375 ns    105.85 ns   107.404 ns
                                       3.55755 ns   1.91206 ns   5.38678 ns

FloatVectorOperations copy then add           100           53     2.2843 ms
                                       440.438 ns   439.298 ns   444.786 ns
                                       10.0575 ns   2.32347 ns   23.3693 ns

vDSP_vsmsma                                   100           44     2.2924 ms
                                       542.155 ns   536.908 ns   556.661 ns
                                       41.1198 ns   17.1323 ns   87.2612 ns
Code: Catch2 benchmarks for 512 samples
SECTION ("nerd sniped")
{
    std::vector<float> A;
    std::vector<float> B;
    std::vector<float> result;
    A.resize (512);
    B.resize (512);
    result.resize (512);

    float alpha = 3.5f;
    float beta = 1.2f;

    BENCHMARK ("loop")
    {
        for (size_t i = 0; i < result.size(); ++i)
        {
            result[i] = A[i] * alpha + B[i] * beta;
        }
        return result;
    };

    BENCHMARK ("FloatVectorOperations copy then add")
    {
        juce::FloatVectorOperations::copyWithMultiply (result.data(), A.data(), alpha, 512);
        juce::FloatVectorOperations::addWithMultiply (result.data(), B.data(), beta, 512);
        return result;
    };

    BENCHMARK ("vDSP_vsmsma")
    {
        vDSP_vsmsma (A.data(), 1, &alpha, B.data(), 1, &beta, result.data(), 1, 512);
        return result;
    };
}
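For reference, Apple documents vDSP_vsmsma as D[n] = A[n]*S + B[n]*T, i.e. two vector-scalar multiplies, summed. A plain scalar equivalent (my own sketch, with strides fixed at 1 and a name of my own choosing) is handy for sanity-checking the vectorized calls before trusting the benchmark numbers:

```cpp
#include <cstddef>
#include <vector>

// Scalar reference for what vDSP_vsmsma computes with unit strides:
// d[n] = a[n] * s + b[n] * t   (vector*scalar + vector*scalar)
std::vector<float> vsmsmaRef (const std::vector<float>& a, float s,
                              const std::vector<float>& b, float t)
{
    std::vector<float> d (a.size());
    for (std::size_t n = 0; n < a.size(); ++n)
        d[n] = a[n] * s + b[n] * t;
    return d;
}
```

Comparing this against the vDSP output for a few random buffers is a cheap way to catch argument-order mistakes, which is easy to get wrong given vDSP's pointer-to-scalar calling convention.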
At 512 samples, not only is the raw loop faster, the vDSP-specific call is actually the worst performer.
What about smaller sample blocks? Here’s 64 samples:
benchmark name                            samples   iterations    estimated
                                          mean      low mean      high mean
                                          std dev   low std dev   high std dev
-------------------------------------------------------------------------------
loop                                          100          637     2.2295 ms
                                       34.8611 ns   34.8395 ns   34.8957 ns
                                        0.1366 ns  0.0943064 ns  0.246368 ns

FloatVectorOperations copy then add           100          529     2.2218 ms
                                       43.3149 ns   42.8108 ns   44.8642 ns
                                       3.93341 ns   0.665487 ns   8.53929 ns

vDSP_vsmsma                                   100          626     2.2536 ms
                                       35.9621 ns   35.9349 ns   36.0452 ns
                                       0.221448 ns  0.0873736 ns  0.490156 ns
Interesting! At 64 samples, vDSP_vsmsma and the raw loop are now about equal.
BUT WAIT THERE’S MORE!!
Right! I was using std::vector above, with no attention to alignment.
Let's check things out with AudioBlock, with its default alignment of sizeof (SIMDRegister<NumericType>):
benchmark name                            samples   iterations    estimated
                                          mean      low mean      high mean
                                          std dev   low std dev   high std dev
-------------------------------------------------------------------------------
loop                                          100         2006     2.2066 ms
                                       11.2967 ns   11.2277 ns   11.6072 ns
                                       0.638054 ns  0.0714107 ns   1.51036 ns

FloatVectorOperations copy then add           100         1209     2.1762 ms
                                       18.8336 ns   18.5334 ns   19.3757 ns
                                       2.00456 ns    1.31133 ns   3.17008 ns

FloatVectorOperations add then add            100         1097      2.194 ms
                                       21.4337 ns   21.0371 ns   22.3376 ns
                                       2.87391 ns    1.53503 ns   5.18361 ns

vDSP_vsmsma                                   100         1966     2.1626 ms
                                       10.7294 ns   10.7214 ns    10.753 ns
                                       0.0649882 ns 0.0280217 ns  0.141415 ns
Code: AudioBlock benchmarks for 64 samples
SECTION ("nerd sniped")
{
    // use AudioBlocks to ensure alignment
    juce::HeapBlock<char> aData;
    juce::dsp::AudioBlock<float> a = { aData, 1, 64 };
    juce::HeapBlock<char> bData;
    juce::dsp::AudioBlock<float> b = { bData, 1, 64 };
    juce::HeapBlock<char> resultData;
    juce::dsp::AudioBlock<float> result = { resultData, 1, 64 };

    float alpha = 3.5f;
    float beta = 1.2f;

    BENCHMARK ("loop")
    {
        for (int i = 0; i < (int) result.getNumSamples(); ++i)
        {
            result.setSample (0, i, alpha * a.getSample (0, i) + beta * b.getSample (0, i));
        }
        return result.getChannelPointer (0);
    };

    BENCHMARK ("FloatVectorOperations copy then add")
    {
        juce::FloatVectorOperations::copyWithMultiply (result.getChannelPointer (0), a.getChannelPointer (0), alpha, 64);
        juce::FloatVectorOperations::addWithMultiply (result.getChannelPointer (0), b.getChannelPointer (0), beta, 64);
        return result.getChannelPointer (0);
    };

    BENCHMARK ("vDSP_vsmsma")
    {
        vDSP_vsmsma (a.getChannelPointer (0), 1, &alpha, b.getChannelPointer (0), 1, &beta, result.getChannelPointer (0), 1, 64);
        return result.getChannelPointer (0);
    };
}
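Incidentally, you don't need JUCE to get SIMD-friendly alignment. Since C++17, std::aligned_alloc does the same job (though macOS only gained it in 10.15). A sketch, with makeAligned being a hypothetical helper of my own:

```cpp
#include <cstdint>
#include <cstdlib>
#include <memory>

// The buffer came from std::aligned_alloc, so it must be freed with std::free.
struct FreeDeleter { void operator() (void* p) const { std::free (p); } };

// Allocate 'count' floats aligned to 'alignment' bytes; 32 covers a
// NEON/AVX float register. std::aligned_alloc requires the byte size to be
// a multiple of the alignment, so round it up first.
std::unique_ptr<float[], FreeDeleter> makeAligned (std::size_t count,
                                                   std::size_t alignment = 32)
{
    std::size_t bytes = count * sizeof (float);
    bytes = (bytes + alignment - 1) / alignment * alignment;
    return std::unique_ptr<float[], FreeDeleter> (
        static_cast<float*> (std::aligned_alloc (alignment, bytes)));
}
```

Handy if you want to rule out JUCE's wrappers as a variable while benchmarking.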
Vindicated! 

(Just barely!)
Edit: whups, got greedy there; a convenient typo in my first run made vDSP look about 5x faster than the aligned raw loop. In reality, at 64 samples the "right" vDSP function is only slightly faster than the raw loop (10.7 ns vs 11.3 ns mean). But note the raw loop itself is almost 3x faster when aligned!
What about with AudioBlock’s default alignment with 512 samples?
benchmark name                            samples   iterations    estimated
                                          mean      low mean      high mean
                                          std dev   low std dev   high std dev
-------------------------------------------------------------------------------
loop                                          100          459     2.2491 ms
                                       49.9297 ns    48.795 ns   53.1631 ns
                                       8.86625 ns   3.14064 ns   18.6949 ns

FloatVectorOperations copy then add           100           32     2.3136 ms
                                       719.249 ns   713.638 ns   737.375 ns
                                       45.3276 ns   10.7257 ns   98.9282 ns

vDSP_vsmsma                                   100           62     2.2754 ms
                                       356.814 ns    355.94 ns   358.407 ns
                                       5.84885 ns   3.14928 ns   9.64479 ns
Yikes, looks like the raw loop wins.
So, lesson learned: alignment matters. And for simple loops, benchmark the vectorized versions before assuming they'll win.
I've been learning a lot of these lessons over and over again lately (be careful making assumptions about performance, be careful about making too many generalizations, triple-check all the numbers). For example, I'm not 100% convinced that returning the std::vector/raw pointer is truly enough to prevent the compiler from optimizing some of the code away, but I've tried a few alternatives and got the same consistent result. It's why I prefer tools like Perfetto that can measure real-time performance in-app.
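On that last point: returning the result from the BENCHMARK lambda is what Catch2 suggests, but a belt-and-braces option is the "escape" trick Google Benchmark uses for benchmark::DoNotOptimize. A sketch (GCC/Clang only; doNotOptimize and sumWithGuard are names of my own):

```cpp
// Tell the optimizer the value "escapes" via an empty inline-asm statement,
// so the computation that produced it cannot be removed as dead code.
template <typename T>
inline void doNotOptimize (const T& value)
{
    asm volatile ("" : : "r,m" (value) : "memory");
}

float sumWithGuard (const float* data, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; ++i)
        sum += data[i];
    doNotOptimize (sum);   // the loop above can no longer be elided
    return sum;
}
```

Dropping a call like this inside the measured body removes any doubt about whether the return value alone keeps the work alive.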