[FR] Vectorized summation

It would be great to have a cross-platform function to sum vectors, something along the lines of float FloatVectorOperations::sum (const float* vec, int n) and its double equivalent. I’ve already run across several uses in DSP and analysis situations.

This is an honest question, not snark: why wouldn’t you write this yourself? I’m coming from the world of Python, where I’m familiar with loops being slow, but I thought vectorization was just a loop created by a more performant language like Cython (C/C++), which we are already using in JUCE.

If you write a regular for-loop yourself, there’s no guarantee the compiler will vectorize it automatically. That’s why a manually written vectorized version would be preferable. (It maybe isn’t that hard to do, but it would be a nice facility to have directly available as a JUCE function.)
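For reference, the hand-written version is just the obvious loop below; whether the compiler turns it into SIMD instructions depends entirely on the compiler and optimisation flags. (The function name sumScalar is just for illustration, not part of JUCE.)

```cpp
// The naive scalar sum. An optimising compiler may auto-vectorize
// this, but there is no guarantee — hence the request for an
// explicit SIMD implementation.
float sumScalar (const float* src, int num)
{
    float total = 0.0f;

    for (int i = 0; i < num; ++i)
        total += src[i];

    return total;
}
```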

I’m confused here. What do you mean by “vectorize”? I thought you were just talking about summing the values in a std::vector. What C++ syntax would accomplish such a thing? You mean like a std::for_each() function? Or…?

I see I had a misunderstanding. I just finished doing a bit of reading, and yes, this is a bit lower level than I’m used to dealing with, based on efficiency gains from SIMD.

There are so many overlaps in terminology between technical fields that it’s easy to think you know something and then turn out not to have any idea.

“Vectorize” as in “use SIMD instructions” in the hopes it will make the execution faster.

And not that it matters, since it’s not what the OP was talking about, but to sum a std::vector you would likely want to use std::accumulate :nerd_face:
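For completeness, a minimal sketch of that std::accumulate approach (the wrapper name sumVector is hypothetical, just to give it a testable shape):

```cpp
#include <numeric>
#include <vector>

// Sums the elements of a std::vector<float> with std::accumulate,
// which folds operator+ over the range starting from the initial value.
// Note the 0.0f: passing a plain 0 would accumulate in int and
// truncate every element.
float sumVector (const std::vector<float>& v)
{
    return std::accumulate (v.begin(), v.end(), 0.0f);
}
```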


Something like that, although I don’t have the JUCE syntax for their SIMD macros down:

float JUCE_CALLTYPE FloatVectorOperations::sum (const float* src, int num) noexcept
{
    float sum = 0;

   #if JUCE_USE_VDSP_FRAMEWORK
    // vDSP_sve sums a vector: stride 1, num elements, result in sum.
    vDSP_sve ((float*) src, 1, &sum, (vDSP_Length) num);
   #else
    assert (0); // no fallback implementation yet for other platforms
   #endif

    return sum;
}

Indeed, SIMD is the more accurate term here. I wrote one myself, both for the Mac (vDSP) and for SSE intrinsics. But it seems a frequent enough function that it would make sense to have a canonical implementation right there with the other functions, i.e. SIMD multiply, add, etc.
My implementation is tailored to my special needs, e.g. it assumes memory is aligned to 16-byte boundaries. A more generalized implementation would be better. And if we wanted to port the code to ARM, we’d have to remember to add a special implementation there.
Both Clang and Visual Studio with optimisation turned all the way up didn’t vectorize the trivial loop automatically, and doing it manually yielded a very measurable difference in performance.
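For illustration, an SSE version along those lines might look like the sketch below. It makes the same simplifying assumptions the poster describes: src is 16-byte aligned and n is a multiple of 4. The function name sumSSE is hypothetical, not JUCE API.

```cpp
#include <xmmintrin.h> // SSE intrinsics

// Sketch of an SSE sum. Assumes src is 16-byte aligned and n is a
// multiple of 4; a general implementation would need an unaligned
// path and a scalar tail loop for the remaining elements.
float sumSSE (const float* src, int n)
{
    __m128 acc = _mm_setzero_ps();

    // Accumulate four floats per iteration.
    for (int i = 0; i < n; i += 4)
        acc = _mm_add_ps (acc, _mm_load_ps (src + i));

    // Horizontal add of the four lanes.
    alignas (16) float lanes[4];
    _mm_store_ps (lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

Note that the summation order differs from a sequential loop (four partial sums), so results can differ from the scalar version in the last bits of precision.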