Hi All,
I’ve been experimenting recently with improving the performance of my FIR filtering algorithms, and I’ve figured out how to condense the bulk of the processing into a single “inner product” calculation between the filter kernel and the filter state. std::inner_product doesn’t offer a significant improvement over a plain for loop, so I’ve been trying to use juce::dsp::SIMDRegister to implement a faster inner product. The bulk of the algorithm looks like this:
```cpp
// load unaligned data into a SIMD register, one element at a time
inline dsp::SIMDRegister<float> loadUnaligned (const float* x)
{
    dsp::SIMDRegister<float> reg (0.0f);
    for (size_t i = 0; i < dsp::SIMDRegister<float>::SIMDNumElements; ++i)
        reg.set (i, x[i]);

    return reg;
}

// inner product using SIMD registers (kernel must be SIMD-aligned)
inline float simdInnerProduct (const float* in, const float* kernel, int numSamples, float y = 0.0f)
{
    // cast to int, so that (numSamples - simdN) can't wrap around as an unsigned value
    constexpr int simdN = (int) dsp::SIMDRegister<float>::SIMDNumElements;

    // compute SIMD products
    int idx = 0;
    for (; idx <= numSamples - simdN; idx += simdN)
    {
        auto simdIn = loadUnaligned (in + idx);
        auto simdKernel = dsp::SIMDRegister<float>::fromRawArray (kernel + idx);
        y += (simdIn * simdKernel).sum();
    }

    // compute leftover samples
    y = std::inner_product (in + idx, in + numSamples, kernel + idx, y);
    return y;
}
```
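For context, here is a minimal sketch of what I mean by condensing the FIR processing into one inner product. It is plain C++ with no JUCE dependency, and it is not the exact code from my repo: the class name `SimpleFIR`, the `processSample()` method, and the double-length state buffer are all assumptions made for illustration. Each new sample is written twice, so the most recent samples always sit in one contiguous window that can be handed directly to an inner product routine.

```cpp
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical illustration class (not from the original code)
class SimpleFIR
{
public:
    explicit SimpleFIR (std::vector<float> kernelToUse)
        : kernel (std::move (kernelToUse)),
          order (kernel.size()),
          state (2 * kernel.size(), 0.0f)
    {
    }

    float processSample (float x)
    {
        // move the write position backwards, wrapping around at zero
        pos = (pos == 0 ? order - 1 : pos - 1);

        // write the sample twice, so that state[pos] ... state[pos + order - 1]
        // always holds x[n], x[n-1], ..., x[n-order+1] contiguously
        state[pos] = x;
        state[pos + order] = x;

        // y[n] = sum over k of kernel[k] * x[n-k], as a single inner product
        const float* window = state.data() + pos;
        return std::inner_product (window, window + order, kernel.data(), 0.0f);
    }

private:
    std::vector<float> kernel;
    std::size_t order;
    std::vector<float> state;
    std::size_t pos = 0;
};
```

With this layout, the `std::inner_product` call is the only part that would need to be swapped for a SIMD version. The window starts at a different offset on every sample, which is why the filter state generally can't be assumed to be SIMD-aligned.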
Note that the filter kernel is forced to be correctly SIMD-aligned, but for the filter state (in), I need to perform a loadUnaligned() operation. On my Windows machine, this implementation offers a significant improvement over std::inner_product, while on Mac and Linux its performance is considerably worse than std::inner_product. On Mac this isn’t a huge issue, since I can use vDSP_dotpr from the Accelerate library, which (I believe) uses SIMD instructions internally, and gives similar performance improvements.
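For reference, the Mac path can be sketched roughly like this. The wrapper name `dotProduct` is hypothetical, and falling back to std::inner_product on other platforms is an assumption for the sake of a self-contained example, not necessarily how my repo is organised. On Apple platforms this needs to be linked against the Accelerate framework.

```cpp
#include <numeric>

#if defined (__APPLE__)
 #include <Accelerate/Accelerate.h>
#endif

// Hypothetical wrapper: uses vDSP_dotpr where available, std::inner_product otherwise
inline float dotProduct (const float* in, const float* kernel, int numSamples)
{
#if defined (__APPLE__)
    float y = 0.0f;
    // strides of 1, since both arrays are contiguous
    vDSP_dotpr (in, 1, kernel, 1, &y, (vDSP_Length) numSamples);
    return y;
#else
    return std::inner_product (in, in + numSamples, kernel, 0.0f);
#endif
}
```

Note that vDSP_dotpr takes plain float pointers and has no alignment requirement in its signature, so the unaligned filter state can be passed in directly.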
I’m mostly wondering why this discrepancy in performance exists, and how I could tweak my algorithm to be better. TBH, I’m not the most knowledgeable when it comes to SIMD-related things, but I’m trying to learn, so any new information (even if it might be considered “basic”) would be helpful.
For more information on my FIR filtering experiments, check out my GitHub repo.
Thanks,
Jatin
