FIR Filtering with SIMDRegister

Hi All,

I’ve been experimenting recently with trying to improve the performance of my FIR Filtering algorithms, and I’ve figured out how to condense the bulk of the processing to a single “inner product” calculation, between the filter kernel and the filter state. std::inner_product doesn’t offer significant improvement over using a plain for loop, so I’ve been trying to use the juce::dsp::SIMDRegister to implement a faster inner product. The bulk of the algorithm looks like this:

// load unaligned data into SIMD register
inline dsp::SIMDRegister<float> loadUnaligned (const float* x)
{
    dsp::SIMDRegister<float> reg (0.0f);
    for (size_t i = 0; i < dsp::SIMDRegister<float>::SIMDNumElements; ++i) // SIMDNumElements is a size_t
        reg.set (i, x[i]);

    return reg;
}

// inner product using SIMD registers
inline float simdInnerProduct (float* in, float* kernel, int numSamples, float y = 0.0f)
{
    // keep simdN signed, so that (numSamples - simdN) below can't wrap
    // around to a huge unsigned value when numSamples < simdN
    constexpr int simdN = (int) dsp::SIMDRegister<float>::SIMDNumElements;

    // compute SIMD products
    int idx = 0;
    for (; idx <= numSamples - simdN; idx += simdN)
    {
        auto simdIn = loadUnaligned (in + idx);
        auto simdKernel = dsp::SIMDRegister<float>::fromRawArray (kernel + idx);
        y += (simdIn * simdKernel).sum();
    }

    // compute leftover samples
    y = std::inner_product (in + idx, in + numSamples, kernel + idx, y);

    return y;
}
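For reference, the scalar baseline I’m comparing against is just std::inner_product over the same two buffers (the wrapper name here is my own):

```cpp
#include <numeric>

// Scalar baseline: plain dot product of the filter state with the kernel.
inline float scalarInnerProduct (const float* in, const float* kernel,
                                 int numSamples, float y = 0.0f)
{
    return std::inner_product (in, in + numSamples, kernel, y);
}
```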

Note that the filter kernel is forced to be correctly SIMD-aligned, but for the filter state (in) I need to perform a loadUnaligned() operation. I’ve found that on my Windows machine this implementation offers a significant improvement, while on Mac and Linux the performance is considerably worse than std::inner_product. On Mac this isn’t a huge issue, since I can use vDSP_dotpr from the Accelerate library, which (I believe) uses SIMD instructions internally and gives a similar performance improvement.

I’m mostly wondering why this discrepancy in performance exists, and how I could tweak my algorithm to perform better. TBH, I’m not the most knowledgeable when it comes to SIMD-related things, but I’m trying to learn, so any new information (even if it might be considered “basic”) would be helpful.

For more information on my FIR filtering experiments, check out my GitHub repo.

Thanks,
Jatin

Whether loading unaligned data is slower than loading aligned data depends on the CPU you are using. Newer CPUs can do both at the same speed, but anything pre-i3/i5/i7 is a lot slower when loading unaligned data. Are you using the same CPU for Windows and Mac OS X?
Unaligned access often also means less-than-ideal cache behaviour, which decreases performance.

Interesting… The Windows and Linux tests were using the same CPU; the Mac test was using a newer (and all-around more powerful) CPU. Something I’ve seen people do is keep 4 copies of the state vector, each shifted by one sample and all aligned to the correct byte boundaries. I think this would allow for fewer unaligned loads, but might introduce a bit more overhead to set up.
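A rough plain-C++ sketch of that idea (no JUCE; the names and the fixed simdN = 4 are my own, and a real version would need an aligned allocator for the copies, since std::vector’s default allocator only guarantees alignof(float)):

```cpp
#include <cstddef>
#include <vector>

constexpr std::size_t simdN = 4; // assuming 4 float lanes (e.g. SSE)

// Keep simdN copies of the state, where copy c is shifted left by c samples.
// A read of simdN samples starting at any offset k then lands on index
// k - (k % simdN) of copy (k % simdN), which is a multiple of simdN.
std::vector<std::vector<float>> makeShiftedCopies (const std::vector<float>& state)
{
    std::vector<std::vector<float>> copies (simdN);
    for (std::size_t c = 0; c < simdN; ++c)
        copies[c].assign (state.begin() + c, state.end());
    return copies;
}

// Pointer to simdN contiguous samples starting at offset k, positioned so
// that (with an aligned allocator) the load would be an aligned one.
inline const float* shiftedReadPtr (const std::vector<std::vector<float>>& copies,
                                    std::size_t k)
{
    const std::size_t c = k % simdN;
    return copies[c].data() + (k - c); // index is a multiple of simdN
}
```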

That said, I just ran a test on Linux using no unaligned loads, and the SIMD version is still about 3x slower than std::inner_product:

// compute SIMD products
int idx = 0;
for (; idx <= numSamples - (int) simdN; idx += (int) simdN) // signed simdN avoids unsigned wraparound
{
    auto simdIn = dsp::SIMDRegister<float>::fromRawArray (kernel + idx); // THIS IS INCORRECT, JUST FOR PERF. TESTING
    auto simdKernel = dsp::SIMDRegister<float>::fromRawArray (kernel + idx);
    y += (simdIn * simdKernel).sum();
}

I’ll keep experimenting…
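Another thing I want to try: sum() does a horizontal add on every loop iteration, which I suspect is a big part of the cost. Keeping a vector of per-lane running sums and reducing only once after the loop would look roughly like this (plain-array sketch standing in for the SIMDRegister version):

```cpp
// Accumulate per-lane partial sums; do the horizontal reduction once
// at the end instead of once per iteration. A float[4] stands in for
// a 4-lane SIMD register here; compilers usually vectorise this pattern.
inline float accumulatedInnerProduct (const float* in, const float* kernel, int numSamples)
{
    constexpr int simdN = 4; // assuming 4 float lanes
    float acc[simdN] = {};   // one running sum per lane

    int idx = 0;
    for (; idx + simdN <= numSamples; idx += simdN)
        for (int lane = 0; lane < simdN; ++lane)
            acc[lane] += in[idx + lane] * kernel[idx + lane];

    // single horizontal reduction
    float y = acc[0] + acc[1] + acc[2] + acc[3];

    // scalar tail for the leftover samples
    for (; idx < numSamples; ++idx)
        y += in[idx] * kernel[idx];

    return y;
}
```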

You could also process the first few samples outside the loop using extra logic like what you do at the end (“compute leftover samples”) until you reach an aligned sample. Then the loop can use aligned loads.
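In code, that suggestion might look something like this (plain-C++ sketch with made-up names; the inner lane loop marks where the aligned SIMD loads would go, and note that once `in + idx` is aligned, `kernel + idx` will only be aligned too if both buffers start at the same relative offset):

```cpp
#include <cstdint>

// Handle samples one at a time until `in + idx` hits a 16-byte boundary,
// so the main loop can use aligned loads from the state buffer.
inline float prologueInnerProduct (const float* in, const float* kernel, int numSamples)
{
    constexpr int simdN = 4; // assuming 4 float lanes
    float y = 0.0f;

    // scalar prologue: advance until the state pointer is 16-byte aligned
    int idx = 0;
    while (idx < numSamples
            && reinterpret_cast<std::uintptr_t> (in + idx) % 16 != 0)
    {
        y += in[idx] * kernel[idx];
        ++idx;
    }

    // main loop: `in + idx` is now aligned; the lane loop stands in
    // for the SIMD multiply-and-sum
    for (; idx + simdN <= numSamples; idx += simdN)
        for (int lane = 0; lane < simdN; ++lane)
            y += in[idx + lane] * kernel[idx + lane];

    // scalar tail for the leftover samples
    for (; idx < numSamples; ++idx)
        y += in[idx] * kernel[idx];

    return y;
}
```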
