Need help with non-uniformly partitioned convolution

I’m trying to implement an overlap-save, non-uniformly partitioned convolver in C++. I need help with calculating the position at which the output of a certain partition size should be added to the processed_buffer.

Let’s say processed_buffer is a very large buffer from which the output is read.

Let’s say our filter partitioning is P = { [B1]^n1, [B2]^n2, … } (where Bi is the size of the i-th partition segment and ni is the number of parts with size Bi).

So the latency of the system would be B1 samples.
We start writing to processed_buffer at index B1, but the output is read from index 0 (this maintains the latency).
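
For example (numbers made up, just to fix the notation):

```cpp
// Hypothetical partition plan P = { [128]^4, [1024]^8 }:
// 4 parts of 128 samples followed by 8 parts of 1024 samples.
constexpr int B1             = 128;  // smallest block size
constexpr int latencySamples = B1;   // overall I/O latency of the convolver
constexpr int writeOffset    = B1;   // first index written in processed_buffer
constexpr int readOffset     = 0;    // index the output is read from
```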

My understanding of the process loop so far:

1 → Maintain an array of queues, where each queue stores the results for one partition size (the number of elements in that queue equals ni); let’s call this array DFT_history.

2 → When you have Bi new samples since the last process (or, for the overlap, take the overlapping parts) for a partition size, you process them (FFT on the new Bi unprocessed samples).

3 → When the deadline arrives for a certain partition size, push the new FFT’d block (generated from the last process for this size) to the correspondingly sized queue in DFT_history, pop the oldest block, multiply the corresponding partitions of the FFT’d IR with the DFT_history for that size, IFFT the results individually, add them, and write (“+=” write) the sum to the processed buffer.
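
To make the questions below concrete, here is roughly the layout I have in mind for DFT_history and the step-3 update (all names are mine; the idft helper is a naive stand-in just so the sketch is self-contained, a real implementation would use FFTW/pffft/etc.; writeIndex is exactly the value I don’t know how to compute):

```cpp
#include <complex>
#include <deque>
#include <vector>

using Spectrum = std::vector<std::complex<float>>;

// Naive inverse DFT stand-in so the sketch is self-contained; a real
// implementation would use FFTW / pffft / kissfft instead.
std::vector<float> idft(const Spectrum& X)
{
    const float pi = 3.14159265f;
    std::vector<float> x(X.size(), 0.0f);
    for (std::size_t n = 0; n < X.size(); ++n)
    {
        std::complex<float> acc{0.0f, 0.0f};
        for (std::size_t k = 0; k < X.size(); ++k)
            acc += X[k] * std::polar(1.0f, 2.0f * pi * k * n / X.size());
        x[n] = acc.real() / float(X.size());
    }
    return x;
}

// One entry of DFT_history plus the matching FFT'd IR partitions for size Bi.
struct FrequencyDelayLine
{
    int blockSize = 0;                // Bi
    std::deque<Spectrum> history;     // last ni input spectra for this size
    std::vector<Spectrum> irSpectra;  // ni IR partitions of size Bi, zero-padded to 2*Bi and FFT'd
};

// Step 3 as I understand it: push the newest input spectrum, drop the oldest,
// multiply-accumulate against the IR spectra (summing in the frequency domain
// is equivalent to IFFT'ing each product and adding), IFFT once, and "+=" the
// valid half into processed_buffer at writeIndex.
void onDeadline(FrequencyDelayLine& fdl,
                Spectrum newInputSpectrum,           // FFT of the last 2*Bi input samples (overlap-save)
                std::vector<float>& processedBuffer,
                std::size_t writeIndex)              // <- this is the index I am asking about
{
    fdl.history.push_front(std::move(newInputSpectrum));
    if (fdl.history.size() > fdl.irSpectra.size())
        fdl.history.pop_back();

    Spectrum acc(2 * static_cast<std::size_t>(fdl.blockSize));
    for (std::size_t part = 0; part < fdl.history.size(); ++part)
        for (std::size_t bin = 0; bin < acc.size(); ++bin)
            acc[bin] += fdl.history[part][bin] * fdl.irSpectra[part][bin];

    const std::vector<float> block = idft(acc);

    // Overlap-save: discard the first Bi output samples (circular wrap-around),
    // keep and accumulate the last Bi samples.
    for (int n = 0; n < fdl.blockSize; ++n)
        processedBuffer[writeIndex + n] += block[fdl.blockSize + n];
}
```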

  • Is there anything wrong with my understanding of the process loop?
  • Do we need to pad the IR segments with zeros before FFT’ing and storing them as the FFT’d IR?
  • How do we calculate the deadline, in samples, for a segment with partition size Bi?
  • At what index do we write the results from a certain partition size to the processed_buffer?

Any help would be appreciated.

The following picture from “Partitioned convolution algorithms for real-time auralization” might be helpful (so that you can check your understanding).

TBH, I gave up on it because of the complex dependencies between the deadlines. And I am afraid that, because I cannot re-use the FFT results of the audio samples, the increase in time complexity is too large.

Thank you for the reply, I did indeed read that publication, but I am still not sure about the deadlines. According to the book, a block of size Bi is processed every time Bi new samples are available for it. That is counter-intuitive to me, because it means the deadlines do not depend on the number of parts of each size at all, which does not seem possible, since the indices should stay consistent even with different partition plans.

Hey,

I had to implement something similar a few years ago. It was mostly based on the paper “Optimal Filter Partitions For Non-Uniformly Partitioned Convolution”, and tbf that was quite a hairy one to implement. I don’t have the time to melt my brain again on this, but IIRC:

  • The paper suggests dividing the signal as you described, and computing the head with a FIR filter so you have 0 latency. I think the paper calls each group of ni chunks of size Bi a “frequency delay line”.
  • I maintained a global “clock” incremented in my audio callback, and stamped each Bi block with this clock when I pushed samples to fill it, so that once it had been computed I could deduce where its output needed to be added (see the sketch after this list). Each Bi block had a status (something like idle = you know you can fill it, filling = being filled, full = can be picked by a worker thread to be computed, done = output ready, the audio process callback can use this one when the clock reaches that point).
  • I had a set of threads spinning, one for each Bi (i.e. one thread to compute the ni parts of that size). So if you have 3 partition sizes, you’ll have 3 worker threads.
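
I don’t have that code anymore, but from memory the bookkeeping looked roughly like this (names and details are reconstructed from memory, so treat it as a sketch rather than my actual code):

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

// Rough sketch of the per-block bookkeeping (reconstructed from memory).
enum class BlockStatus { Idle, Filling, Full, Done };

struct PartitionBlock
{
    int           blockSize   = 0;  // Bi
    std::uint64_t clockAtFill = 0;  // global clock value stamped when the block became full;
                                    // once the worker has computed it, this tells the audio
                                    // callback where the output has to be added
    BlockStatus   status      = BlockStatus::Idle;
    std::vector<float> input;       // samples being collected
    std::vector<float> output;      // result, valid when status == Done
};

// Global sample clock, incremented by the audio callback.
std::atomic<std::uint64_t> globalClock{0};

void audioCallback(const float* in, float* out, int numSamples)
{
    // 1) push `in` into the PartitionBlocks that are Idle/Filling, stamping
    //    them with globalClock when they become Full,
    // 2) mix in every PartitionBlock whose status is Done and whose
    //    clockAtFill says its output is due now.
    globalClock += static_cast<std::uint64_t>(numSamples);
    (void)in; (void)out; // bodies of steps 1) and 2) elided in this sketch
}
```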

To get the algorithm to work, I did not involve threads at the beginning but tested it on the message thread, triggering the computation of the Bi blocks manually once they were filled etc., then compared the result to a set of reference convolution data from numpy or something like that.

Regarding deadlines for the computation on worker threads, I think that when picking a new “task” to compute, the Bi with the lowest “global clock” got priority. But you cannot guarantee anything in the end, so when your audio callback comes in it picks whatever is available, and if for some reason something ran late then… tough :confused: But in my experience, once you get it right and tweak the partition sizes a bit, this does not happen.

More recently I had to do another “multithreaded convolution” and went for a uniformly partitioned one, which is way easier (: I based my implementation around this: GitHub - HiFi-LoFi/FFTConvolver: Audio convolution algorithm in C++ for real time audio processing. Maybe this can help you with your indexing issues.

Anyway, hope this helps and good luck.

My implementation is somewhat similar: I have a worker thread pool that I can offload my work to; the workers process and store the results in a temporary container.

I maintained a global “clock” incremented in my audio callback

I have a similar mechanism called ticks: a tick represents that we have B1 new samples. Because all the sizes are powers of 2 and > B1, they get dispatch calls on one of the ticks whenever enough samples are available.
And when the deadline arrives for a particular size, we sync the new changes to the wet buffer and the queues.
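
In other words, the dispatch rule I mean is something like this (toy numbers):

```cpp
#include <cstdint>
#include <cstdio>

int main()
{
    // One tick = B1 new input samples. Each size Bi (a power of two, >= B1)
    // gets a dispatch call every Bi / B1 ticks, i.e. whenever its input block fills up.
    const int B1 = 128;
    const int sizes[] = { 128, 512, 2048 };

    for (std::uint64_t tick = 1; tick <= 32; ++tick)
        for (int Bi : sizes)
            if (tick % static_cast<std::uint64_t>(Bi / B1) == 0)
                std::printf("tick %llu: dispatch FFT for partition size %d\n",
                            static_cast<unsigned long long>(tick), Bi);
}
```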

The paper suggests dividing the signal as you described, and computing the head with a FIR filter so you have 0 latency. I think the paper calls each group of ni chunks of size Bi a “frequency delay line”.

Is it possible to make 0 latency FIR filters?

Most of my implementation is done, including syncing, distributing work among threads, and parking the worker threads (no busy waiting) when no work is available. Deadlines and mixing from the different partition-sized queues are still left to be implemented.

Thank you for the resources, I will check them out.

Yes, it is. Straight-up convolution is indeed FIR filtering. Basically this, where blue is the filter (i.e. the impulse response) and red is the signal: Convolution - Wikipedia
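
In code it is just the plain convolution sum, e.g.:

```cpp
#include <vector>

// Direct-form FIR filtering == straight convolution of the signal x with the IR h.
std::vector<float> fir(const std::vector<float>& x, const std::vector<float>& h)
{
    if (x.empty() || h.empty())
        return {};

    std::vector<float> y(x.size() + h.size() - 1, 0.0f);
    for (std::size_t n = 0; n < x.size(); ++n)
        for (std::size_t k = 0; k < h.size(); ++k)
            y[n + k] += x[n] * h[k];
    return y;
}
```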

Sorry, I meant to ask if 0 latency is possible when using fast convolution?

If what you mean by “fast convolution” is “partitioned convolution where you compute chunks on worker thread(s)”, then the answer is yes, if you treat the first bit of your impulse response (let’s say 256 or 512 samples) as a FIR filter and perform “direct convolution” (i.e. FIR filtering) in your audio callback directly, while the remaining part of your impulse response is computed on background thread(s) as you described.
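
Very roughly, the split looks something like this (a sketch with made-up names, not production code; the naive push-front delay line is just for clarity):

```cpp
#include <vector>

// The first `headIR.size()` taps of the impulse response are convolved directly
// (FIR filtering) in the audio callback, so they contribute with zero latency.
// The rest of the IR is handled by the partitioned convolver on background
// threads and mixed into the output when its results come due.
struct ZeroLatencyConvolverSketch
{
    std::vector<float> headIR;     // e.g. the first 256 or 512 taps of the IR
    std::vector<float> delayLine;  // input history for the direct convolution

    void processBlock(const float* in, float* out, int numSamples)
    {
        for (int n = 0; n < numSamples; ++n)
        {
            // push the newest sample to the front of the history (naive, for clarity)
            delayLine.insert(delayLine.begin(), in[n]);
            delayLine.resize(headIR.size());

            // direct convolution with the head of the IR
            float acc = 0.0f;
            for (std::size_t k = 0; k < headIR.size(); ++k)
                acc += headIR[k] * delayLine[k];
            out[n] = acc;
        }
        // ... the tail contributions computed on the worker threads get "+=" mixed
        // into `out` (or a shared wet buffer) when their deadlines are reached.
    }
};
```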

Oh, thanks for the info.

Thank you for the reply, I did indeed read that publication, but I am still not sure about the deadlines. According to the book, a block of size Bi is processed every time Bi new samples are available for it. That is counter-intuitive to me, because it means the deadlines do not depend on the number of parts of each size at all, which does not seem possible, since the indices should stay consistent even with different partition plans.

If I understand the book correctly, each horizontal black segment represents all the sub-filters with the same size. Therefore, when you have collected B samples (at time S0), you need to do the FFT, then two spectrum multiplications (corresponding to the first and the second sub-filter; I think the second one needs an extra block of delay), and then the iFFT.
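
In code, I picture one of those segments as something like this (my own notation; X0 is the spectrum of the block collected at S0, X1 is the spectrum of the block collected one block earlier, i.e. the delayed one):

```cpp
#include <complex>
#include <vector>

using Spectrum = std::vector<std::complex<float>>;

// Two spectrum multiplications for the two sub-filters of the same size:
// H0 pairs with the newest input spectrum, H1 with the one-block-delayed spectrum.
// The accumulated result still needs an iFFT before it goes into the output buffer.
Spectrum accumulateSegment(const Spectrum& X0, const Spectrum& X1,
                           const Spectrum& H0, const Spectrum& H1)
{
    Spectrum Y(X0.size());
    for (std::size_t k = 0; k < Y.size(); ++k)
        Y[k] = X0[k] * H0[k] + X1[k] * H1[k];
    return Y;
}
```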