Better strategy for channel coherence?

If you update parameters (like filter coefficients) in realtime at a per-sample level, the buffer iteration order normally seen in most modules works against channel coherence (when not mono). There is no coherence during an interpolated update between two parameter values when using this:

for channel in channels
    for sample in buffer
        do interesting stuff

There can be an update between any two samples, so the update is not synchronized between the channels.

Instead we would need to consider each “sample” as a compound of its channels, i.e. reverse the order:

for sample in buffer
    for channel in sample
        do interesting stuff

The number of iterations is the same, but in the second case we get many short loops and one long one instead of two long ones, so the number of loops is much higher.

Since this is seldom seen: does anyone have an idea of how much worse this second strategy is in terms of performance?

Or maybe there’s a better solution to this that I’ve missed?

What I usually do to mitigate that problem is to read all values at the beginning of the processBlock().

Looking up the parameter inside the loop is a waste anyway and could lead to the problem you are describing.

1 Like

Leaving the problem of changing parameters aside for a moment, looping over the channels in the outer loop and over the samples in the inner loop is a lot more efficient. The reason for that is that the data is laid out channel-wise in memory, which means adjacent samples of a channel are adjacent in memory. On the hardware level, once you access the first sample in a channel buffer, the CPU will load a bunch of adjacent bytes following the memory location you just accessed into its cache, which means that the following samples will be accessible a lot faster as they are already in the CPU cache. If you loop over the samples in the outer loop and then loop over the channels in the inner loop, you create a memory access pattern where each access lands somewhere else entirely in your RAM, which will in turn likely cause a lot more cache misses – that means your CPU has to fetch the required memory from RAM into the cache before the actual computation can execute, which will slow down your execution a bit.
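
To make the memory layout concrete, here is a minimal standalone sketch (not from this thread) assuming a channel-major layout where each channel owns its own contiguous array, similar to how juce::AudioBuffer stores its data:

#include <cstddef>
#include <vector>

// Channel-major layout: each channel is its own contiguous float array.
using MultiChannelBuffer = std::vector<std::vector<float>>;

// Channels outer, samples inner: each inner loop walks one contiguous array,
// so the cache lines loaded on the first access keep getting reused.
void processChannelsOuter (MultiChannelBuffer& channels)
{
    for (auto& channel : channels)
        for (auto& sample : channel)
            sample *= 0.5f;
}

// Samples outer, channels inner: every inner iteration jumps to a different
// channel array, i.e. a completely different region of memory.
void processSamplesOuter (MultiChannelBuffer& channels)
{
    const std::size_t numSamples = channels.empty() ? 0 : channels[0].size();

    for (std::size_t i = 0; i < numSamples; ++i)
        for (auto& channel : channels)
            channel[i] *= 0.5f;
}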

Now with that knowledge, let’s go back to your original question. If you want sample-accurate parameter handling you might want to accept the disadvantage and trade some efficiency for accuracy. But you should also ask yourself whether you can even expect your parameter values to change that often. At the time of writing, the major DAWs that I know of will update the parameters once per block if you automate them; some tend to slice the audio into smaller blocks when you write fine-grained automation curves, to update the parameters more often. And regarding user interaction, my experience is that the user will barely notice if a parameter changes a few hundred samples earlier or later.

We usually compute new processing parameters in the parameter change callback and write them to atomic variables. The processing code then fetches the values of the atomics at the beginning of the block and writes them to temporary variables which are then used to process the current block. For heavy computations on parameter changes, such as re-computing an impulse response, we even offload the calculation to a background thread which then applies the new values as soon as they are ready. This might even create a delay of several blocks between setting the parameter and applying it to the processing, but even that is barely noticeable.
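
As a rough, JUCE-agnostic sketch of that pattern (the names and the single gain parameter are just illustrative, not the poster’s actual code):

#include <atomic>
#include <cstddef>

struct GainProcessor
{
    // Written from the parameter-change callback (message/host thread).
    std::atomic<float> targetGain { 1.0f };

    void parameterChanged (float newGain)
    {
        targetGain.store (newGain, std::memory_order_relaxed);
    }

    // Audio thread: one atomic read per block, then work with a plain local copy.
    void processBlock (float* const* channels, std::size_t numChannels, std::size_t numSamples)
    {
        const float gain = targetGain.load (std::memory_order_relaxed);

        for (std::size_t ch = 0; ch < numChannels; ++ch)
            for (std::size_t i = 0; i < numSamples; ++i)
                channels[ch][i] *= gain;
    }
};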

So I’d advise you to go for channels in the outer loop, samples in the inner loop and pre-fetching parameter values before processing.

1 Like

(*) in realtime playback.
If you use that strategy in non-realtime rendering, you should have a blocking call that waits until the new set of values is available. In an offline bounce, asynchronous operations can easily arrive when rendering has already finished, or at least a couple of blocks late.

1 Like

Thanks for the input.

It’s true that hardly any parameter changes need to be sample-accurate as far as user interaction is concerned, but in my current experiment with biquad filter coefficients, updating per block causes noticeable zipping, even with smoothing.

Although I haven’t looked into software synthesizers myself, I guess a quick resonant filter envelope sweep can’t be updated per process block either.

Anyway, I actually had a go at a biquad peak filter based on Johannes Menzel’s code, in an implementation derived from the processing loop in the JUCE StateVariableTPTFilter, and reversed its iteration order to add per-sample updating of the coefficients for smoothing. The difference is huge: it sounds much better, and I don’t think I can detect any zipping at all with my ears. But as you say, the processing is probably less efficient.

Now the loop looks like this:

for (size_t i = 0; i < numSamples; ++i)
{
    for (size_t channel = 0; channel < numChannels; ++channel)
    {
        auto* inputSamples  = inputBlock .getChannelPointer (channel); // how heavy is this?
        auto* outputSamples = outputBlock.getChannelPointer (channel);
        outputSamples[i] = processSample ((int) channel, inputSamples[i]);
    }
}

One thing I’m unsure about is the two .getChannelPointer assignments. Is that a heavy operation, so that I would benefit from moving it outside the inner loop and accessing already assigned channel pointers (one per channel, obviously) instead?
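
If you want to try hoisting those lookups, one possible shape for it is sketched below. It assumes the same inputBlock, outputBlock and processSample() as in the code above; std::vector is used just for brevity here, and in real code you would preallocate the containers outside the audio callback (e.g. in prepare()):

// Fetch the channel pointers once per block instead of once per sample.
std::vector<const float*> inputs  (numChannels);
std::vector<float*>       outputs (numChannels);

for (size_t channel = 0; channel < numChannels; ++channel)
{
    inputs[channel]  = inputBlock .getChannelPointer (channel);
    outputs[channel] = outputBlock.getChannelPointer (channel);
}

for (size_t i = 0; i < numSamples; ++i)
    for (size_t channel = 0; channel < numChannels; ++channel)
        outputs[channel][i] = processSample ((int) channel, inputs[channel][i]);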

Another strategy that I’m not sure how to implement, but that should work:
Pre-generate the coefficient updates so the same “curve” can be applied identically to both channels per processBlock. The question is whether that would be more efficient. I would still need separate coefficients per channel, and those would require the number of update calls times the number of channels. Maybe less efficient in the end?
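
A rough sketch of that idea, with assumptions that go beyond the post: BiquadChannel and setFrequency() are hypothetical stand-ins for the real per-channel filter state, and the linear ramp stands in for whatever smoothing curve is actually used:

#include <cstddef>
#include <vector>

// Hypothetical per-channel filter state (placeholder implementation).
struct BiquadChannel
{
    void setFrequency (float /*hz*/) {}               // would recompute the coefficients
    float processSample (float x)    { return x; }    // would run the biquad
};

void processBlockWithSharedCurve (float* const* channelData, std::size_t numChannels,
                                  std::size_t numSamples, std::vector<BiquadChannel>& filters,
                                  float startFreq, float endFreq)
{
    // One parameter curve per block, shared by all channels (preallocate in real code).
    std::vector<float> freqCurve (numSamples);
    for (std::size_t i = 0; i < numSamples; ++i)
        freqCurve[i] = startFreq + (endFreq - startFreq) * (float) i / (float) numSamples;

    // Channels stay in the outer loop (cache-friendly), but every channel still
    // pays for one coefficient update per sample.
    for (std::size_t ch = 0; ch < numChannels; ++ch)
        for (std::size_t i = 0; i < numSamples; ++i)
        {
            filters[ch].setFrequency (freqCurve[i]);
            channelData[ch][i] = filters[ch].processSample (channelData[ch][i]);
        }
}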

Also, if updating per process block, I guess a simple crossfade between the block rendered with the previous coefficients and the same block rendered with the updated coefficients would work. I haven’t figured out how to crossfade AudioBlocks, though; it seems I can’t use any predefined classes for that.
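
For reference, a crossfade like that doesn’t need any AudioBlock support; done on the raw channel pointers it could look roughly like this (a sketch, assuming you have already rendered the block twice into your own temporary buffers):

#include <cstddef>

// Linear crossfade from a block rendered with the old coefficients to the same
// block rendered with the new ones. oldOut/newOut are temporary buffers you
// filled yourself; dest is the output handed back to the host.
void crossfadeBlocks (float* const* dest,
                      const float* const* oldOut, const float* const* newOut,
                      std::size_t numChannels, std::size_t numSamples)
{
    for (std::size_t ch = 0; ch < numChannels; ++ch)
        for (std::size_t i = 0; i < numSamples; ++i)
        {
            const float t = (float) i / (float) numSamples;   // ramps from 0 towards 1 across the block
            dest[ch][i] = (1.0f - t) * oldOut[ch][i] + t * newOut[ch][i];
        }
}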

Interesting topic anyway.

Performance-wise: I think the most important lesson is, don’t do any premature optimisations! We hardly know anything about the product you are designing and the use cases you want to achieve, so you probably know best how much resources your plugin is “allowed” to take. Sometimes doing stuff per sample is easier to implement (like in your case), and if you don’t have performance problems you probably shouldn’t worry about it (taking into account that “having no problems” might be defined as “staying under a critical amount of time” that suits your use cases). If you do have performance problems: identify the source of the problems first! You are taking the second step before the first step. Maybe this bit isn’t even the major performance problem in your final product. This might sound a little harsh, but your time is also an important resource and you might have better things to do with it than squeezing the last 0.05 ms out of your plugin.

Concerning the cross fades over blocks: I think if you are thinking about that, you kind of took the wrong path previously. JUCE provides classes for smoothing parameter changes (e.g. SmoothedValue). Update the target value once at the start of a new block and let your per-sample loop do the rest.
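
In code, that pattern looks roughly like this (a sketch, assuming a single smoothed gain parameter; a filter would feed the smoothed value into its coefficient update instead):

#include <JuceHeader.h>   // or the relevant JUCE module header

juce::SmoothedValue<float> smoothedGain;

void prepare (double sampleRate)
{
    smoothedGain.reset (sampleRate, 0.05);            // 50 ms ramp time
    smoothedGain.setCurrentAndTargetValue (1.0f);
}

void processBlock (float* const* channels, int numChannels, int numSamples, float newTargetGain)
{
    smoothedGain.setTargetValue (newTargetGain);      // update the target once per block

    for (int i = 0; i < numSamples; ++i)
    {
        const float gain = smoothedGain.getNextValue();   // advance once per sample...
        for (int ch = 0; ch < numChannels; ++ch)
            channels[ch][i] *= gain;                      // ...and apply the same value to every channel
    }
}

Advancing the smoother in the sample loop and reusing the value for every channel is also what keeps the channels coherent.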

By wrong path do you mean AudioBlock? It doesn’t seem to support much in the way of smoothing and interpolation across a block as far as I can see. I would expect something like start/stop values etc.

Right now I’m more into learning how it all works rather than making a product, so learning about various optimisations seems valuable to me. Otherwise there’s a risk of making a hopeless design that won’t work when it grows in size.

With “wrong path” I meant more in your head :slight_smile:

It’s always good to know the costs of what you are doing. @PluginPenguin explained it in as much detail as one needs. If you are still learning, do it with a per-sample loop and use the classes from JUCE. Don’t get too caught up in the whole performance stuff. It is important, yes – especially with realtime. But computers are only getting faster. Focus on getting the feature working first, and then check where the bottlenecks are.

That doesn’t mean you should shit all over performance. Just don’t get caught up in it before you have anything you can test :wink:

Concerning this specific problem (and knowing costs): reading and writing atomics is (as you can imagine) heavier at runtime than non-atomic variables. This is partially why everyone reads all parameters at the start of the process block and works with what is there at that point in time. So keep that in mind when you read your coefficients.

Concerning AudioBlock: AudioBlock is just a wrapper for a two dimensional float array. Semantically the same as AudioBuffer but a little closer to the hardware and with less comfort if you will. So you shouldn’t expect it to have anything to do with interpolation or smoothing.

But to finish the question: you basically said it already. Either iterate samples first, or calculate the smoothing curve first and apply that to both channels. I’d probably go with iterating samples first, for the comfort (and probably better code readability) of juce::SmoothedValue, and if I have the time later down the road, optimise that bit (if it is indeed a performance bottleneck).

Yes, thanks. I meant that since AudioBlock is a higher abstraction, I expected there to be support for smoothing without taking it apart into channels and iterations.

I attended a small Linux class, and since we had a look at performance profiling with gprof, I made a small test comparing the buffer/channel iteration orders we’ve discussed. Code below.
I tried iterating through an enormous pre-made buffer of buffers, iterating the same buffer over and over, and finally iterating the same buffer but with new values each time. In all cases I read randomized floats, multiplied them by 0.5 and wrote them back again. The total number of samples in all cases corresponds to approx. 1 hour and 15 minutes of stereo sound.

Iterating buffers-and-then-channels is, in percentage terms, this much slower than the reverse iteration order:

  • 14% slower for the buffer of buffers with 64 samples per buffer
  • 14% slower for the buffer of buffers with 4096 samples per buffer
  • 1% slower when iterating through the same buffer over and over
  • 23% slower when iterating through the same buffer but with new random values each time

So in general, buffers-first is slower. I’m not sure which case best represents how audio buffers behave, but I’m a little puzzled by the big difference between the last two. Also, this was in C, not C++, so it’s not quite the same.

Basically I was calling these functions in various configurations:

#include <stdlib.h>

/* example sizes so this compiles; the actual values used in the test weren't posted */
#define BUFFERS  1024
#define CHANNELS 2
#define BUFFER   64

float buffer[BUFFERS][CHANNELS][BUFFER];

int rand_buffers ()
{
  int n, i, j;
  for (n = 0; n < BUFFERS; n++)
    for (i = 0; i < CHANNELS; i++)
      for (j = 0; j < BUFFER; j++)
      {
        buffer[n][i][j] = (float)rand()/(float)RAND_MAX;
        //printf ("%f ", buffer[n][i][j]);
      }
  return 0;
}

int update_channel_first()
{
  int n, i, j;
  for (n = 0; n < BUFFERS; n++)
    for (i = 0; i < CHANNELS; i++)
      for (j = 0; j < BUFFER; j++)
      {
        buffer[n][i][j] = buffer[n][i][j] * 0.5;
        // printf ("%d, %d", i, j);
      }
  return 0;
}

int update_buffer_first()
{
  int n, i, j;
  for (n = 0; n < BUFFERS; n++)
    for (j = 0; j < BUFFER; j++)
      for (i = 0; i < CHANNELS; i++)
      {
        buffer[n][i][j] = buffer[n][i][j] * 0.5; 
        // printf ("%d, %d", i, j);
      }
  return 0;
}

Also, 1 hour and 15 minutes of sound took only a few seconds to process on an i5 laptop, even with the gprof profiling compiled in. That's pretty amazing.