What exactly counts as block-based processing?

Hi all,

I’ve recently been reading about the advantages of block-based processing; one commenter said they saw a 200-and-something percent performance increase, etc. The benefit is apparently that a processor doing the same calculation over and over on a block of data is faster than switching between many different functions with variables in different places, or something like that.

My question, as I’m not 100 percent sure, is: what actually counts as per-block?

for example A:

for (each sample in the block, each channel)
{
    Buffer(sample, channel) = doSynth();
}

This is technically per block, but I don’t think it’s what people are referring to. So going down another level:

example B:

checkMidiBlock()
doOscillatorBlock(buffer)
doFilterBlock(buffer)
doEffectsBlock(buffer)

This I imagine is more what they’re referring to…is this the right level, or would it go even further down?

As in, if ‘the processor prefers doing the same thing over and over again’, well, the first example is doing exactly that, just with a larger ‘same thing’ than the second. So surely, unless you truly maximised the ‘length’ of that repeated instruction sequence, it wouldn’t matter? Unsure; it feels a bit like the coastline paradox.

Block-based is your second example. The first one would be per-sample processing. The biggest disadvantage of per-sample is that all conditions are evaluated per sample, and even with good branch prediction this wastes a lot of CPU cycles doing comparisons and jumps, unless the code inside doSynth() is very simple.
It also often prevents automatic vectorization and reordering of instructions for optimal performance by the compiler: the compiler cannot reorder across any condition inside the per-sample loop.
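
To make that concrete, here is a rough made-up example (not from any real project): the same bypass decision written per sample versus decided once per block, so the inner loop stays branch-free and is easy for the compiler to auto-vectorize.

#include <cstddef>

// condition evaluated for every sample: a compare and possible jump per iteration
void gainPerSample(float* out, std::size_t n, bool bypass)
{
    for (std::size_t i = 0; i < n; ++i)
        out[i] = bypass ? out[i] : out[i] * 0.5f;
}

// condition evaluated once per block: the remaining loop is a plain multiply
// over contiguous floats, which compilers vectorize without trouble
void gainPerBlock(float* out, std::size_t n, bool bypass)
{
    if (bypass)
        return;

    for (std::size_t i = 0; i < n; ++i)
        out[i] *= 0.5f;
}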

Hi! Thanks v much for the reply, I see. If I can be cheeky and ask another…

Given that each function is processing the whole buffer, is it then best to make a sub-buffer for the synth object to reference and pass around, and then copy that into the processor output buffer? Or just pass the JUCE AudioProcessor buffer to each of these?

Yes, that’s fairly common. There are things that you only need to compute every ‘n’ samples, so dividing up the buffer into fixed-size slices works well.
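
For example, something along these lines (just a sketch; MySynthProcessor and the doXxxBlock functions are placeholder names for your own modules), keeping everything in the processor’s own buffer and handing each module an offset and a length per slice:

void MySynthProcessor::processBlock(juce::AudioBuffer<float>& buffer, juce::MidiBuffer& midi)
{
    juce::ignoreUnused(midi);

    constexpr int sliceSize = 32;                   // internal fixed slice size
    const int numSamples = buffer.getNumSamples();

    for (int start = 0; start < numSamples; start += sliceSize)
    {
        const int length = juce::jmin(sliceSize, numSamples - start);

        doOscillatorBlock(buffer, start, length);   // each placeholder module processes
        doFilterBlock(buffer, start, length);       // 'length' samples from 'start'
        doEffectsBlock(buffer, start, length);      // on every channel of 'buffer'
    }
}

The last slice can of course come out shorter than sliceSize, so the modules still need to cope with an arbitrary length.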

Just bear in mind that some hosts, FL Studio in particular, can pass low block sizes (1, 2, 7, etc.).

Block-based processing usually means processing with a fixed block size, e.g. 16 or 32 samples. This has several advantages:

  • compiler can optimize better if blocksize is known at compile time
  • simplifies writing code because blocksize is known, e.g. no allocation of buffers required
  • use of SIMD types simplified
  • all modulations usually done once per block

The downside is that you need an adaptor with at most one block of extra latency for a plugin.
Also, the JUCE way of using AudioProcessor/AudioBuffer for processing discourages this approach: splitting incoming AudioBuffers into smaller chunks is awkward, and AudioProcessor is designed in a way that cannot really exploit the advantages of fixed-size block processing.
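
A toy example of the compile-time point (my own illustration, nothing from a specific library): with the block size as a constant, the inner loop has a known trip count that the compiler can unroll and vectorize, and control-rate work naturally happens once per block.

#include <cmath>

static constexpr int kBlockSize = 16;               // fixed internal block size

struct ToyOscillator
{
    float phase = 0.0f;
    float increment = 440.0f / 48000.0f;            // normalised frequency

    void processBlock(float* out)                   // always exactly kBlockSize samples
    {
        const float inc = increment;                // per-block work (modulation, parameter smoothing) goes here

        for (int i = 0; i < kBlockSize; ++i)        // known trip count: unroll / vectorize
        {
            out[i] = std::sin(phase * 6.28318530718f);
            phase += inc;
            phase -= std::floor(phase);             // branch-free phase wrap
        }
    }
};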

@stenzel Your comment about compiler optimization based on a fixed block size is VERY interesting. Can you point me to any web resources to learn more about this, ideally with code examples? Also, can you share any sample code for “an adaptor with max 1 block extra latency”? At present I just split JUCE blocks into short chunks and accept that, with some block sizes, the final chunk might be smaller than the others, but I like what you’re proposing better.

Sure, here is the adapter for a reverb. The reverb is calculated at a fixed blocksize of 16 in process16(), on aligned data that can safely be cast to a SIMD vector on Intel or ARM; this makes it possible to have a good reverb that runs ~2k times realtime.

// uses an adapter to allow arbitrary block sizes and also unaligned memory
void reverb::process(float* inoutl, float* inoutr, int nsmp)
{
    while (nsmp > 0)                                // loop until the whole host buffer is consumed
    {
        int n = 16 - (bufpos & 0x0F);
        if (n > nsmp) n = nsmp;

        int opposite = (bufpos + 16) & 0x1F;        // opposite side of double buffering
        float* pl = bufl + opposite;
        float* pr = bufr + opposite;

        memcpy(pl, inoutl, n * sizeof(float));      // copy input to aligned buffer for processing
        memcpy(pr, inoutr, n * sizeof(float));

        int bufnext = (bufpos + n) & 0x1F;          // next position
        if (0 == (bufnext & 0x0F)) process16(pl, pr);   // process aligned

        float* rl = bufl + bufpos;
        float* rr = bufr + bufpos;

        for (int i = 0; i < n; i++)                 // add reverb to output
        {
            *inoutl++ += *rl++;
            *inoutr++ += *rr++;
        }
        bufpos = bufnext;
        nsmp -= n;                                  // reduce number of samples to process
    }
}

#if defined(WIN32) || defined(WIN64) || defined(WINDOWS)
#define ALIGN16   __declspec(align(16))
#else
#define ALIGN16  __attribute__((aligned(16)))
#endif

class reverb
{
    float ALIGN16 bufl[32];     // buffer for adapter to allow fixed blocksize internally
    float ALIGN16 bufr[32];     // same for right side
    int bufpos;                 // r/w position in adapter buffer
    ...
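
Called from a JUCE processBlock it looks roughly like this (the surrounding processor is just for illustration, rev being a reverb member); the adapter does not care what block size the host delivers:

void MyPluginProcessor::processBlock(juce::AudioBuffer<float>& buffer, juce::MidiBuffer&)
{
    if (buffer.getNumChannels() < 2)
        return;                                     // the reverb expects a stereo buffer

    rev.process(buffer.getWritePointer(0),          // adds the wet signal onto the dry one,
                buffer.getWritePointer(1),          // whatever nsmp the host happens to pass
                buffer.getNumSamples());
}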
    




The aligned access is a historical artifact. SIMD supports both unaligned and aligned processing, and there is no performance difference anymore.


I can’t quite agree with that.
There is still a performance difference between aligned and unaligned; it has just got (a lot) smaller.
ARM and newer x86/x64 CPUs no longer have the heavy unaligned penalty of the early SSE CPUs, but it still makes a difference whether blocks of memory are aligned to fit cache lines.
Aligning for SIMD usually also means aligning to cache lines, which does make a difference and most likely always will, simply because of how CPUs and caches work.

Also… people are still running those Sandy Bridge CPUs these days, which come with the heavy unaligned penalty. Considering the gains from aligning to cache lines, I believe it’s still well worth the little effort to support aligned SIMD access all the way.
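
And in C++11 or later that effort really is little: alignas on the DSP state is enough to put the buffers on a cache-line (and therefore SIMD) boundary, something like this (names made up):

struct AlignedReverbState
{
    alignas(64) float bufl[32];    // 64 bytes covers the cache line on most current CPUs
    alignas(64) float bufr[32];    // and is more than enough for 16-byte SSE / NEON loads
    int bufpos = 0;
};

static_assert(alignof(AlignedReverbState) == 64, "state starts on a cache-line boundary");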


Bit embarrassing, my code is wrong: it should be process16(bufl + bufnext, bufr + bufnext); where process16() is called, sorry.