[SOLVED] Branching with SIMD

I’m just starting out with SIMD by writing some AVX code. The module will eventually be a Multi Stage Envelope Generator. I started by writing sample based reference code, which I then tried to translate to AVX.

I managed to code a very basic translation, but struggled to get my head around the branching:
In my sample-based code, I check after every sample whether the current segment in the MSEG is finished, and if so move on to the next one. In the AVX code, I process eight samples simultaneously… so no branching?

So far I just ignore it and check for the segment change after the AVX block is finished, which of course leads to inaccurate results.

One option I can think of is to check whether the next eight samples will fit inside the current segment and then use either AVX or sample based code to suit the situation. This seems overly complicated?

To top it all off, I did some profiling and the AVX code seems to give around a 20-40% performance gain, so maybe it’s not worth the hassle after all? That said, it is the first SIMD code I have written, so there may be much more left to optimize.

The nature of SIMD instructions is that, obviously, you’re operating on more than one piece of data per instruction. So SIMD does not lend itself well to calculations/processes where each value relies on the result of the previous value.
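To make the dependency point concrete, here is a minimal sketch. It assumes the segments are linear ramps, which the original post does not state, and the names `rampRecurrence` and `rampClosedForm` are hypothetical. The first function carries a running value from sample to sample, so each result needs the previous one. The second computes every sample from its index alone, which is the form that can be spread across eight lanes.

```cpp
#include <cassert>
#include <cstddef>

// Serial form: each sample is derived from the previous one, so the
// loop carries a dependency from iteration to iteration.
void rampRecurrence(float* out, std::size_t n, float start, float inc)
{
    assert(out != nullptr || n == 0);
    float v = start;
    for (std::size_t i = 0; i < n; ++i)
    {
        out[i] = v;
        v += inc; // the next sample depends on this one
    }
}

// Independent form: sample i depends only on i, so any eight
// consecutive samples could be computed at once.
void rampClosedForm(float* out, std::size_t n, float start, float inc)
{
    assert(out != nullptr || n == 0);
    for (std::size_t i = 0; i < n; ++i)
        out[i] = start + inc * static_cast<float>(i);
}
```

The two forms can differ in the last bits for long ramps, because the running sum accumulates rounding error while the closed form does not.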

You mention checking the size and using either vector or scalar code depending on the situation, and that is quite a common technique. Even for algorithms that can easily be vectorized, if your data size is not a multiple of the machine’s SIMD register size, you will need to revert to scalar code to finish the extra elements at the end. Unless you’re FIFOing to always process in blocks that are multiples of the SIMD size, you’ll probably almost always need to provide both a vectorized and a scalar version of your code.
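The vector-plus-scalar split described above could look roughly like this. It is only a sketch under assumptions: the `Segment` struct, `renderEnvelope`, and the linear ramp shape are all made up for illustration, and a plain eight-wide inner loop stands in for the actual AVX block so the example compiles without special flags.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical segment type: a linear ramp lasting `length` samples.
struct Segment
{
    float start;
    float end;
    std::size_t length; // in samples, must be > 0
};

constexpr std::size_t kLanes = 8; // floats per 256-bit AVX register

std::vector<float> renderEnvelope(const std::vector<Segment>& segs)
{
    std::size_t total = 0;
    for (const Segment& s : segs)
        total += s.length;

    std::vector<float> out(total);
    float* dst = out.data();

    for (const Segment& s : segs)
    {
        assert(s.length > 0);
        const float inc = (s.end - s.start) / static_cast<float>(s.length);
        std::size_t pos = 0;

        // Vector path: taken only while a whole block of eight samples
        // still fits inside the current segment.
        for (; pos + kLanes <= s.length; pos += kLanes)
        {
            // Stand-in for the AVX code: eight independent lanes.
            for (std::size_t lane = 0; lane < kLanes; ++lane)
                dst[pos + lane] = s.start + inc * static_cast<float>(pos + lane);
        }

        // Scalar path: the samples left over before the segment ends.
        for (; pos < s.length; ++pos)
            dst[pos] = s.start + inc * static_cast<float>(pos);

        dst += s.length;
    }
    return out;
}
```

With this layout the segment change always happens on the exact sample, and the scalar loop only ever runs for at most seven samples per segment.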

I’m by no means an expert on SIMD, but these are just my observations based on my current knowledge. Hope this helps.

Ok, thanks for clarifying!

Indeed, SIMD is not the same as your SISD instructions. You have to write your operations on whole vectors and make sure that there are no dependencies between the elements.
In AVX-512 (IIRC), you have masked instructions, so you can do what CUDA does on its batches: run all code paths and use masks to decide which lanes keep which result.
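A rough sketch of the "run all code paths, then mask" idea, applied to a block that straddles a segment change. Everything here is an assumption for illustration: the `Ramp` struct, `blockAcrossBoundary`, and the linear segment shape are hypothetical, and the per-lane select is written as a plain loop rather than with intrinsics. With real intrinsics, the select would map to something like `_mm256_blendv_ps` in AVX or `_mm512_mask_blend_ps` in AVX-512.

```cpp
#include <array>
#include <cassert>

constexpr int kLanes = 8; // floats per 256-bit AVX register

// Hypothetical linear segment: value = start + inc * sampleIndex.
struct Ramp
{
    float start;
    float inc;
};

// Renders one block of eight samples in which the first `boundary`
// lanes still belong to `cur` (continuing at sample `curPos`) and the
// remaining lanes belong to `next`, starting at its sample 0.
std::array<float, kLanes> blockAcrossBoundary(Ramp cur, int curPos,
                                              Ramp next, int boundary)
{
    assert(boundary >= 0 && boundary <= kLanes);
    std::array<float, kLanes> out{};

    for (int lane = 0; lane < kLanes; ++lane)
    {
        // Both code paths are evaluated for every lane.
        const float a = cur.start + cur.inc * static_cast<float>(curPos + lane);
        const float b = next.start + next.inc * static_cast<float>(lane - boundary);

        // The mask picks which result each lane keeps; the other
        // result is simply thrown away.
        const bool inCurrent = lane < boundary;
        out[lane] = inCurrent ? a : b;
    }
    return out;
}
```

The cost is that every lane does the work of both segments, so this pays off mainly when the paths are cheap, as they are for a ramp.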