BR: use of alloca in dsp::Gain can cause stack overflow

The use of alloca() in juce::dsp::Gain here can cause stack overflows if the buffer size is too big. The gain class should own some memory as a member variable, which can be preallocated in its prepare() method.

3 Likes

oh, actually this is just bad design choice, alloca itself is not the problem (if only a few bytes a reserved)

instead it would be better to do something like this (borrowed from audio block)

 for (size_t i = 0; i < numSamples; ++i)
            {
                const auto g = (NumericType) gain.getNextValue();

                for (size_t ch = 0; ch < numChannels; ++ch)
                    getDataPointer (ch)[i] *= g;
            }
1 Like

Terrible cache performance with your solution. Why go over each samples and then each channel? Go over each channel in the outer loop and over earth sample in the inner. You can even use the simd version, as alignment is a non issue nowadays.

1 Like

Why go over each samples and then each channel? Go over each channel in the outer loop and over earth sample in the inner.

Because of the SmoothedValue, which can ramp the values, sample by sample
(other solution would be reset the Smoothed Value state, after each channel)

BTW the whole block could simply be replaced by
AudioBlock& multiplyBy (SmoothedValue<OtherSampleType, SmoothingType>& value)

which also uses SIMD when there is no smoothing

if (! value.isSmoothing())
{
multiplyByInternal ((NumericType) value.getTargetValue());
}

It would probably still be faster to have multiple copies of the SmoothedValue object, so each channel does it independently.

ah I knew that you are not satisfied with it (thats why I added the text in the brackets), but than again you have to recalculate each ramp value multiple-times, might be faster, or not, maybe not, probably
premature optimization

How is this a bad design choice if it solves all of the mentioned issues?
It avoids bad cache performance and unnecessary calculation.

It’s a temporary scratch buffer and doesn’t need write operations on member memory. This is a rare case where alloca is a pretty good choice.

The SO only happens on unreasonable big buffer sizes. Probably should do something similar to the max alloca branch in juce::FFT fallback. If a dsp scratch buffer causes a stack overflow, the requested size isn’t suitable for real time processing anyway. A fallback to temporary heap memory or a fragmented loop on a limited stack chunk is probably the better choice.

Why is alloca preferable to having member memory though? At the very least, this function should check if alloca fails, or there should be a note in the documentation about buffer sizes.

I ran into this in a nonrealtime context, trying to process large buffers in a background thread. It’s quite inconvenient.

because it causes a stack overflow with big buffers, thats what I meant!

I would also doubt that the alloca version is faster, a third probably big block need to be written first, and needs to be read multiple times. That affects cache too, Is it faster? I don’t know, but could be in some situations.

In FFT there is no way around way to have such a buffer, in the gain class, no buffer is needed, and it will still perform well in all use cases.

I use SmoothedValue, which uses above mentioned method, multiple times in my code, and it’s light years away from being any performance hotspot.

So do I want that such a class potentially reserve Heap-Memory (as a fallback, as you proposed) with big buffers, probably not. IMHO this is just over-engineering and premature optimization ( and it is not even proven that it is faster)

2 Likes

While I agree with you, the general use cases of smoothed gain is only needed “once in a while”, and the non-smoothed version which is more common, will be a simple vector operation.

If you’re planning a constantly modulated signal, perhaps using a buffer for the values, or split the vector multiply into smaller chunks will be more efficient anyway?

Creating member memory is just unnecessary. This is templated dsp code. Premature optimization argument doesn’t count here. It’s a minimal basic processing element. The general guideline should be that we avoid state. State should only be created if it’s a lookup or temporal changing variable/buffer.

  1. The gains are never re/used after the processing. So why store it as a member?

  2. It’s 1 gain for N channels. Storing N SmoothedValues conflicts with the API of dsp::Gain. Gain makes no assumption about the number of input channels.

The heap memory solution was just an example of how a similar SO problem was already fixed within juce::dsp. If it’s used in a non-realtime context with huge buffers that cause SO, it could also just use a heap allocation. It’s not the intended use, so this fallback is “acceptable”.

Just securing alloca against SO via sensible fragmentation of the loop is enough. Then it also doesn’t need any documentation. Really should stay an implementation detail. Btw. dsp::Bias also uses alloca.

??? where do you see this

??? this was also never part of my solution

not sure what you are talking about

PS: this thread is so bollocks, everyone wants to shit cleverly (including me), actually this is already duplicated code, because AudioBlock& multiplyBy contains everything.

a doubt this is argument

Just remember the reason of the thread is that an alloca in a dsp routine causes a stack overflow.
I only mentioned that the use of alloca in this context, is questionable, and add an extra workaround for very big allocas is even more questionable, if there is already a solution in the codebase which is considered as fast and working.

for (size_t i = 0; i < len; ++i)
     gains[i] = gain.getNextValue();

THIS is calculated once per process, multiplied and never used again. It’s doesn’t need to be state (member of dsp::Gain). The state is already in Smoothed value. And no, you didn’t suggest this, but benvining.

Your first solution (iterating over channels in the inner loop) had bad cache performance. The alternative (by reFX) is multiple copies of SmoothedValue. Which would indeed probably be faster.

Exactly. You either have the weird loop, multiple smoothed values, or a heap allocated gains member.
Why even bother with any of that if the easiest fix is just fixing the original SO.

The suggested AudioBlock& multiplyBy is not the same. It also suffers from the unfavorable iteration order. The if (! value.isSmoothing()) is good though.

Not an argument, just a hint.

you could also reset smoothedValue to the value from the entrance after each channel iteration (except the last) , this way you didn’t have multiple states :wink:

And calling getNextValue() for each channel because why again? For the sake of not using alloca() ? Come on :joy:

this was only my addition to your critique of multiple states of the refx solution, not my favourite solution.

BTW: an additional block on the stack may also have a effect on the cache :wink:

no

I allocated the memory in Gain::prepare()

Here’s my changed juce_Gain.h file:

Rail

1 Like