NaN error in Convolution


#1

I have had this problem for a long time. At first I was using JUCE 5.2.1 on Windows 10, and the release build sometimes raised a NaN (not a number) exception that muted one or two channels. The problem doesn’t occur in the debug build or on macOS. After one or two weeks of debugging, I added ScopedNoDenormals to my code before the convolution starts and updated the convolution class to the JUCE 5.3.2 develop branch. The situation improved, but one of my colleagues still runs into the same problem occasionally and is not able to reproduce it on demand. Has anyone else met the same problem?

BTW, several different IRs are used in the convolution. When my colleague changes the IR being used, the convolution object is re-initialized, and then the plugin behaves normally.


#2

The easiest way to work out what’s going on is to capture the problem in the debugger. A release build makes this more difficult, but apparently not impossible (I’ve not tried this myself, and it’s quite an old solution):

Add an extra branch if you come across a NaN and set a breakpoint. From there you should be able to work backwards to find the source of the NaN.
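As a rough illustration of that extra branch, here is a minimal sketch in plain C++. The helper name firstNonFinite is made up for this example and is not part of JUCE. It assumes the build does not use fast-math style compiler flags, which can cause std::isfinite checks to be optimised away.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical helper (not a JUCE function): returns the index of the first
// NaN or Inf in the buffer, or -1 if every value is finite.
long firstNonFinite (const float* data, std::size_t numSamples)
{
    for (std::size_t i = 0; i < numSamples; ++i)
    {
        if (! std::isfinite (data[i]))
            return static_cast<long> (i);   // set the breakpoint on this line
    }

    return -1;
}
```

Calling this on the buffer before and after each processing stage narrows down which stage first produces the bad value.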


#3

Thank you for your fast reply. I have successfully traced the NaN error in a Release build before: it occurs in the FFT calculation during the convolution process. As far as I can remember, the error occurs in the function butterfly.

Now the problem is that I am not able to reproduce this error on my computer. It may take a very long time: for example, I reloaded my plugin more than 100 times in a row and the error didn’t show up. Maybe I need to roll back to the previous version and test it again.


#4

I successfully reproduced the NaN error again after loading and unloading my plugin hundreds of times in Cubase. The captured exception is at

[screenshot]

but the NaN is actually caused by one input value to the function FFTFallback::performRealOnlyInverseTransform() at line 157 in juce_FFT.cpp.

I checked this NaN value. The input data starts at 0x427942A0, and you can see that the NaN is right in the middle of the input complex vector (size = 1024), i.e. the offset is 1024 * 4.

Then I also checked the inputs of the convolution object, and everything looks fine. Could you help me with this? Any ideas will be appreciated. Thank you in advance.


#5

Since it was very hard to reproduce, is there a chance that the input value was actually already a NaN?
I have no information on whether that is likely or unlikely, I’m just pointing it out as a possible reason…


#6

Thanks for the report! It seems to be a very complicated issue to reproduce, but one that might happen often enough to be serious.

It’s complicated to do anything about this until we know the cause for sure. A few remarks here:

  • Are you 100% sure that the input signal didn’t have any NaN or Inf prior to the exception?
  • Are you 100% sure that the issue is located in the FFT code and not in the Convolution code?

One way to track down the issue would be to check for an Inf or NaN at different places in the source code and throw an exception if one is found. That way we could detect when the issue happens.


#7

Thanks for your suggestion, I will try your method and get back to you later.


#8

Maybe I didn’t make myself clear. In the previous post I said “actually the Nan error is caused by one input value for function FFTFallback::performRealOnlyInverseTransform() at line 157 in juce_FFT.cpp”. By this I mean that the NaN is already present in the input parameter of FFTFallback::performRealOnlyInverseTransform(). I also checked the input parameter of the convolution object and there is nothing wrong with it, so I think something may be happening inside the convolution algorithm.


#9

I added some code before calling the process function of the Convolution class, and also added some code in the convolution class:

[screenshot]

My colleagues helped me reproduce the error twice. Each time the debugger stopped at the line shown in the figure:

[screenshot]

So yes, I am confident that the input data is OK. Could you help me?


#10

OK, the next step would be to take those 1024 input values out of that complicated chain and see if you can reproduce the issue in a much simpler app.

Create a command line test project that just configures an FFT object in the same way and calls performRealOnlyInverseTransform on those values. If that’s still producing NaNs then send me the input values and we’re in a much better position to work out what’s going on.


#11

My guess is that there’s somewhere where you’re reading beyond the end of a buffer, and that’s where the junk value is coming from.

Generally if you’ve messed up some maths or a value has gone out of range, you’ll probably end up with an INF, but a NaN is more likely to be because you’re reading junk data and interpreting it as a float.
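To illustrate why junk memory can show up as a NaN, here is a minimal sketch assuming 32-bit IEEE-754 floats. The helper name floatFromBits is made up for this example. Any bit pattern whose 8 exponent bits are all ones and whose mantissa is non-zero decodes as NaN, which covers roughly 1 in 256 of all possible 32-bit patterns.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical helper: reinterprets a raw 32-bit pattern as a float, which is
// effectively what happens when uninitialised memory is read as audio data.
float floatFromBits (std::uint32_t bits)
{
    static_assert (sizeof (float) == sizeof (std::uint32_t),
                   "this sketch assumes a 32-bit float");

    float value;
    std::memcpy (&value, &bits, sizeof (value));
    return value;
}

// True if the raw pattern decodes to a NaN.
bool bitsAreNaN (std::uint32_t bits)
{
    return std::isnan (floatFromBits (bits));
}
```

For example, 0xFFFFFFFF and 0x7FC00000 decode as NaN, 0x7F800000 decodes as +Inf, and 0x3F800000 decodes as 1.0.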


#12

Thank you for your reply. I ran a test using the method suggested by t0m and the output values are normal, so apparently the memory is being corrupted. The question now is what causes it. I used AddressSanitizer on Mac and have not got anything from it.

Could you give me some advice? The problem seems easier to reproduce on Windows.


#13

Tracking things like this down can be difficult, especially if the symptoms are very intermittent.

If you’re using version control (and if not, you should be!) then you could bisect your history to work out which commit triggered the problem. You could also slowly remove features from your app/plug-in until everything works as expected. This might give you a clue as to which bit of your code is causing the trouble. Unfortunately with memory corruption it will be hard to know for sure as it may be that you simply make the symptoms less obvious.


#14

I have the same problem (using juce 5.3.2). About 5-10% of the time when I load my plug-in I get dead output channels caused by NaNs. I’ve removed all my own code apart from the call to Convolution::process() and I still have the problem, which makes me think that this isn’t a memory fault that I’ve introduced.

Just like the OP, I’ve verified that the input data to ConvolutionEngine::processSamples() is good and that the problem starts with a single NaN on the input to FFTFallback::performRealOnlyInverseTransform().

I believe that the problem is in ConvolutionEngine::updateSymmetricFrequencyDomainData(). It seems to me that outputData points to 2*FFTSize floats but that it contains only (FFTSize + 1) real value floats when it’s passed to updateSymmetricFrequencyDomainData(), due to this line:

FloatVectorOperations::copy (outputData, outputTempData, static_cast<int> (FFTSize + 1));

updateSymmetricFrequencyDomainData() turns these (FFTSize + 1) real value floats into 2*FFTSize floats, which are interpreted as FFTSize complex numbers when passed to performRealOnlyInverseTransform(). The problem is that updateSymmetricFrequencyDomainData() does not write to the float at index (FFTSize + 1). This is the imaginary component of the middle value of the FFT, which (due to conjugate symmetry) should be 0 if the FFT size is even and the signal is real.

updateSymmetricFrequencyDomainData() contains the line:

samples[1] = 0.f;

which is apparently ensuring that the DC component of the FFT is real, but if FFTSize is even then the same needs to be done to the middle value of the FFT. i.e.

samples[FFTSize + 1] = 0.f;

Since outputData[FFTSize + 1] is not written to explicitly, and bufferOutput is not initialised to zero when setSize() is called on it, its value will be uninitialised rubbish, which may or may not be NaN. This would explain why the bug is hard to reproduce.

I’ve verified (as best I can) that zeroing that extra element fixes my problem, but I’d welcome any input as I’m working to a release deadline and this is quite a serious bug for us. The comments for updateSymmetricFrequencyDomainData() imply that it’s reversing another method, prepareForConvolution(), so it may be that the proper fix needs to look at that method too. Also if FFTSize is odd, I guess that the extra element shouldn’t be zeroed…
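To make the argument concrete, here is a small self-contained sketch. It does not use the JUCE FFT or the actual ConvolutionEngine code: it is a naive inverse DFT over an interleaved complex buffer, and all function names are made up for this example. It assumes the same layout described above, where bin k occupies floats 2k (real) and 2k + 1 (imaginary), so index n + 1 is the imaginary part of the middle bin n / 2.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Naive O(n^2) inverse DFT. `spectrum` holds n complex bins interleaved as
// (re, im, re, im, ...), i.e. 2 * n floats. Only the real part of each output
// sample is returned.
std::vector<float> naiveInverseDftReal (const std::vector<float>& spectrum, std::size_t n)
{
    const double twoPi = 6.283185307179586;
    std::vector<float> out (n);

    for (std::size_t t = 0; t < n; ++t)
    {
        double sum = 0.0;

        for (std::size_t k = 0; k < n; ++k)
        {
            const double angle = twoPi * static_cast<double> (k * t) / static_cast<double> (n);
            sum += spectrum[2 * k] * std::cos (angle) - spectrum[2 * k + 1] * std::sin (angle);
        }

        out[t] = static_cast<float> (sum / static_cast<double> (n));
    }

    return out;
}

// Spectrum of a unit impulse (every bin is 1 + 0i), except that the float at
// index n + 1 is set to `slotValue` to stand in for whatever an uninitialised
// buffer happened to contain there.
std::vector<float> impulseSpectrumWithSlot (std::size_t n, float slotValue)
{
    std::vector<float> spectrum (2 * n, 0.0f);

    for (std::size_t k = 0; k < n; ++k)
        spectrum[2 * k] = 1.0f;

    spectrum[n + 1] = slotValue;
    return spectrum;
}

// True if every sample in the block is NaN.
bool allNaN (const std::vector<float>& samples)
{
    for (float s : samples)
        if (! std::isnan (s))
            return false;

    return true;
}
```

With n = 8 and the slot set to 0, the inverse transform gives back the impulse. With the slot set to NaN, every output sample becomes NaN, because that one float takes part in the sum for each output sample. That would be consistent with a whole channel going dead from a single bad value.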

Thanks in advance!


#15

Could you please share your stripped down project, maybe as a PIP? It would be useful to have a reference using only JUCE code.


#16

I might be able to organise that if necessary, but just to be clear, I don’t believe the problem is specific to my project.

As far as I can tell, the problem is that there is 1 element of ConvolutionEngine::bufferOutput that’s not written to before it’s read, so it could be junk. You should be able to reproduce the bug every time by inserting this at line 102 of juce_Convolution.cpp:

bufferOutput.setSample(0, static_cast<int>(FFTSize + 1), NAN);

This inserts a NaN at the index, FFTSize + 1, that won’t otherwise be written to by the convolution code. This seems like a fair test to me since bufferOutput isn’t initialised to zero (see juce_Convolution.cpp line 101) and so could easily end up like this anyway. When it does have a NaN in that location, the bug appears.
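The same idea can be generalised into a debugging aid: fill a scratch buffer with NaN right after allocating it, run the code under test, and then see which slots were never overwritten. The sketch below uses std::vector rather than a JUCE AudioBuffer, and the function names are made up for this example. Note that it cannot tell an unwritten slot from one where the code legitimately stored a NaN.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Fills the buffer with quiet NaNs so that a read-before-write shows up
// deterministically instead of depending on leftover memory contents.
void poisonWithNaN (std::vector<float>& buffer)
{
    std::fill (buffer.begin(), buffer.end(), std::numeric_limits<float>::quiet_NaN());
}

// Returns the indices that still hold NaN, i.e. the slots that have not been
// overwritten since poisonWithNaN() was called.
std::vector<std::size_t> unwrittenSlots (const std::vector<float>& buffer)
{
    std::vector<std::size_t> result;

    for (std::size_t i = 0; i < buffer.size(); ++i)
        if (std::isnan (buffer[i]))
            result.push_back (i);

    return result;
}
```

For instance, poisoning an 8-float buffer and then writing only the first 5 floats leaves indices 5, 6 and 7 reported as unwritten.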


#17

Quick question for people who saw the bug. I think it might be solved just by adding this instruction at line 107:

bufferOutput.clear();

Could you try this and tell me if it solves your issues?


#18

Yep, I found that writing to that element, or clearing it initially, solved the problem.


#19

Ok so that’s what I thought.

I tried to reduce the size of the copies and other operations as much as possible. In the frequency domain some bins are expected to be “useless” because the input samples are real: some bins are systematically zero, and the others are just copies of the left-half bins. So most of the time I just don’t do anything there, expecting that their values will never have any impact on anything (except, as it turns out, the inverse FFT calculation).

And I forgot to initialise one of the buffers, so the bin at FFTSize + 1 was zero most of the time and occasionally something else (that’s what happens when you forget to initialise a variable). My unit tests therefore passed every time, and I didn’t see anything related to this bug until now.

So thank you very much to all the participants for the help!

@jules, @ed95 and @t0m: could you add the instruction bufferOutput.clear(); at line 107 of the Convolution.cpp file on the develop branch? Thanks in advance!


#20

Done.

The change will appear shortly, once it clears our CI.