State of the Art Denormal Prevention


#1

Something like this should be part of the JUCE library; it makes all other denormal workarounds unnecessary.

It activates the Flush To Zero (FTZ) and Denormals Are Zero (DAZ) modes on x86 and x64.

At least the code should be generated with SSE2 instructions for floating-point operations (which is standard in VS2012 & Xcode).

Not sure if there is something similar for ARM.

I did performance tests with 32/64-bit Xcode and VS2012 and it works! (At least on my Intel processors.)

Just add "ScopedNoDenormals nsd;" at the top of processBlock() in your audio plugin.

 

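#include <xmmintrin.h>

// RAII guard: sets FTZ and DAZ in the SSE control register (MXCSR)
// and restores the previous value when the scope ends.
class ScopedNoDenormals
{
public:
    ScopedNoDenormals()
    {
        // There is also a C99 way of doing this, but it's not widely supported: fesetenv(...)

        oldMXCSR = _mm_getcsr();                           // read the old MXCSR setting
        const unsigned int newMXCSR = oldMXCSR | 0x8040u;  // set DAZ (bit 6) and FZ (bit 15)
        _mm_setcsr (newMXCSR);                             // write the new setting to MXCSR
    }

    ~ScopedNoDenormals()
    {
        _mm_setcsr (oldMXCSR);                             // put back the setting we found
    }

private:
    unsigned int oldMXCSR;  // _mm_getcsr() returns unsigned int
};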
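
On the ARM question, here is a minimal sketch of how the same kind of guard might be extended to AArch64. It assumes the FZ bit (bit 24) of the FPCR register is the counterpart of FTZ there, and it accesses FPCR through GCC/Clang inline assembly. The names readFpControl, writeFpControl, flushBits, ScopedFlushDenormals and guardRestoresPreviousState are made up for this example and are not part of JUCE. On any other target (including MSVC on ARM64) the sketch deliberately does nothing.

```cpp
#include <cassert>
#include <cstdint>

#if defined(__SSE__) || defined(_M_X64) || defined(_M_IX86)
  #include <xmmintrin.h>
  #define SKETCH_HAS_MXCSR 1      // x86/x64: SSE control register
#elif defined(__aarch64__)
  #define SKETCH_HAS_FPCR 1       // AArch64 with GCC/Clang inline asm
#endif

// Hypothetical helper: reads the floating-point control register.
inline std::uint64_t readFpControl()
{
#if defined(SKETCH_HAS_MXCSR)
    return _mm_getcsr();
#elif defined(SKETCH_HAS_FPCR)
    std::uint64_t value;
    __asm__ volatile ("mrs %0, fpcr" : "=r" (value));
    return value;
#else
    return 0;                     // unsupported target: nothing to read
#endif
}

// Hypothetical helper: writes the floating-point control register.
inline void writeFpControl (std::uint64_t value)
{
#if defined(SKETCH_HAS_MXCSR)
    _mm_setcsr (static_cast<unsigned int> (value));
#elif defined(SKETCH_HAS_FPCR)
    __asm__ volatile ("msr fpcr, %0" : : "r" (value));
#else
    (void) value;                 // unsupported target: no-op
#endif
}

// Bits that switch off denormal handling on the current target.
inline std::uint64_t flushBits()
{
#if defined(SKETCH_HAS_MXCSR)
    return 0x8040u;                    // FZ (bit 15) | DAZ (bit 6)
#elif defined(SKETCH_HAS_FPCR)
    return std::uint64_t (1) << 24;    // FZ bit of FPCR
#else
    return 0;
#endif
}

class ScopedFlushDenormals
{
public:
    ScopedFlushDenormals() : saved (readFpControl()) { writeFpControl (saved | flushBits()); }
    ~ScopedFlushDenormals()                          { writeFpControl (saved); }

    ScopedFlushDenormals (const ScopedFlushDenormals&) = delete;
    ScopedFlushDenormals& operator= (const ScopedFlushDenormals&) = delete;

private:
    std::uint64_t saved;          // control register value found on entry
};

// Usage sketch: the bits are set only inside the scope and restored afterwards.
inline bool guardRestoresPreviousState()
{
    const std::uint64_t before = readFpControl();
    {
        ScopedFlushDenormals guard;
        assert ((readFpControl() & flushBits()) == flushBits());
    }
    return readFpControl() == before;
}
```

In a plugin the guard object would sit at the top of processBlock(), exactly like ScopedNoDenormals above.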

 

 

 

 


#2

Does anybody have reliable information on how denormals and these flags affect AMD processors? I couldn't find any information...

Here is some test code. If somebody has an AMD processor, it would be great to check this out.

 


#ifndef MAINCOMPONENT_H_INCLUDED
#define MAINCOMPONENT_H_INCLUDED
#include "../JuceLibraryCode/JuceHeader.h"
#include <xmmintrin.h>
#include <emmintrin.h>  // SSE2: __m128d, _mm_load_sd, _mm_mul_sd

class ScopedNoDenormals
{
public:
    ScopedNoDenormals()
    {
        oldMXCSR = _mm_getcsr();
        const unsigned int newMXCSR = oldMXCSR | 0x8040u;  // DAZ (bit 6) and FZ (bit 15)
        _mm_setcsr (newMXCSR);
    }

    ~ScopedNoDenormals()
    {
        _mm_setcsr (oldMXCSR);  // restore the previous setting
    }

private:
    unsigned int oldMXCSR;
};

class MainContentComponent   : public Component, public Thread, public Timer
{
public:
   //==============================================================================
   
    
    
    MainContentComponent()
    : Thread("DenormalTest")
    {
        improvement=0.f;
        setSize (600, 400);
    
        startThread();
        startTimer(1000);
    
    }
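    ~MainContentComponent()
    {
        stopTimer();
        // The test thread must be stopped before the object goes away;
        // one pass of run() can take a while, so allow a generous timeout.
        stopThread (10000);
    }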
    void timerCallback() override
    {
        repaint();
    }
    void paint (Graphics& g)
    {
        g.fillAll (Colour (0xff001F36));
        g.setFont (Font (16.0f));
        g.setColour (Colours::white);
        g.drawText ("ScopedNoDenormals is x"+String(improvement,5)+" faster" , getLocalBounds(), Justification::centred, true);
    }
    void resized()
    {
        // This is called when the MainContentComponent is resized.
        // If you add any child components, this is where you should
        // update their positions.
    }
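    double calc()
    {
        // volatile inputs and a volatile sink keep the compiler from folding
        // or discarding the multiplications we want to time
        volatile double denormalInput = std::numeric_limits<double>::min() * 0.5;
        volatile double halfInput = 0.5;
        double sum = 0.0;

        for (int i = 0; i < 10000000; i++)
        {
            double denormal = denormalInput;
            double half = halfInput;
            double result = 0.0;

#if JUCE_32BIT && JUCE_WINDOWS
            __asm
            {
                movsd xmm0, denormal
                movsd xmm1, half
                mulsd xmm0, xmm1
                mulsd xmm0, xmm1
                mulsd xmm0, xmm1
                mulsd xmm0, xmm1
                movsd result, xmm0
            }
#else
            // We use intrinsics so that SSE2 multiplies are what gets measured
            __m128d r1 = _mm_load_sd (&denormal);
            __m128d r2 = _mm_load_sd (&half);
            r1 = _mm_mul_sd (r1, r2);
            r1 = _mm_mul_sd (r1, r2);
            r1 = _mm_mul_sd (r1, r2);
            r1 = _mm_mul_sd (r1, r2);
            _mm_store_sd (&result, r1);
#endif
            sum += result;
        }

        volatile double sink = sum;  // forces the loop result to be computed
        return sink;
    }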
    void run()
    {
        while (!threadShouldExit())
        {
            int64 before=Time::getHighResolutionTicks();
            calc();
            int64 usedTimeDenormal = Time::getHighResolutionTicks() - before;
            before=Time::getHighResolutionTicks();
            {
                ScopedNoDenormals n;
                calc();
            }
            
            int64 usedTimeNoDenormal = Time::getHighResolutionTicks() - before;
            improvement=(double)usedTimeDenormal/(double)usedTimeNoDenormal;
        
        };
    }


    double improvement;
private:
    //==============================================================================
    JUCE_DECLARE_NON_COPYABLE_WITH_LEAK_DETECTOR (MainContentComponent)
};

#endif  // MAINCOMPONENT_H_INCLUDED


#3

Isn't this the same thing that FloatVectorOperations::enableFlushToZeroMode() does?


#4

Oh yes, I didn't realize there was something similar. However, this one also sets the DAZ (Denormals Are Zero) flag, and it restores the old behavior afterwards.

And the name FloatVectorOperations is maybe misleading, because newer compilers (VS2012 onwards, Xcode) use SSE for all float operations by default.

 

How to check whether denormals are currently being flushed to zero:

  float f = std::numeric_limits<float>::min() * 0.5f;

  jassert (f == 0); // you cannot do this in one line, it would be optimized away

  double d = std::numeric_limits<double>::min() * 0.5;

  jassert (d == 0);
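
A sketch of the same check wrapped in a function, on the assumption that volatile is enough to force the multiply to happen at run time instead of being folded into a constant by the compiler. The helper names denormalsAreFlushed and assertDenormalsFlushed are made up for this example, and plain assert stands in for jassert. The result only reflects the FTZ/DAZ state if the compiler really emits SSE code for the multiply.

```cpp
#include <cassert>
#include <limits>

// Hypothetical helper: halves the smallest normal float at run time.
// The exact result is a denormal, so getting 0 back means results are being flushed.
inline bool denormalsAreFlushed()
{
    // volatile forces real loads, a real multiply and a real store
    volatile float smallestNormal = std::numeric_limits<float>::min();
    volatile float half = 0.5f;
    volatile float result = smallestNormal * half;

    return result == 0.0f;
}

// Debug-time check, e.g. right after creating a ScopedNoDenormals object.
inline void assertDenormalsFlushed()
{
    assert (denormalsAreFlushed());
}
```

Without any of the flags set, denormalsAreFlushed() would be expected to return false in a normal (non fast-math) build.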

 

 

More info:

http://carlh.net/plugins/denormals.php

 

 


#5

Thanks chkn for sharing your nifty ScopedNoDenormals!


#6

FYI I've added a new function for setting these flags: FloatVectorOperations::disableDenormalisedNumberSupport()

(Normally I love RAII stuff, but in this case I can't see much advantage in doing that, since the use-case for this is in an audio thread where you never really want to re-enable the flags.)


#7

Jules, please check out this note from the Avid developer site (requires developer login):

https://developer.digidesign.com/index.php?L1=5&L2=13&L3=56

In the context of altering the DAZ+FZ policy for Pro Tools audio render threads, they call out problems from existing plug-ins that alter the denormal behavior and don't set it back the way they found it.  And they specifically offer a RAII implementation to encourage folks to leave the processor flags in the same state your render call got them.

Just because *many* audio use cases don't want these flags re-enabled doesn't mean all.  For example, consider this warning from the OS X SDK:

CAUTION: The math library currently is not architected to do the right thing in the face of DAZ + FZ mode. For example, ceil( +denormal) might return +denormal rather than 1.0 in some versions of MacOS X. In some circumstances this may lead to unexpected application behavior. Use at your own risk.


#8

please don't tell me you use those on *plugins*.

setting those flags should be a one-time thing, called on the start of the audio thread.
that way all code that runs on that thread will have the special flags.

this is something for the host to do, not the plugin!

it's nasty for plugins to do this, especially in hosts that do offline rendering for analysis and other stuff.


#9

I got a couple of warnings in that disableDenormalisedNumberSupport() (+ another one in juce_ZipFile)

juce_FloatVectorOperations.cpp:993:23: Implicit conversion changes signedness: 'unsigned int' to 'const int'
juce_FloatVectorOperations.cpp:994:23: Implicit conversion changes signedness: 'int' to 'unsigned int'
juce_ZipFile.cpp:63:43: Implicit conversion changes signedness: 'unsigned int' to 'const int'
 


#10

Thanks - not sure why I didn't see those!


#11
this is something for the host to do, not the plugin!

No, you cannot expect that. There are hosts which don't set these flags. Absolutely!

Setting these flags isn't expensive. Also, the host may rely on the setting not having changed after the callback.

I think the best way is to restore the old state.

 


#12

But you don't want to mess the host flags....


#13

Very interesting stuff, I'm not an expert on this topic, but doing some research I found out there is no guarantee that this flag is available on 32bit CPUs, see here: https://software.intel.com/en-us/node/513376

More problematically: when writing to this flag when not available a general protection exception will be raised, crashing your program if I understand correctly. This can be checked with the instruction fxsave according to http://softpixel.com/~cwright/programming/simd/sse.php

Does anyone have experience both with checking the availability of the flag, and whether cpus not supporting this are still around?


#14

if you have a PC that doesn't support this flag, then you should update or move to a different machine.

afaik only old 32bit PCs don't support this, like Athlon XP or Pentium III.

I seriously hope you don't have clients running those...


#15

The flags are part of SSE2 so one could probably check SystemStats::hasSSE2() and set the flags accordingly (one of the flags was already available in SSE).

From a quick reading on Wikipedia it seems to be supported in CPUs starting from (and including) Pentium4/Pentium M and Athlon 64. All Intel Macs will support it and I doubt that using current plugins on a machine that is old enough to not support SSE2 will be any fun.
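
A rough sketch of such a guarded call without JUCE, assuming (as suggested above) that SSE2 support is a good enough proxy for both flags being available. The helper names cpuHasSse2 and tryDisableDenormals are made up for this example. It uses the __cpuid intrinsic with MSVC and the __builtin_cpu_supports builtin with GCC/Clang; in a JUCE project SystemStats::hasSSE2() would take the place of cpuHasSse2().

```cpp
#include <cassert>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
  #define SKETCH_IS_X86 1
  #if defined(_MSC_VER)
    #include <intrin.h>     // __cpuid
  #endif
#endif

#if defined(SKETCH_IS_X86) && (defined(__SSE__) || defined(_MSC_VER))
  #include <xmmintrin.h>    // _mm_getcsr, _mm_setcsr
  #define SKETCH_HAS_MXCSR 1
#endif

// Hypothetical helper: asks the CPU whether it supports SSE2.
inline bool cpuHasSse2()
{
#if defined(SKETCH_IS_X86) && defined(_MSC_VER)
    int regs[4] = { 0, 0, 0, 0 };
    __cpuid (regs, 1);                            // CPUID leaf 1: feature flags
    return (regs[3] & (1 << 26)) != 0;            // EDX bit 26 signals SSE2
#elif defined(SKETCH_IS_X86)
    return __builtin_cpu_supports ("sse2") != 0;  // GCC/Clang builtin
#else
    return false;                                 // not an x86 target
#endif
}

// Sets FTZ and DAZ only when SSE2 is reported; returns true if the flags were set.
inline bool tryDisableDenormals()
{
#if defined(SKETCH_HAS_MXCSR)
    if (cpuHasSse2())
    {
        _mm_setcsr (_mm_getcsr() | 0x8040u);             // FZ (bit 15) | DAZ (bit 6)
        assert ((_mm_getcsr() & 0x8040u) == 0x8040u);    // sanity check: both bits stuck
        return true;
    }
#endif
    return false;
}
```

This version sets the flags once and leaves them set; combining the check with a restoring guard like ScopedNoDenormals would work the same way.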


#16

Thanks for the info, sounds good!

 

I have one more question

"At least the code should be generated with SSE2 instructions for floating point operations (which is standard in VS2012 & XCODE)"

Does anyone have sources to back this up? Because now it seems it is the compiler that decides whether we are saved from denormals or still get them, depending on whether it uses SSE or regular x87 float instructions.


#17

I just re-read the documentation, for VS2012

https://msdn.microsoft.com/de-de/library/7t5yh4fd(v=vs.110).aspx

From the link:

/arch:SSE2
Enables the use of SSE2 instructions. This is the default instruction on x86 platforms if no /arch option is specified.

The optimizer chooses when and how to use the SSE and SSE2 instructions when /arch is specified. It uses SSE and SSE2 instructions for some scalar floating-point computations when it determines that it is faster to use the SSE/SSE2 instructions and registers instead of the x87 floating-point register stack. As a result, your code will actually use a mixture of both x87 and SSE/SSE2 for floating-point computations. Also, with /arch:SSE2, SSE2 instructions can be used for some 64-bit integer operations.

It looks like there is no guarantee that the compiler chooses SSE2 operations for 32-bit builds (64-bit builds use SSE for floating point by default, since SSE2 is part of the x64 baseline). But I checked the assembly in my case and found a lot of SSE instructions in it.
(And it solved my denormal issues.)


#18

I suggest that @chkn’s ScopedNoDenormals or an equivalent should be part of JUCE and should also be called in the processBlock method of Projucer’s template for audio plugins (extras/Projucer/Source/BinaryData/jucer_AudioPluginFilterTemplate.cpp).

Otherwise, what I believe currently happens is that every new plugin developer discovers the issue of denormals when, for example, a user complains that in REAPER the plugin takes more CPU when playback is stopped, and it takes them time and effort to investigate the issue… This issue may also hit them again if they forget about it when developing new plugins, hence the need for including this call in the plugin template.

Cheers, Yair


#19

+1.

Great suggestion. (I also use @chkn’s ScopedNoDenormals.)


#20

+1 this looks like the right solution with no downsides…