[DSP module discussion] Fast function computation classes

dsp_module

#1

Hello guys !

Today I’d like to start a discussion about the classes in the DSP module that help with fast computation of functions.

So back in July I designed the functions in the class FastMathApproximations, which are, as the name says, fast approximations of math functions such as sinh, tanh, cosh, exp, log(1+x) etc. They are faster than their std counterparts, but they also have a limited useful range, which is given in the documentation. Basically, the error between the std function and the approximation is quite small in this range, and can be significant outside it.

These functions have been designed mainly by using Padé approximants (continued fractions), which means they are made from additions, multiplications and one division. They are useful in contexts where you need to call such functions several times for every sample and the std variants are the slowest part of the processing path, which happens a lot in “virtual analog” audio effects for example, or when a variable needs to be updated at audio rate.
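
To show the general structure (a polynomial numerator, a polynomial denominator and a single division), here is the classic low-order Padé approximant of exp around 0 - just an illustration of the idea, not necessarily the exact formula used in FastMathApproximations:

// Illustrative only: the [2/2] Pade approximant of exp(x) around 0,
// built from a few multiply-adds plus one division. Accuracy degrades
// quickly as |x| grows, which is why a useful range must be documented.
template <typename FloatType>
FloatType fastExpPade (FloatType x) noexcept
{
    auto num = FloatType (12) + x * (FloatType (6) + x);   // 12 + 6x + x^2
    auto den = FloatType (12) - x * (FloatType (6) - x);   // 12 - 6x + x^2
    return num / den;
}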

I’m also looking for some reviews from the JUCE community, because I know there are tons of other ways to optimize such functions. I know developers who use a large set of fast approximated functions all the time, with accuracy depending on the context, and sometimes with some crude oversampling embedded in the approximations. I have also seen code written entirely in assembly. And I’m wondering whether it could be useful to provide additional versions of the functions with different levels of accuracy…

For fast computation of functions, there are also two other classes called LookupTable and LookupTableTransform, which allow developers to create one-dimensional lookup tables with linear interpolation and a uniform grid for any function, specified via a lambda, with a given size and for a given range of input values. They are very neat for waveshapers, for optimizing function code as well, and even for oscillators, as you might have already seen in the DSP module demo application. They have been designed by the great @zsolt Zsolt Garamvölgyi from the ROLI team, and I think that class should be used everywhere in your code when possible; lookup tables are one of the essential tools a DSP engineer should be using all the time.
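
For example, a waveshaper based on LookupTableTransform can be set up roughly like this (a sketch from memory - please check the class documentation for the exact constructor and member names):

// Sketch: approximating std::tanh over [-5, 5] with 512 points.
// Exact signatures should be checked against the LookupTableTransform docs.
juce::dsp::LookupTableTransform<float> shaper ([] (float x) { return std::tanh (x); },
                                               -5.0f, 5.0f, 512);

float shapedSample = shaper.processSample (inputSample); // inputSample is clamped to the range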

So tell me if you have already used these classes, what you think about them, and whether you have any suggestions or remarks :wink:

Ivan


#2

It would be interesting to know how much faster the approximations actually are (benchmarks).


#3

I did some benchmarks already, obviously, and I’ll provide them along with some curves as soon as I have time to finish my PlotComponent class, since it is important to see both the speed improvement and the accuracy :wink:


#4

I looked into your code a few weeks back and also did some benchmarking. Your functions are 4-6x faster than the built-in exact functions on OSX. However, I ended up creating my own approximations for several reasons. If I understand correctly, your approximations use regular Padé approximants, which means precision is best around 0 and gets worse the larger the values are. Of course precision is fairly high in the range you indicate as safe, but I do not like the resulting “gradient” in precision, especially not in functions like sine and cosine. I chose a different route and used Maxima to create Padé-like least-squares approximations that are forced through some important values like 0. This gives an error curve that does not depend on the value magnitude, which I find better for audio work.

The other issue I have with the approximations is the limited range. For sine and cosine, reducing the argument to the repeating range usually takes more time than calculating the approximation itself, and considering that, the gain from the approximation doesn’t look so great anymore. For pow and log the limited range is an even bigger problem for the stuff I’m doing, but luckily some tricks exist, based on the floating-point representation and exp2/log2, that make it possible to use an approximation over the full range.

Here’s what I came up with for log2, and I would be very interested in discussing this topic further!

Note: this code only works on little-endian CPUs.

#include <cstdint> // needed for uint64_t

typedef union {
    float flt;
    struct {
        unsigned int mant:23;
        unsigned int exp:8;
        unsigned int sign:1;
    } parts;
} IEEEfloat;

typedef union {
    double dbl;
    uint64_t uint64;
    struct {
        unsigned int mant0:32;
        unsigned int mant1:20;
        unsigned int exp:11;
        unsigned int sign:1;
    } parts;
    struct {
        uint64_t mant:52;
        unsigned int exp:11;
        unsigned int sign:1;
    } parts64;
} IEEEdouble;

template <typename Type>
inline Type log2_r2_4(Type x) {
    // pade least squares 2 in [2..4] .. forced to be precise for 2 and 4.. so there are no jumps in the curve when exp switches
    // result is always between 1 and 2

    const Type a0 = -2.828648765985416; 
    const Type a1 = 2.1369512631133;
    const Type a2 = 0.71149949580445;
    const Type b1 = 1.403358216041611;
    const Type b2 = 0.1211338278439405;
    
    Type x2 = x*x; // slower than horner, but better for precision around 0
    Type nom = a0 + x*a1 + x2*a2;
    Type denom = static_cast<Type>(1.0) + x*b1 + x2*b2;
    return nom/denom;
}

inline float approx_log2f(float x) {
    // split into mantissa and exponent	
    IEEEfloat u;
    
    u.flt = x;
    int exp = u.parts.exp - 128; // 127 for 1..2
    u.parts.exp = 128; 
    
    float mant = u.flt; // mant is now in the [2...4] range
    return (float)exp + log2_r2_4(mant); 
}

inline double approx_log2(double x) {
    // split into mantissa and exponent	
    IEEEdouble u;
    
    u.dbl = x;
    int exp = u.parts.exp - 1024; // 1023 for 1..2
    u.parts.exp = 1024;
    
    double mant = u.dbl; // mant is now in the [2...4] range
    return (double)exp + log2_r2_4(mant);
}
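
A quick way to sanity-check this against the standard library could look like the sketch below (assuming the functions above are in scope; it only covers positive, normal inputs, since zero, negatives and denormals are not handled):

// Rough accuracy check of approx_log2f over a range of positive values.
#include <algorithm>
#include <cmath>
#include <cstdio>

int main()
{
    float maxError = 0.0f;

    for (float x = 0.001f; x < 1000.0f; x *= 1.01f)
        maxError = std::max (maxError, std::abs (approx_log2f (x) - std::log2 (x)));

    std::printf ("max abs error: %g\n", maxError);
    return 0;
}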

#5

Thanks for the feedback !

Indeed, in this area, when designing a “fast approximated function”, we have to make choices about speed and accuracy, but also, more specifically, about the way the error is distributed over the usable range. I chose a Padé approximant, with the error increasing as the input moves away from a centre value, but I could have chosen something with a bounded error using least-squares approximations.

Your code seems quite interesting, but it is also a perfect example of what I was saying: in this area it is difficult to come up with code that will “work in all situations”. That’s why I wanted to start a discussion here :slight_smile: I’m wondering what the next iterations of the class FastMathApproximations should look like, and how to present them, with documentation and maybe a specific benchmark application in the examples folder… I also wonder how I should design different versions of the same function, so that most of the time, when you need a specific approximation, you can pick one in JUCE fitted to your needs.

I have also seen, on various occasions, approximations using assembly code, or taking into account the specific behaviour of compilers and target machines. Since JUCE is made to be used on various platforms, ideally we should have approximations suited to each one as well, but doing something like that to optimize the code further would be a huge amount of work :slight_smile:


#6

Indeed. There are also lots of approximations available online, for instance fastapprox, which was developed for neural networks and in my opinion isn’t accurate enough for DSP. It’s also very interesting to look into the implementations of the standard functions in various libs - most of the stuff is available online, and this way one gets to see some really obscure, hard-to-read code :).

What would be very nice would be to have the approximations templated like your code, so it’s possible to use them with the SIMD classes you mentioned in another thread. Of course that’s currently a problem with my log2 approximations, but the 32-bit float one in particular is very fast and accurate, precisely because it doesn’t deal at all with negative numbers and special cases and uses the exponent trick to limit the approximation range.

By the way, in case anyone is looking for information about how to create approximations: I found Maxima to be a great tool, but in the meantime I also discovered the Eigen C++ lib, which has awesome speed for matrix operations and can be used to do least-squares fits of functions to points.

For Padé approximations there is a “trick” to turn the fit into a set of linear equations; you then insert a few hundred exact points, which gives a large linear system that can easily be solved using Eigen. More information here: http://blog.revolutionanalytics.com/2017/04/fitting-rational-functions-with-lm.html

I found it’s easy to pin the values at 0 and 1, because in these cases the Padé form simplifies nicely and the resulting system stays linear.
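
To make the linearisation concrete, here is a rough sketch with Eigen, assuming a [2/2] rational form with the denominator’s constant term normalised to 1 (the coefficient layout and the helper name fitPade22 are mine, not taken from the linked post); multiplying through by the denominator gives a system that is linear in the five remaining coefficients:

// Sketch: least-squares fit of f(x) ~ (a0 + a1 x + a2 x^2) / (1 + b1 x + b2 x^2).
// Rewriting f(x) * (1 + b1 x + b2 x^2) = a0 + a1 x + a2 x^2 gives one linear
// equation per sample point, solved here as an overdetermined system.
#include <Eigen/Dense>
#include <cstddef>
#include <vector>

Eigen::VectorXd fitPade22 (const std::vector<double>& xs, double (*f) (double))
{
    const auto n = (Eigen::Index) xs.size();
    Eigen::MatrixXd A (n, 5);
    Eigen::VectorXd rhs (n);

    for (Eigen::Index i = 0; i < n; ++i)
    {
        const double x = xs[(std::size_t) i];
        const double y = f (x);

        A (i, 0) = 1.0;         // coefficient a0
        A (i, 1) = x;           // coefficient a1
        A (i, 2) = x * x;       // coefficient a2
        A (i, 3) = -x * y;      // coefficient b1
        A (i, 4) = -x * x * y;  // coefficient b2
        rhs (i) = y;
    }

    // Least-squares solution [a0, a1, a2, b1, b2]; exact values can be pinned
    // by substituting them into the system by hand before solving.
    return A.colPivHouseholderQr().solve (rhs);
}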


#7

Hello guys,

I was looking at the class juce_LookupTable and was wondering about its implementation. I have the following questions:

  1. void LookupTable<FloatType>::initialise(): Why is the size of the internal data set to numPointsToUse + 1?
  2. void LookupTable<FloatType>::prepare(): Here the last (superfluous??) entry of data is set to the penultimate entry. So far so good. But why is this last entry added, and why are guardIndex, getGuardIndex() and getRequiredBufferSize() introduced? Are these programming techniques I don’t know yet?
  3. Would it be possible to drop guardIndex, getGuardIndex() and getRequiredBufferSize(), set the size of data to numPointsToUse, and use data.size() instead of getNumPoints() etc.?
  4. Why is the index of get() resp. getUnchecked() of type FloatType instead of const int?
  5. Does it make sense to implement a method void set (const int indexToChange, FloatType newValue) for manipulating a single value in the table?
  6. Does it make sense to declare the members data etc. as protected (instead of private) to make them available in child classes?

Thanks in advance!


#8

@zsolt


#9

Hi @Dirk,

1-4. The point of this class is that non-integer indices can be accessed via linear interpolation. Any floating-point index is allowed if 0 <= index < numPoints. Linear interpolation requires loading two neighbouring values. The extra guard element is used to avoid branching when numPoints - 1 <= index < numPoints.

  5. What would you use set() for? Can’t you achieve the same by calling initialise() with a suitable lambda?

  6. Why would you want to access data directly?


#10

Hi @zsolt,

thanks for your very quick response!

4.: Ah, I see. Because of the interpolation it is possible to use fractional indices.

1-3.: Say I create a table with numPoints = 4 and the lambda is [](int i) { return i; }. So the data in the table is

data[0] = 0
data[1] = 1
data[2] = 2
data[3] = 3
data[4] = 3
numPoints = 4

Now when I access the table via get(2.5) I get (as expected) 2.5. But when I access it via get(3.5) I get 3. As 3.5 exceeds the maximum index numPoints - 1, there is no interpolation between data[3] and data[4]. Actually data[4] could be dropped, right? In that case the call to getGuardIndex() in get() could also be replaced by data.size().

Furthermore, getGuardIndex() calls getRequiredBufferSize() and getNumPoints(). Here getRequiredBufferSize() simply adds one to its argument, and getNumPoints() returns data.size() minus one. To me it looks like this could be simplified… don’t get me wrong - I’m still learning :wink:

5.: I would use a method set() for manipulating a single value. Assuming you want to change the table in real time, it could be simpler and faster to modify just the index you want instead of re-initialising all numPoints in the table via initialise() each time.

6.: I was thinking about creating my own method for initialisation in a child class, for example initialise (FloatType* dataArray, size_t numPointsToUse) - or something like that. And maybe I would like to customise LookupTable in other ways, so it would be handy to have access to data in the child class.


#11

No, the guard index couldn’t be dropped without adding an explicit guard branch like this:

if (index > numPoints - 1)
    return data[numPoints - 1];

Those simple helper functions are extremely likely to be inlined (i.e., replaced with their body) by the compiler, so there’s no overhead. IMHO giving such functions meaningful names results in much more readable code than writing raw index arithmetic with comments.
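
For anyone following along, the guarded lookup essentially boils down to something like this (an illustrative sketch, not the actual JUCE source):

// Illustrative: branch-free linear interpolation using a guard element.
// 'data' holds numPoints + 1 values, with data[numPoints] == data[numPoints - 1],
// so reading data[i + 1] stays in bounds even when index approaches numPoints - 1.
float getUncheckedSketch (const float* data, float index)
{
    auto i    = (int) index;          // truncation == floor for non-negative indices
    auto frac = index - (float) i;    // fractional part in [0, 1)

    auto x0 = data[i];
    auto x1 = data[i + 1];            // safe at the top end thanks to the guard element

    return x0 + frac * (x1 - x0);
}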

Can you give me an actual example of why you’d want to change the lookup table data in real-time and how you’d want to customise the LookupTable?


#12

Okay, I understand. It’s a question of coding style. And you’re right that there’s no overhead from these helper functions.

I had the idea of a waveshaper whose transfer function can be modified during processing, and I can imagine wanting to change only a range or a single value in the data table. As I mentioned above, I would like to add my own initialise() and set() methods.


#13

OK, I guess making data protected would allow for customisation without cluttering the LookupTable interface with rarely used methods. @jules, what do you think?


#14

Well, in API terms, making something protected is basically the same as making it public.

You’re the main stakeholder in that class - if you think it’s sensible to allow people to poke around in the data member then that’s OK with me, but once we give access to that member, it means we can’t change the way it works internally in the future without breaking code.


#15

Hi @Dirk,

I’m thinking about implementing a LookupTableInterface class with static initialise() and getUnchecked() methods you could use to implement your custom lookup table class. The methods could be used like this:

juce::Array<float> lutData;
juce::dsp::LookupTableInterface::initialise (lutData, [] (float x) { return std::sqrt (x); }, 64);
auto y = juce::dsp::LookupTableInterface::getUnchecked (lutData, 10.4f);

What do you think, would this allow you to do what you want to do?


#16

Hi @zsolt,

After thinking about it for a while, I figured out that I can solve the problem of initialising the table with my own data by using a functor, so for me there’s no need for a custom initialise() method anymore.
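
Something like this is what I mean, assuming initialise() keeps its current lambda-from-index signature (myData is just a hypothetical container of precomputed values):

// Sketch: filling the LookupTable from an existing array via the lambda,
// instead of a dedicated initialise (FloatType*, size_t) overload.
// 'myData' is a hypothetical std::vector<float> of precomputed values.
juce::dsp::LookupTable<float> table;
table.initialise ([&myData] (size_t i) { return myData[i]; }, myData.size());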

The new LookupTableInterface class looks interesting. And yes - this would allow me to do what I want. This way one would have access to a single value as well as to the whole lookup table. What I don’t like is the fact that in this case the data array doesn’t belong to the LookupTable; the object orientation ‘disappears’ and it tends to be a bit C-style. Furthermore, lutData[10.4f] looks nicer than juce::dsp::LookupTableInterface::getUnchecked (lutData, 10.4f).

What speaks against implementing a method LookupTable::set (const int indexToChange, FloatType newValue)?

Incidentally, you asked me how I’d want to customise the LookupTable. I also had the idea of choosing the lookup table’s interpolation type. At the moment it’s linear, but it would be cool if Lagrange interpolation were also available.