Mixing floats and doubles in DSP code for efficiency


I’ve started using a mix of both floats and doubles in my plugin DSP processing.

It seems to me that with simple linear operations there is really not much to be gained by using doubles. But for non-linearities, filters, and anything with feedback that compounds precision errors, it is very worthwhile to process in doubles.
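To make the split concrete, here is a minimal sketch, assuming a simple one-pole low-pass as the feedback stage and a plain gain as the linear stage. The names OnePoleLowpass and applyGain are hypothetical and only illustrate the idea: float samples in and out, double only where the recursion lives.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical feedback stage: float samples in and out, double state inside.
class OnePoleLowpass
{
public:
    // coefficient must be in (0, 1]; smaller values give a lower cutoff.
    explicit OnePoleLowpass (double coefficient) : a (coefficient)
    {
        assert (coefficient > 0.0 && coefficient <= 1.0);
    }

    void process (float* samples, std::size_t numSamples)
    {
        for (std::size_t i = 0; i < numSamples; ++i)
        {
            // The recursion runs in double, so the rounding error that is
            // fed back into the state stays at double precision.
            state += a * (static_cast<double> (samples[i]) - state);
            samples[i] = static_cast<float> (state);
        }
    }

private:
    double a;
    double state = 0.0;
};

// Hypothetical linear stage: a plain gain, done entirely in float.
inline void applyGain (float* samples, std::size_t numSamples, float gain)
{
    for (std::size_t i = 0; i < numSamples; ++i)
        samples[i] *= gain;
}
```

The buffer type never changes, so the rest of the plugin can stay float and only the stages that need it pay for double.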

This seems to be a much more efficient approach than just templating the entire effect as float or double, since in many situations floats are more than fine.

Does that seem logical?


I’ve been told that for crucial operations I should use types that fit the word size of the processor’s registers. Doubles and longs are perhaps the most efficient types, from what I can deduce, but that might not be true for every environment.

What I mean is that you could use doubles for all operations and, as far as I understand, using shorter types adds unnecessary overhead.


Yeah, what you said is pretty spot on for a non-quantitative guide to float types vs. double types.

Generally speaking, you should never go into double territory just because of a gut feeling of “I want this to be precise, I should use doubles!”, because that’s dipping into audiophile snake oil talk. When you’re designing your DSP algorithm, calculate your error margins and check whether regular float types start giving you denormals or unacceptable small-number error in places where you still want useful data (e.g. a long reverb tail, or a really long feedback response).
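One rough way to check an error margin empirically is to run the same recursion in float and in double and measure the gap. The sketch below assumes a bare feedback loop y[n] = x[n] + g * y[n-1] as a stand-in for a long tail; runFeedback, maxAbsDifference and toDecibels are hypothetical helper names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Runs y[n] = x[n] + feedback * y[n-1] with state and coefficient of type T.
template <typename T>
std::vector<double> runFeedback (const std::vector<float>& input, T feedback)
{
    std::vector<double> output;
    output.reserve (input.size());
    T state = T (0);

    for (float x : input)
    {
        state = static_cast<T> (x) + feedback * state;
        output.push_back (static_cast<double> (state));
    }

    return output;
}

// Largest absolute sample difference between two equally long signals.
inline double maxAbsDifference (const std::vector<double>& a, const std::vector<double>& b)
{
    assert (a.size() == b.size());
    double worst = 0.0;

    for (std::size_t i = 0; i < a.size(); ++i)
        worst = std::max (worst, std::abs (a[i] - b[i]));

    return worst;
}

// Converts a linear amplitude (relative to full scale 1.0) to decibels.
inline double toDecibels (double amplitude)
{
    return 20.0 * std::log10 (amplitude);
}
```

Feeding an impulse through runFeedback<float> (input, 0.999f) and runFeedback<double> (input, 0.999) and passing the gap to toDecibels gives a number to compare against the noise floor you care about. Note that the gap includes the rounding of the coefficient itself as well as the accumulated arithmetic error.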

Speed-wise, scalar (non-SIMD) float and double operations will generally execute at the same speed on pretty much any modern 64-bit CPU. Back in the day (10 years ago) it was a much bigger deal speed-wise to process 64-bit types on 32-bit CPUs. If you’re using SIMD, you can get massive speed boosts (usually ~2x) by sticking to 32-bit values, because twice as many of them fit in each vector register.


I do this all the time, mixing double and float variables, because I use iterative algorithms, and they converge faster on the solution when double variables are used. It’s also important for matrix computations. But I don’t see any reason to have anything other than a float at the output of my plug-ins.
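As an illustration of one way precision can matter in an iterative solver (not necessarily the algorithm meant here), the sketch below assumes a Newton iteration for the implicit equation y = tanh(x - k*y), the kind of thing that shows up in feedback non-linearities. The name solveImplicitTanh and its parameters are hypothetical. In float the update step cannot shrink much below the rounding level of y, so a tolerance that double meets in a few iterations may be unreachable or only met by luck.

```cpp
#include <cassert>
#include <cmath>

template <typename T>
struct NewtonResult
{
    T y;             // last estimate of the solution
    int iterations;  // number of Newton steps taken
    bool converged;  // true if the last step was smaller than the tolerance
};

// Solves y = tanh (x - k * y) for y by Newton's method, with k >= 0.
template <typename T>
NewtonResult<T> solveImplicitTanh (T x, T k, T tolerance, int maxIterations)
{
    assert (k >= T (0) && maxIterations > 0);
    T y = T (0);

    for (int i = 1; i <= maxIterations; ++i)
    {
        const T t = std::tanh (x - k * y);
        const T residual = y - t;                       // f(y)  = y - tanh (x - k*y)
        const T slope = T (1) + k * (T (1) - t * t);    // f'(y) = 1 + k * (1 - tanh^2)
        const T step = residual / slope;
        y -= step;

        if (std::abs (step) < tolerance)
            return { y, i, true };
    }

    return { y, maxIterations, false };
}
```

Calling this as solveImplicitTanh<double> (1.0, 1.0, 1.0e-12, 50) and again with float at a looser tolerance shows how the achievable tolerance depends on the type.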


Interesting, IvanC. How come your algorithm converges faster with double?

I am using float in my DSP code at the moment. My code uses AVX vector instructions, and my thinking is that with float I can process twice as much data per AVX instruction as with double.
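The arithmetic behind that is just register width divided by element size. A small sketch, assuming 256-bit AVX registers and the usual 4-byte float and 8-byte double; lanesPerRegister and registerOpsForBlock are hypothetical names:

```cpp
#include <cstddef>

// A 256-bit AVX register is 32 bytes wide.
constexpr std::size_t avxRegisterBytes = 32;

// Number of elements of type T that fit in one AVX register.
template <typename T>
constexpr std::size_t lanesPerRegister()
{
    return avxRegisterBytes / sizeof (T);
}

// Number of register-wide operations needed to touch every sample once,
// rounding up for a partial final register.
template <typename T>
constexpr std::size_t registerOpsForBlock (std::size_t numSamples)
{
    return (numSamples + lanesPerRegister<T>() - 1) / lanesPerRegister<T>();
}

static_assert (lanesPerRegister<float>() == 8, "8 floats per 256-bit register");
static_assert (lanesPerRegister<double>() == 4, "4 doubles per 256-bit register");
```

For a 512-sample block that is 64 register-wide operations in float against 128 in double. Whether this turns into a real 2x speedup depends on memory bandwidth and the rest of the code.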

I have to admit though: I have not done any proper benchmarking of float vs. double.
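A rough starting point for such a benchmark might look like the sketch below. It assumes a trivial multiply-add loop as the workload and uses std::chrono::steady_clock. The names timeMultiplyAdd and BenchmarkResult are hypothetical, and a serious measurement would use the real DSP code, warm-up runs and many repetitions.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

template <typename T>
struct BenchmarkResult
{
    double seconds;  // wall-clock time for all repeats
    T checksum;      // returned so the compiler cannot discard the work
};

// Times a simple per-sample multiply-add over a buffer of type T.
template <typename T>
BenchmarkResult<T> timeMultiplyAdd (std::size_t numSamples, int repeats)
{
    assert (numSamples > 0 && repeats > 0);
    std::vector<T> buffer (numSamples, T (0.5));
    T checksum = T (0);

    const auto start = std::chrono::steady_clock::now();

    for (int r = 0; r < repeats; ++r)
    {
        // Values drift from 0.5 towards 1.0, so no denormals are involved.
        for (auto& sample : buffer)
            sample = sample * T (0.999) + T (0.001);

        checksum += buffer[static_cast<std::size_t> (r) % numSamples];
    }

    const auto end = std::chrono::steady_clock::now();
    return { std::chrono::duration<double> (end - start).count(), checksum };
}
```

Comparing timeMultiplyAdd<float> and timeMultiplyAdd<double> with the same arguments, in an optimised build, gives a first impression. Results will vary with compiler flags and with whether the loop gets auto-vectorised.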


Integers work differently from floating point numbers, so the answer is different.
For integer values, first try to use unsigned versions; they are faster for some things like divisions, IIRC.
For floats vs. doubles it’s a different story: the computation units can handle both, but you can process more floats per cycle than doubles, and there are some vectorized functions that don’t work with doubles. On the other hand, floats have less precision than doubles, so for sensitive computations like badly conditioned matrix computations, doubles should be better.
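A small illustration of the badly conditioned case, assuming a hypothetical nearly singular 2x2 system whose exact solution is x = y = 1, solved with Cramer's rule. The names solve2x2 and nearlySingularError are made up for this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

template <typename T>
struct Solution2
{
    T x;
    T y;
};

// Solves  a*x + b*y = r1,  c*x + d*y = r2  by Cramer's rule.
template <typename T>
Solution2<T> solve2x2 (T a, T b, T c, T d, T r1, T r2)
{
    const T det = a * d - b * c;
    assert (det != T (0));
    return { (r1 * d - b * r2) / det, (a * r2 - c * r1) / det };
}

// Worst-case deviation from the exact solution (1, 1) of
//   x +        y = 2
//   x + 1.0001 y = 2.0001
// when the coefficients and the arithmetic both use type T.
template <typename T>
double nearlySingularError()
{
    const Solution2<T> s = solve2x2<T> (T (1), T (1),
                                        T (1), static_cast<T> (1.0001),
                                        T (2), static_cast<T> (2.0001));

    return std::max (std::abs (static_cast<double> (s.x) - 1.0),
                     std::abs (static_cast<double> (s.y) - 1.0));
}
```

With float the answer is off in roughly the third decimal place, because the tiny determinant amplifies the rounding of 1.0001 and 2.0001. With double the error is many orders of magnitude smaller.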