Fast SIMD based random number generator? Whether and how to vectorize Random::nextInt()

I am working on a synthesizer that needs 16 random numbers (float, range -1.0f…+1.0f) per voice and frame. So for 16 voices this means 256 random numbers per frame.

Instead of calling Random::nextInt() 256 times, what I would like to do is use AVX intrinsics to create 8 random numbers at a time.

After reading some papers on random number generation, I have learnt that the topic is not trivial at all. There are lots of algorithms, and you need to know your requirements well when selecting one.

In my case, the most important requirement is that it has to be as fast as possible.

Let’s consider Random::nextInt() as implemented in JUCE

int Random::nextInt() noexcept
{
	seed = (int64) (((((uint64) seed) * 0x5deece66dLL) + 11) & 0xffffffffffffLL);

	return (int) (seed >> 16);
}

At first glance this algorithm seems to have been chosen for good real-time performance. But was that the only criterion? Meaning: is there a faster algorithm?

Also: When thinking about vectorizing nextInt(), I notice that

	seed = (int64) (((((uint64) seed) * 0x5deece66dLL) + 11) & 0xffffffffffffLL);

uses signed and unsigned 64-bit int types. I have not yet figured out the exact purpose of using these types (although I suspect it has an influence on the quality of the generated random numbers), but since I want to calculate 8 random values simultaneously, I only have 8 × 32 bits available.
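For concreteness, the recurrence can be written out as a self-contained 48-bit linear congruential generator (the same constants as java.util.Random); this is my own reading of the code, not anything from JUCE:

```cpp
#include <cassert>
#include <cstdint>

// Minimal scalar sketch of the recurrence in Random::nextInt(): a 48-bit LCG.
// The state and the multiplier 0x5deece66d need more than 32 bits, which is
// why the JUCE code casts through 64-bit types.
struct Lcg48
{
    uint64_t state = 1;

    int32_t next() noexcept
    {
        state = (state * 0x5deece66dULL + 11) & 0xffffffffffffULL; // mod 2^48
        return (int32_t) (state >> 16);                            // top 32 of the 48 state bits
    }
};
```

From seed 1, the first output of this recurrence is 384748, and the mask keeps the state below 2^48 forever.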

Is there a random number generation expert out there?

Again, my questions are:

  • Was the algorithm used in the juce::Random class chosen exclusively for the best real-time performance, or is there a better choice in my case?
  • What is the purpose of the 64-bit int values used in nextInt()? Can I use 32-bit ints instead? What difference does it make?

Thank you very much for any reply!

P.S.: Apparently, based on this, Lehmer’s generator is the fastest random number generator out there. Wikipedia suggests the following 32-bit implementation:

uint32_t lcg_parkmiller(uint32_t *state)
{
	const uint32_t A = 48271;

	uint32_t low  = (*state & 0x7fff) * A;			// max: 32,767 * 48,271 = 1,581,695,857 = 0x5e46c371
	uint32_t high = (*state >> 15)    * A;			// max: 65,535 * 48,271 = 3,163,439,985 = 0xbc8e4371
	uint32_t x = low + ((high & 0xffff) << 15) + (high >> 16);	// max: 0x5e46c371 + 0x7fff8000 + 0xbc8e = 0xde46ffff

	x = (x & 0x7fffffff) + (x >> 31);
	return *state = x;
}
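As a sanity check, this divisionless Carta variant should produce exactly the same sequence as std::minstd_rand, which implements the same multiplier-48271 Lehmer generator with an explicit modulo (from seed 1: 48271, 182605794, …):

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// lcg_parkmiller as quoted above (Carta's divisionless variant).
uint32_t lcg_parkmiller (uint32_t* state)
{
    const uint32_t A = 48271;

    uint32_t low  = (*state & 0x7fff) * A;
    uint32_t high = (*state >> 15)    * A;
    uint32_t x = low + ((high & 0xffff) << 15) + (high >> 16);

    x = (x & 0x7fffffff) + (x >> 31);  // fold mod 2^31 - 1
    return *state = x;
}
```

Comparing a few thousand outputs against std::minstd_rand seeded identically is a cheap way to verify a port before benchmarking it.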

I used the Park-Miller-Carta approach (courtesy of Robin Whittle) in DSP Testbench. It was fast enough that I didn’t bother vectorising it :wink:

I’m no random number expert, but a few points. The Random::nextInt implementation uses long long constants that don’t fit in 32 bits, so it must work on 64. The casts to unsigned are there because unsigned overflow is well defined (it wraps), while signed overflow is undefined behaviour, and this recurrence relies on wrapping. With two’s-complement integers the results are bit-compatible between signed and unsigned, except when shifts are involved: left-shifting a negative value was undefined before C++20, and right-shifting one is implementation-defined (in practice arithmetic, i.e. sign-extending), so right shifts genuinely differ between signed and unsigned.
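The shift difference is easy to demonstrate with a minimal example (assuming a two’s-complement target, which is universal in practice and mandated since C++20):

```cpp
#include <cassert>
#include <cstdint>

// Right-shifting a negative signed value sign-extends (arithmetic shift) on
// common targets, while the same bit pattern shifted as unsigned zero-fills
// (logical shift). This matters for steps like seed >> 16.
int32_t  shiftSigned   (int32_t x,  int n) { return x >> n; } // implementation-defined pre-C++20; arithmetic in practice
uint32_t shiftUnsigned (uint32_t x, int n) { return x >> n; } // always logical
```

For example, shiftSigned(-2, 1) yields -1, while the same bits shifted unsigned (0xFFFFFFFE >> 1) yield 0x7FFFFFFF.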

The problem with vectorizing any of these is that there is a feedback loop (each output depends on the previous one), so you can’t trivially translate the algorithm to get consecutive values. But you can, and it may apply to your case, vectorize a number of parallel, independent streams. For example, with SSE:

// you'd usually want all this to be forceinlined
struct Int4v
{
    __m128i v;
    Int4v()                            : v{ _mm_setzero_si128() } {}
    Int4v (__m128i a)                  : v{ a } {}
    Int4v (int a)                      : v{ _mm_set1_epi32 (a) } {}
    Int4v (int a, int b, int c, int d) : v{ _mm_set_epi32 (d, c, b, a) } {}
    operator __m128i() const           { return v; }
    Int4v operator>> (int a) const     { return _mm_srli_epi32 (v, a); }
    Int4v operator<< (int a) const     { return _mm_slli_epi32 (v, a); }
};

Int4v operator+ (Int4v a, Int4v b) { return _mm_add_epi32(a, b); }
Int4v operator* (Int4v a, Int4v b) { return _mm_mullo_epi32(a, b); } // lane-wise low 32 bits (SSE4.1); _mm_mul_epi32 would widen even lanes to 64 bits
Int4v operator& (Int4v a, Int4v b) { return _mm_and_si128(a, b); }

// this should use vectorcall on Windows
template <typename IntegerT> IntegerT nextRandom (IntegerT x)
{
	const IntegerT a{ 48271 };
	auto low{ (x & 0x7fff) * a }, high{ (x >> 15) * a };
	x = low + ((high & 0xffff) << 15) + (high >> 16);
	return x = (x & 0x7fffffff) + (x >> 31);
}

Then you can use the template with different SIMD wrappers, or plain ints.
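For example, the same template instantiated with plain uint32_t (no SIMD at all) already reproduces the scalar Park-Miller sequence, which also gives a convenient way to unit-test the SIMD instantiations later:

```cpp
#include <cassert>
#include <cstdint>

// The template from the post, usable with a SIMD wrapper or a plain integer.
template <typename IntegerT> IntegerT nextRandom (IntegerT x)
{
    const IntegerT a { 48271 };
    auto low  { (x & 0x7fff) * a };
    auto high { (x >> 15) * a };
    x = low + ((high & 0xffff) << 15) + (high >> 16);
    return x = (x & 0x7fffffff) + (x >> 31);
}
```

From seed 1, the scalar instantiation yields 48271 and then 182605794, matching the known minstd sequence; a SIMD instantiation should produce the same values in every lane when all lanes start from the same seed.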

I’ve used the random number generators introduced in C++11 quite a lot. E.g. this code outputs random floats in the -1.0 to 1.0 range.

#include <cstdio>
#include <random>

int main()
{
    std::uniform_real_distribution<double> dist {-1.0, 1.0};
    std::mt19937 mt {std::random_device()()};

    for (int i = 0; i < 100; ++i)
        printf("random number = %f\n", dist(mt));
}

Just for the record: I assume you are referring to Park-Miller-Carta Pseudo-Random Number Generators


Yep

Thanks! For the record: This is the Mersenne Twister algorithm (Mersenne Twister - Wikipedia).
After some quick reading, I gather that while it is not the fastest random number generator, it is a good choice for applications that require random numbers of high statistical quality (https://www.researchgate.net/publication/228468728_Performance_and_Quality_of_Random_Number_Generators/link/00463526aefbf3b18e000000/download, Section 4).

Cool, that looks very similar to what I think I need! Independent lines are fine.

Now I have three choices:

  • Create an AVX version of the original JUCE Random class. In this case, since 64-bit variables are used, “only” 4 random numbers can be generated at a time, but at first glance the algorithm looks particularly simple/fast, so maybe in the end this could be the preferable solution.

  • Use one of the two proposed algorithms that work with 32-bit variables. In this case, 8 random numbers can be generated at a time, but it might take a bit more effort to calculate them.
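Either way, the synth ultimately needs floats in -1.0…+1.0. A scalar sketch of that last step (my own addition; the AVX analogue would use _mm256_cvtepi32_ps followed by _mm256_mul_ps and _mm256_sub_ps with the same constants):

```cpp
#include <cassert>
#include <cstdint>

// Map a Park-Miller output in [1, 2^31 - 2] onto roughly [-1.0f, +1.0f]
// (float rounding can touch the endpoints). The scale 2 / (2^31 - 1)
// stretches the generator's range to a width of 2, then the -1 centers it.
inline float toBipolar (uint32_t r) noexcept
{
    return (float) r * (2.0f / 2147483647.0f) - 1.0f;
}
```

Small inputs land near -1, mid-range inputs near 0, and inputs near 2^31 land near +1.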

Again for the record: AVX version of Int4v (and the operators):

struct Int8v
{
	__m256i v;
	__forceinline Int8v() : v{ _mm256_setzero_si256() } {}
	__forceinline Int8v(__m256i a) : v{ a } {}
	__forceinline Int8v(int a) : v{ _mm256_set1_epi32(a) } {}
	__forceinline Int8v(int a, int b, int c, int d, int e, int f, int g, int h) : v{ _mm256_set_epi32(h, g, f, e, d, c, b, a) } {}
	__forceinline operator __m256i() const { return v; }
	__forceinline Int8v operator>> (int a) const { return _mm256_srli_epi32(v, a); }
	__forceinline Int8v operator<< (int a) const { return _mm256_slli_epi32(v, a); }
};

__forceinline Int8v operator+ (Int8v a, Int8v b) { return _mm256_add_epi32(a, b); }
__forceinline Int8v operator* (Int8v a, Int8v b) { return _mm256_mullo_epi32(a, b); } // lane-wise low 32 bits; _mm256_mul_epi32 would widen even lanes to 64 bits
__forceinline Int8v operator& (Int8v a, Int8v b) { return _mm256_and_si256(a, b); }

Instead of declaring your own struct that maps those operators to SIMD intrinsics, you could simply use juce::dsp::SIMDRegister, which basically does exactly that (incl. built-in support for ARM NEON) – why reinvent the wheel?

Thanks, that’s good to know!

Here is a quick and dirty (compiling but untested/probably buggy!) implementation of the three choices. What I will attempt is (1) make all three generators work properly, (2) do some benchmarking and (3) have a brief look at the quality of the generated numbers.

class FastRandomSource8AVX_2
	{
	private:
		const Int8v A{ 48271 };
		Int64_4v states_juce{ 1,2,3,4 };
		Int8v states_kamedin{ 1,2,3,4,5,6,7,8 };
		Int8v states_andrewj{ 1,2,3,4,5,6,7,8 };
	public:
		// Note: Int64_4v is assumed to be a 64-bit x 4 wrapper analogous to Int8v.
		// AVX2 has no full 64x64-bit multiply, and _mm256_cvtsepi64_epi32 requires
		// AVX-512VL, so this variant needs extra work on plain AVX2.
		__forceinline __m128i __vectorcall nextRandom_juce() //only four at a time, call twice for 8 values
		{
			states_juce = ((states_juce * 0x5deece66dLL) + 11) & 0xffffffffffffLL;
			states_juce = _mm256_srli_epi64(states_juce, 16);
			return _mm256_cvtsepi64_epi32(states_juce);
		}
		__forceinline Int8v __vectorcall nextRandom_kamedin()
		{
			const Int8v a{ 48271 };
			auto low{ (states_kamedin & 0x7fff) * a };
			auto high{ (states_kamedin >> 15) * a };
			states_kamedin = low + ((high & 0xffff) << 15) + (high >> 16);
			return states_kamedin = (states_kamedin & 0x7fffffff) + (states_kamedin >> 31);
		}
		__forceinline Int8v __vectorcall nextRandom_andrewj()
		{
			const Int8v a{ 16807 };
			auto low { (states_andrewj & 0xffff) * a };
			auto high{ (states_andrewj >> 16) * a };
			states_andrewj = low + ((high & 0x7fff) << 16) + (high >> 15);

			auto mask = _mm256_srai_epi32 (states_andrewj, 31); // all-ones where bit 31 is set; a signed cmpgt against 0x7fffffff could never trigger
			states_andrewj = _mm256_sub_epi32 (states_andrewj, _mm256_and_si256 (mask, _mm256_set1_epi32 (0x7fffffff)));

			return states_andrewj;
		}
	};

Idea: a cheap (translating to efficient ;-)) trick could be to loop over buffers of precalculated random numbers; one would just need to get rid of the offset. Or, in a separate low-priority thread, the numbers in the loops could be continuously replaced with fresh ones using any available algorithm, such as the Mersenne Twister or juce::Random. For my purposes this could work :slight_smile:
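A minimal sketch of that idea (all names made up): a power-of-two table that the audio thread reads with a wrapping index, while any slower generator may refill entries at leisure:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Precomputed noise table: the audio thread only increments an index and
// reads; a low-priority thread could overwrite entries with any slow,
// high-quality generator. The power-of-two size makes the wrap a cheap AND.
struct NoiseTable
{
    static constexpr std::size_t size = 4096;
    std::array<float, size> values {};   // fill with random floats up front
    uint32_t readIndex = 0;              // audio-thread-only state

    float next() noexcept
    {
        return values[readIndex++ & (size - 1)];
    }
};
```

Note this sketch keeps the read index on the audio thread only; concurrent refilling from another thread would need some care (e.g. relaxed atomics per entry) to stay race-free.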

After a quick search I found the xoshiro/xoroshiro generators and the PRNG shootout.

I was under the impression that SIMDRegister only worked with floating-point types, but it seems that’s not the case.

Some points.

  • Whether you use JUCE classes or your own, I’d really recommend encapsulating all the intrinsics in them.
  • If you only need int32, work on int32. There are enough random generators out there to pick the one that fits your case.
  • Remember there are no 256-bit vectors in ARM.

For the record: This is the paper that describes the xoshiro/xoroshiro algorithms: http://vigna.di.unimi.it/ftp/papers/ScrambledLinear.pdf
