Efficiently shuffle a vector into a bit reversed order

I need to compute some lookup tables that hold values for an FFT-based correlation, executed on the GPU. To boost some efficiency, the algorithm expects my pre-computed values to have a bit reversed order in memory, so I need to shuffle a lot of arrays this way. Until now the values were precomputed with Matlab which has the bitrevorder function to do this and then read from a file, however now it should be done on the fly at application startup.

Now I’m looking for the most efficient way to implement this. As this is a task that is performed under the hood in some FFT implementation I thought that maybe someone here knows a nice efficient way of doing this, e.g. some fancy bitshift operations. All my straightforward ideas seem quite clunky.

If you don’t know what I’m talking about, just have a look here: https://de.mathworks.com/help/signal/ref/bitrevorder.html

What is the data type of these values?

std::complex<float>. But note that I don’t want to flip the bits of the actual values but their position in the array according to a bit-reversed indexing, just to avoid any confusion.

E.g. if I had the array std::array<int, 8> x {1, 2, 3, 4, 5, 6, 7, 8} the shuffled version of that would be {1, 5, 3, 7, 2, 6, 4, 8}

How big can the array be?

The current size is 16384 elements, that means 14 Bit indexes. My current implementation looks like this

template <typename T>
static void permuteInBitReversedOrder (T* dest, const T* source, const int order)
{
    const int numItemsToPermute = 1 << order;

    for (int i = 0; i < numItemsToPermute; ++i)
    {
        // Compute the bit-reversed index... can this be optimized?
        int iReversed = 0;
        for (int shift = 0 ; shift < order; ++shift)
        {
            iReversed <<= 1;
            iReversed |= (i >> shift) & 0x01;
        }

        dest[i] = source[iReversed];
    }

}

I was curious, so I googled it. There appears to be this interesting way of doing this:

unsigned int ReverseTheBits(register unsigned int x)
{
     x = (((x & 0xaaaaaaaa) >> 1) | ((x & 0x55555555) << 1));
     x = (((x & 0xcccccccc) >> 2) | ((x & 0x33333333) << 2));
     x = (((x & 0xf0f0f0f0) >> 4) | ((x & 0x0f0f0f0f) << 4));
     x = (((x & 0xff00ff00) >> 8) | ((x & 0x00ff00ff) << 8));
     return((x >> 16) | (x << 16));
}

You can also re-order the Elements in-place (in case you want to avoid the memory allocation):

template <typename T>
void reorderInPlace(T* items, unsigned int order)
{
	const auto numItemsToPermute = 1 << order;

	for(int i=0; i < numItemsToPermute; ++i)
	{
	    const auto dstIndex = ReverseTheBits(i) >> (8*sizeof(int) - order);
	    if(i == dstIndex) continue; // they are the same, do nothing
	    if(dstIndex < i) continue; // we already did this one
	    std::swap( items[i], items[dstIndex] );
	}
}

cheers,
Ben

1 Like

Thanks for that code and for the link, why didn’t I find that website by myself?

I refactored the code a bit and made it templated, so this is my solution now:

template <uint8_t numSignificantBits>
static uint32_t reverseTheBits (register uint32_t x)
{
    static_assert (numSignificantBits <= 32, "A 32 Bit int cannot have more than 32 significant bits");

    x = (((x & 0xaaaaaaaa) >> 1) | ((x & 0x55555555) << 1));
    x = (((x & 0xcccccccc) >> 2) | ((x & 0x33333333) << 2));
    x = (((x & 0xf0f0f0f0) >> 4) | ((x & 0x0f0f0f0f) << 4));
    x = (((x & 0xff00ff00) >> 8) | ((x & 0x00ff00ff) << 8));
    x = (x >> 16)                | (x << 16);

    return x >> (32 - numSignificantBits);
}

template <uint8_t order, typename T>
static void permuteInBitReversedOrder (T* array)
{
    constexpr int numItemsToPermute = 1 << order;

    for (unsigned int i = 0; i < numItemsToPermute; ++i)
    {
        const auto iReversed = reverseTheBits<order> (i);

        if (i == iReversed) continue; // no need to swap an item with itself
        if (iReversed < i)  continue; // already did this one

        std::swap (array[i], array[iReversed]);
    }
}

Another thing in this context, I saw the function you copied here from that website uses the register keyword. As I haven’t seen this keyword anywhere else so far I wonder if it is really needed for modern compilers to gain any performance?

Edit: Just put the code into the godbolt.org online compiler and checked what the register keyword did to the code. Results:

  • With optimization on, I could not trigger a case where the compiler did not fully optimize the function away
  • With optimization off, gcc needed more instructions for the version missing the register keyword, all other compilers produced the same assembly in either way
  • clang warned me that register will be deprecated in C++17

No benefit of adding register or inline IMHO - it’s just clutter nowadays.

1 Like