I need to compute some lookup tables that hold values for an FFT-based correlation, executed on the GPU. To boost some efficiency, the algorithm expects my pre-computed values to have a bit reversed order in memory, so I need to shuffle a lot of arrays this way. Until now the values were precomputed with Matlab which has the `bitrevorder`

function to do this and then read from a file, however now it should be done on the fly at application startup.

Now I’m looking for the most efficient way to implement this. As this is a task that is performed under the hood in some FFT implementation I thought that maybe someone here knows a nice efficient way of doing this, e.g. some fancy bitshift operations. All my straightforward ideas seem quite clunky.

If you don’t know what I’m talking about, just have a look here: https://de.mathworks.com/help/signal/ref/bitrevorder.html

What is the data type of these values?

`std::complex<float>`

. But note that I don’t want to flip the bits of the actual values but their position in the array according to a bit-reversed indexing, just to avoid any confusion.

E.g. if I had the array `std::array<int, 8> x {1, 2, 3, 4, 5, 6, 7, 8}`

the shuffled version of that would be `{1, 5, 3, 7, 2, 6, 4, 8}`

How big can the array be?

The current size is 16384 elements, that means 14 Bit indexes. My current implementation looks like this

```
template <typename T>
static void permuteInBitReversedOrder (T* dest, const T* source, const int order)
{
const int numItemsToPermute = 1 << order;
for (int i = 0; i < numItemsToPermute; ++i)
{
// Compute the bit-reversed index... can this be optimized?
int iReversed = 0;
for (int shift = 0 ; shift < order; ++shift)
{
iReversed <<= 1;
iReversed |= (i >> shift) & 0x01;
}
dest[i] = source[iReversed];
}
}
```

I was curious, so I googled it. There appears to be this interesting way of doing this:

```
unsigned int ReverseTheBits(register unsigned int x)
{
x = (((x & 0xaaaaaaaa) >> 1) | ((x & 0x55555555) << 1));
x = (((x & 0xcccccccc) >> 2) | ((x & 0x33333333) << 2));
x = (((x & 0xf0f0f0f0) >> 4) | ((x & 0x0f0f0f0f) << 4));
x = (((x & 0xff00ff00) >> 8) | ((x & 0x00ff00ff) << 8));
return((x >> 16) | (x << 16));
}
```

You can also re-order the Elements in-place (in case you want to avoid the memory allocation):

```
template <typename T>
void reorderInPlace(T* items, unsigned int order)
{
const auto numItemsToPermute = 1 << order;
for(int i=0; i < numItemsToPermute; ++i)
{
const auto dstIndex = ReverseTheBits(i) >> (8*sizeof(int) - order);
if(i == dstIndex) continue; // they are the same, do nothing
if(dstIndex < i) continue; // we already did this one
std::swap( items[i], items[dstIndex] );
}
}
```

cheers,

Ben

1 Like

Thanks for that code and for the link, why didn’t I find that website by myself?

I refactored the code a bit and made it templated, so this is my solution now:

```
template <uint8_t numSignificantBits>
static uint32_t reverseTheBits (register uint32_t x)
{
static_assert (numSignificantBits <= 32, "A 32 Bit int cannot have more than 32 significant bits");
x = (((x & 0xaaaaaaaa) >> 1) | ((x & 0x55555555) << 1));
x = (((x & 0xcccccccc) >> 2) | ((x & 0x33333333) << 2));
x = (((x & 0xf0f0f0f0) >> 4) | ((x & 0x0f0f0f0f) << 4));
x = (((x & 0xff00ff00) >> 8) | ((x & 0x00ff00ff) << 8));
x = (x >> 16) | (x << 16);
return x >> (32 - numSignificantBits);
}
template <uint8_t order, typename T>
static void permuteInBitReversedOrder (T* array)
{
constexpr int numItemsToPermute = 1 << order;
for (unsigned int i = 0; i < numItemsToPermute; ++i)
{
const auto iReversed = reverseTheBits<order> (i);
if (i == iReversed) continue; // no need to swap an item with itself
if (iReversed < i) continue; // already did this one
std::swap (array[i], array[iReversed]);
}
}
```

Another thing in this context, I saw the function you copied here from that website uses the `register`

keyword. As I haven’t seen this keyword anywhere else so far I wonder if it is really needed for modern compilers to gain any performance?

Edit: Just put the code into the godbolt.org online compiler and checked what the register keyword did to the code. Results:

- With optimization on, I could not trigger a case where the compiler did not fully optimize the function away
- With optimization off, gcc needed more instructions for the version missing the register keyword, all other compilers produced the same assembly in either way
- clang warned me that register will be deprecated in C++17

No benefit of adding register or inline IMHO - it’s just clutter nowadays.

1 Like