Timur Doumler Talks on C++ Audio (Sharing data across threads)

I have to disagree with this one. On both my M1 Max MacBook and my Intel x86 machine, I can squeeze a struct with four floats (so 4×4 bytes) into a lock-free atomic.

@kamedin you can check that by adding static_assert(std::atomic<DataIndex>::is_always_lock_free); to your code. This fails at compile time when you are building for a platform that would indeed insert a mutex to ensure atomic behaviour.

What you have to keep in mind is that this is a single-producer / single-consumer structure. Only one thread is allowed to call pull (consume) at a time, and only one thread is allowed to call push (produce) at a time. If you already know that these will be the message and audio threads, I would add asserts to protect you from forgetting this when e.g. adding a thread pool or multithreading your rendering engine. To incorporate that nicely into your software design, you could subclass ValueSender and “override” both methods with an added assert. This little overhead could also easily be excluded from your release builds by hiding the subclass behind an #if JUCE_DEBUG and forwarding the subclass name to your original ValueSender with a typedef or using statement if JUCE_DEBUG is false.

if constexpr (std::is_same<T, juce::AudioBuffer<float>>::value): this would also be a great place to use the C++20 requires clause: void prepare(int numSamples, int numChannels) requires (std::is_same<T, juce::AudioBuffer<float>>::value). This way, the compiler prevents you from calling prepare when it really doesn’t make any sense and you probably didn’t want to call it.

2 Likes

So you don’t want your code to be portable to 32-bit Intel? Limitations like this should be clearly identified in the comments.

Do you know what ‘false sharing’ is? It’s when your read index and write index are allocated within the same cache line. This causes extra, unnecessary latency (CPU load) while the processor cores sync the shared cache line every time you write to either index.

Segregating the indexes looks like this:

#ifdef _WIN32 // Apple doesn't support std::hardware_destructive_interference_size (from <new>) yet
    alignas(std::hardware_destructive_interference_size) std::atomic<int> read_ptr;
    alignas(std::hardware_destructive_interference_size) std::atomic<int> m_committed_write_ptr;
#else
    [[maybe_unused]] char cachelinePad[64 - sizeof(std::atomic<int>)];
    std::atomic<int> read_ptr;
    [[maybe_unused]] char cachelinePad2[64 - sizeof(std::atomic<int>)];
    std::atomic<int> m_committed_write_ptr;
#endif

My point really is not to be critical just for the sake of it, but to emphasise that open-source FIFO examples are just a Google away. Writing your own is likely to result in incorrect or low-performance code.

Intel 32-bit is the most common one that I have to support.

Note that thread sanitizer detects race conditions; it won’t detect that this code is not lock-free on all platforms.

The biggest red flag, though, is the atomic ‘bool fresh’. A simple FIFO requires only two atomics: the read index and the write index. ‘fresh’ appears to be redundant, and it is also the cause of the 32-bit incompatibility. What does it achieve?

1 Like

I guess you didn’t intend to tag me. Just in case: my example is indeed SPSC. The reserve index is the only atomic (the read and write indexes are each accessed by a single thread), and it includes the changed flag in its third bit.

Oh yes, sorry. I “tripped” while reading.

I think it is used to determine whether there is something that can be retrieved with pull.

It is, but it’s unnecessary. The indexes are 0…2, so you can store the flag in the index itself. Restructuring my example to match this one:

#include <atomic>

template <typename T> struct TripleBuffer
{
    bool pull (T& t)
    {
        if (reserve.load (std::memory_order_relaxed) & 4)
        {
            reader = reserve.exchange (reader, std::memory_order_acquire) & 3;
            t = buffer[reader];
            return true;
        }

        return false;
    }

    void push (const T& f)
    {
        buffer[writer] = f;
        writer = reserve.exchange (writer | 4, std::memory_order_release) & 3;
    }

private:
    T buffer[3]{};
    int reader{ 0 }, writer{ 1 };
    std::atomic_int reserve{ 2 };
};

I use it differently:

template <typename T> struct TripleBuffer
{
    auto& read() const { return buffer[reader]; }
    auto& write()      { return buffer[writer]; }

    bool acquire()
    {
        int changed{ reserve.load (std::memory_order_relaxed) & 4 };

        if (changed)
            reader = reserve.exchange (reader, std::memory_order_acquire) & 3;

        return changed;
    }

    void release()
    {
        writer = reserve.exchange (writer | 4, std::memory_order_release) & 3;
    }

private:
    T buffer[3]{};
    int reader{ 0 }, writer{ 1 };
    std::atomic_int reserve{ 2 };
};

so that I read from / write to the buffer itself.

I personally don’t see a need for 32-bit support in this day and age, unless you have some legacy products.

Anyway, using a static_assert to make sure it’s lock-free wherever you use atomics sounds like a good idea to me.

2 Likes

My Raspberry Pi (RPi 4) is 32-bit. And IIRC the 64-bit OS for that platform is for now not widely used.

Indeed, targeting 32-bit requires extra considerations.

However, just to put it in proportion: the RPi has been AArch64-capable since the 3B, and Raspbian64 has been out of beta for a while now.

With 32-bit, the atomics are just one of your problems, since any 64-bit primitive can compromise your performance.

2 Likes

Good to know! I’m personally not planning to ship a 32-bit pro audio product for end users; it seems like it would be a major support overhead unless I specifically developed for that platform.

But I guess if I ever use a machine like that for this purpose, I’ll make sure to install the 64-bit OS on it. Thanks!

1 Like

Hey @matkatmusic, thanks for the info! However, I still don’t understand why you need a FIFO here. You are only sharing the last object that was written on the GUI thread with the audio thread. Why don’t you atomically publish that one value to the audio thread? A FIFO is needed if you have a sequence of objects being pumped from one thread to another, but in this case there is no sequence, just a single object that is periodically updated.

You are literally doing this in the audio thread:

while( backgroundObjectCreator.pull(t) ) { ; }

which discards all objects except the last one written. Why have a FIFO then? What am I missing?

1 Like

The example I showed is doing that.
But that is not how I’m actually using it in my projects.

I use the backgroundObjectCreator on the audio thread.
I split the processBlock into chunks of 16 or 32 samples and request a new object from the backgroundObjectCreator for each of those smaller chunks.
This means the backgroundObjectCreator is asked to create an object multiple times every processBlock. That’s why the FIFO holds multiple objects that need pulling after the backgroundObjectCreator creates each requested element.

It’s probably overkill, but I did it to smooth the changes in the values being used by the object creator.

If you want to talk more about it, send me a DM. I’m sure the design could be improved/optimized but it works well enough for the needs of the project.