Timur Doumler Talks on C++ Audio (Sharing data across threads)

With a FIFO, ‘full’ means all memory pointed to in this FIFO is potentially being read by the other thread right now. The rule is therefore that you can’t write anything new into the FIFO (nor to memory pointed to by entries in the FIFO) until the other thread has advanced its ‘read pointer’ to indicate that some free slots have opened up.
You must NEVER EVER write to (non-atomic) data on one thread while it is being read by another, because you will experience ‘tearing’, whereby you end up with half the old data and half the new data.
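
As a minimal illustration (my own sketch, not code from this thread): for a single parameter, publishing through a std::atomic sidesteps tearing entirely, assuming the type is lock-free on the target.

```cpp
#include <atomic>

// A single parameter shared GUI -> audio. On mainstream targets
// std::atomic<double> is lock-free, so the reader can never observe a
// torn value: every load sees either the complete old or complete new value.
static_assert (std::atomic<double>::is_always_lock_free);

std::atomic<double> gain { 1.0 };

void guiThreadSet (double g)   { gain.store (g, std::memory_order_relaxed); }
double audioThreadGet()        { return gain.load (std::memory_order_relaxed); }
```

For anything larger than a lock-free atomic can hold, you need a scheme like the FIFO or triple buffer discussed below.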

1 Like

I agree, but I think there are two terms interleaved here.

In my experience, with GUI->Audio communication what you want is essentially 3 chunks of memory:

  1. The current data read by the processor
  2. The pending data that’s about to be read by the processor if a change happened
  3. The data currently written to by the GUI.

You can certainly have #1 and #2 in a FIFO-style container for storage, but the logic isn’t to push and pull in order; instead, some logic atomically updates the processor’s read index after a write, making sure the data the processor is currently reading, or about to read, is never touched.

All of this is assuming that what you want in the processor is just the ‘latest’ data from the GUI, which I would say is the most common model, and not a full stream of data being pushed in order from another thread which is a more specialized use case.

Hey @timur

Fifo size was arbitrarily decided out of habit, as I am always using the FIFO going in the other direction (Audio → GUI)

I showed the Fifo implementation earlier in this thread:

I never considered what happens when the push fails.
My code snippet was a hypothetical approach.

My usual use-case for Ref-Counted Objects + Fifo + Release Pool involves this strategy:

  • background thread creates the RefCountedObjectPtr’s
  • background thread adds them to the releasePool, incrementing the reference count.
  • RCOP’s are added to Fifo, for consumption on the audio thread.
  • RCOP’s refCount is always 2+ when the audio thread pulls an instance from the fifo.

a short code example:

struct Processor : juce::AudioProcessor
{
    DataObject::Ptr data;
    ReleasePool<DataObject::Ptr> releasePool;

    BackgroundObjectCreator backgroundObjectCreator { releasePool };
    Processor() 
    {
        data = new DataObject();
        releasePool.add(data); //bumps up reference count to 2. 
    }

    void setStateInformation(...)
    {
        //restore the APVTS..  then 
        backgroundObjectCreator.request( tree.getChildWithName("dataObjectProperties") );
    }

    void processBlock(...) 
    {
         DataObject::Ptr t; //nullptr by default
         while( backgroundObjectCreator.pull(t) ) { ; }
         
         if( t != nullptr ) 
              data = t; //decrements reference count of 'data' to 1.

         data->process(buffer);
    }

};

My guess is the read indexes of the 3 chunks would naturally advance like {0, 1, 2, 0 , 1, 2} which is very much like a 3-slot FIFO.

1 Like

Yes, exactly, and that’s pretty much how I implemented it for my own code.

I just wanted to put some focus on the fact that it’s not a “FIFO” in the traditional sense where one side keeps pushing in order, and the other side keeps pulling in order. There’s some use case specific logic here on top of the FIFO to ensure the (single) processor’s data isn’t touched.

I call this triple buffering, by analogy with the graphics use case.

2 Likes

I would like some feedback on this design, which I got from the CPPLang slack workspace:

template<class T>
struct ValueSender
{
    void prepare(int numSamples, int numChannels)
    {
        if constexpr (std::is_same<T, juce::AudioBuffer<float>>::value )
        {
            for( auto& buf : buffer )
                buf.setSize(numChannels, numSamples);
        }
    }
                          
    void push(const T& f)
    {
        buffer[static_cast<size_t>(writer.index)] = f; 
        writer.fresh = true;

        writer = reserve.exchange(writer); //switches 'writer' with whatever was in 'reserve'
    }
    bool pull(T& t)
    {
        reader.fresh = false; //{0,false}
        reader = reserve.exchange(reader);  //switches 'reader' with whatever was in 'reserve'

        if (!reader.fresh)
        {
            return false;
        }

        t = buffer[static_cast<size_t>(reader.index)];
        return true;
    }
private:
    std::array<T, 3> buffer;
    struct DataIndex
    {
        int index;
        bool fresh = false;
    };

    DataIndex reader = {0};
    DataIndex writer = {1};
    /*
     the reserve always atomically holds the last reader or writer.
     */
    std::atomic<DataIndex> reserve { DataIndex{2} };
};

The storage for ‘DataIndex’ is more than 4 bytes, so on a lot of platforms it’s NOT atomic on its own, and the compiler is going to insert locks to protect it. I am assuming this is meant to be lock-free. It isn’t.

There are a ton of errors in this code.

“ton of errors” is a fairly aggressive phrase. Thread Sanitizer has not triggered any warnings when I use this class on my Intel Mac.

Which system are you referring to with regard to the size of DataIndex?

Also (responding via phone), I’m wondering if you could use 3 DataIndex members, and store a pointer to one in the atomic instead, and then cache locally the pointed-to DataIndex in the push & pull functions, and use the cached pointer…

DataIndex read {0}, write {1}, extra {2};
std::atomic<DataIndex*> reserve { &extra };

I have to disagree with this one. On both my M1 Max MacBook and my Intel x86, I can squeeze a struct with four floats (so 4x4 bytes) into a lock-free atomic.

@kamedin you can check that by adding static_assert(std::atomic<DataIndex>::is_always_lock_free); to your code. This fails at compile time when you are building for a platform that does indeed insert a mutex to ensure atomic behaviour.
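
A short sketch of that check; Small and Large here are made-up types for illustration:

```cpp
#include <atomic>
#include <cstdint>

struct Small { std::uint32_t index; };    // 4 bytes: lock-free on common 32- and 64-bit targets
struct Large { std::uint64_t a, b, c; };  // 24 bytes: no mainstream target has a lock-free atomic this wide

// Passes on typical targets; fails at compile time wherever the compiler
// would have to fall back to a mutex:
static_assert (std::atomic<Small>::is_always_lock_free);

// Uncommenting this line would fail to compile on essentially every platform:
// static_assert (std::atomic<Large>::is_always_lock_free);
```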

What you have to keep in mind is that this is a single-producer / single-consumer structure: only one thread is allowed to call pull (consume) at a time, and only one thread is allowed to call push (produce) at a time. If you already know that this will be the message and audio thread, I would add asserts to protect you from forgetting this when e.g. adding a thread pool or multi-threading your rendering engine. To incorporate that nicely in your software design, you could subclass ValueSender and “override” both methods with an added assert. This little overhead could also be easily excluded from your release builds by hiding the subclass behind an #if JUCE_DEBUG and forwarding the subclass name to your original ValueSender with a typedef or using statement if JUCE_DEBUG is false.
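
A rough sketch of that idea using std::thread::id and plain assert instead of JUCE assertions; ValueSender here is a trivial single-slot stand-in, not the class from this thread:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Trivial single-slot stand-in for the real ValueSender, just so the wrapper compiles.
template <class T>
struct ValueSender
{
    void push (const T& v)
    {
        slot.store (v, std::memory_order_relaxed);
        fresh.store (true, std::memory_order_release);
    }

    bool pull (T& v)
    {
        if (! fresh.exchange (false, std::memory_order_acquire))
            return false;
        v = slot.load (std::memory_order_relaxed);
        return true;
    }

private:
    std::atomic<T> slot {};
    std::atomic<bool> fresh { false };
};

// Debug wrapper: remembers the first producer and consumer threads and
// asserts if any other thread ever calls the same method.
template <class T>
struct CheckedValueSender : ValueSender<T>
{
    void push (const T& v) { assertSameThread (producer); ValueSender<T>::push (v); }
    bool pull (T& v)       { assertSameThread (consumer); return ValueSender<T>::pull (v); }

private:
    static void assertSameThread (std::thread::id& expected)
    {
        if (expected == std::thread::id {})        // first call: remember this thread
            expected = std::this_thread::get_id();
        assert (expected == std::this_thread::get_id());
    }

    std::thread::id producer {}, consumer {};
};
```

In release builds you could alias the plain class instead, e.g. behind #if JUCE_DEBUG as the post suggests, so the checks cost nothing when shipping.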

if constexpr (std::is_same<T, juce::AudioBuffer<float>>::value): this would also be a great place to use the C++20 requires clause: void prepare(int numSamples, int numChannels) requires (std::is_same<T, juce::AudioBuffer<float>>::value). This way, the compiler prevents you from calling prepare when it really doesn’t make any sense and you probably didn’t want to call it.

2 Likes

So you don’t want your code to be portable to 32-bit Intel? Limitations like this should be clearly identified in the comments.

Do you know what ‘false sharing’ is? It’s when both your read index and write index are allocated within the same cache line. This causes extra unnecessary latency (CPU load) while the processor cores sync up both indexes every time you write to one of them.

Segregating the indexes looks like:

#ifdef _WIN32 // Apple don't support this yet
	alignas(std::hardware_destructive_interference_size) std::atomic<int> read_ptr;
	alignas(std::hardware_destructive_interference_size) std::atomic<int> m_committed_write_ptr;
#else
    [[maybe_unused]] char cachelinePad[64 - sizeof(std::atomic<int>)];
    std::atomic<int> read_ptr;
    [[maybe_unused]] char cachelinePad2[64 - sizeof(std::atomic<int>)];
    std::atomic<int> m_committed_write_ptr;
#endif

My point really is not to be critical just for the sake of it, but to emphasise that open-source FIFO examples are just a Google away. Writing your own is likely to result in incorrect or low-performance code.

Intel 32-bit is the most common one that I have to support.

Note that Thread Sanitizer detects race conditions; it won’t detect that this code is not lock-free on all platforms.

The biggest red flag though is the ‘bool fresh’ inside the atomic. A simple FIFO requires only two atomics: the read index and the write index. ‘fresh’ appears to be redundant, and it is also the cause of the 32-bit incompatibility. What does it achieve?
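
For reference, a minimal SPSC ring buffer of the kind described, where the two atomic indexes are the entire synchronisation state (my own sketch, not code from this thread):

```cpp
#include <atomic>
#include <cstddef>

// Single-producer / single-consumer ring buffer. No 'fresh' flag is needed:
// emptiness and fullness follow from comparing the two indexes, and each
// index fits in a lock-free std::atomic<size_t> even on 32-bit targets.
template <class T, std::size_t Capacity>
struct SpscFifo
{
    bool push (const T& v)                      // producer thread only
    {
        auto w    = write.load (std::memory_order_relaxed);
        auto next = (w + 1) % Capacity;
        if (next == read.load (std::memory_order_acquire))
            return false;                       // full (one slot is kept empty)
        slots[w] = v;
        write.store (next, std::memory_order_release);
        return true;
    }

    bool pull (T& v)                            // consumer thread only
    {
        auto r = read.load (std::memory_order_relaxed);
        if (r == write.load (std::memory_order_acquire))
            return false;                       // empty
        v = slots[r];
        read.store ((r + 1) % Capacity, std::memory_order_release);
        return true;
    }

private:
    T slots[Capacity] {};
    std::atomic<std::size_t> read { 0 }, write { 0 };
};
```

(For production use you would additionally pad the two indexes onto separate cache lines, as discussed above.)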

1 Like

I guess you didn’t intend to tag me. Just in case, my example is indeed spsc. The ready index is the only atomic (read and write indexes are accessed by a single thread), and it includes the changed flag in its third bit.

Oh yes, sorry. I “tripped” while reading.

I think it is used to deduce whether there is something that can be retrieved with pull

It is, but it’s unnecessary. Indexes are 0…2, so you can store the flag in the index itself. Restructuring my example to match this one:

template <typename T> struct TripleBuffer
{
    bool pull (T& t)
    {
        if (reserve.load (std::memory_order_relaxed) & 4)
        {
            reader = reserve.exchange (reader, std::memory_order_acquire) & 3;
            t = buffer[reader];
            return true;
        }

        return false;
    }

    void push (const T& f)
    {
        buffer[writer] = f;
        writer = reserve.exchange (writer | 4, std::memory_order_release) & 3;
    }

private:
    T buffer[3]{};
    int reader{ 0 }, writer{ 1 };
    std::atomic_int reserve{ 2 };
};

I use it differently:

template <typename T> struct TripleBuffer
{
    auto& read() const { return buffer[reader]; }
    auto& write()      { return buffer[writer]; }

    bool acquire()
    {
        int changed{ reserve.load (std::memory_order_relaxed) & 4 };

        if (changed)
            reader = reserve.exchange (reader, std::memory_order_acquire) & 3;

        return changed;
    }

    void release()
    {
        writer = reserve.exchange (writer | 4, std::memory_order_release) & 3;
    }

private:
    T buffer[3]{};
    int reader{ 0 }, writer{ 1 };
    std::atomic_int reserve{ 2 };
};

so that I read from / write to the buffer itself.
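
A usage sketch of that acquire/release variant; the class is repeated so the snippet stands alone, and the threading comments are assumptions about typical GUI/audio usage:

```cpp
#include <atomic>

// Same triple buffer as above: bit 2 (value 4) of 'reserve' is the
// "changed" flag, the low two bits hold the slot index.
template <typename T> struct TripleBuffer
{
    auto& read() const { return buffer[reader]; }
    auto& write()      { return buffer[writer]; }

    bool acquire()                                // consumer thread only
    {
        int changed { reserve.load (std::memory_order_relaxed) & 4 };

        if (changed)
            reader = reserve.exchange (reader, std::memory_order_acquire) & 3;

        return changed;
    }

    void release()                                // producer thread only
    {
        writer = reserve.exchange (writer | 4, std::memory_order_release) & 3;
    }

private:
    T buffer[3] {};
    int reader { 0 }, writer { 1 };
    std::atomic_int reserve { 2 };
};

// GUI thread: build the value in place in the write slot, then publish it.
void guiThreadUpdate (TripleBuffer<float>& tb, float value)
{
    tb.write() = value;   // no intermediate copy needed; write directly
    tb.release();
}

// Audio thread: acquire() once per block; read() then stays valid and
// untouched by the GUI for the whole block.
float audioThreadProcess (TripleBuffer<float>& tb)
{
    tb.acquire();
    return tb.read();
}
```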

I personally don’t see a need for 32-bit support in this day and age, unless you have some legacy products.

Anyway, using a static assert to make sure it’s lock-free wherever you use atomics sounds like a good idea to me.

2 Likes

My Raspberry Pi (RPi 4) is 32-bit. And IIRC the 64-bit OS for that platform is, for now, not widely used.

Indeed, targeting 32-bit requires extra considerations.

However,
Just to put it in proportion: the RPi has been AArch64 since the 3B, and Raspbian64 has been out of beta for a while now.

With 32-bit, the atomics are just one of your problems, since any 64-bit primitive can compromise your performance.

2 Likes

Good to know! I’m personally not planning to ship a 32-bit pro audio product for end users; it seems like it would be a major support overhead unless I specifically develop for that platform.

But I guess if I ever use a machine like that for that purpose I’ll make sure to install the 64-bit OS on it, thanks!

1 Like

Hey @matkatmusic , thanks for the info! However I still don’t understand why you need a FIFO here? You are only sharing the last object that was written on the GUI thread with the audio thread. Why don’t you atomically publish that one value to the audio thread? A FIFO is needed if you have a sequence of objects that are being pumped from one thread to another, but in this case there is no sequence, just a single object that is periodically updated.

You are literally doing this in the audio thread:

while( backgroundObjectCreator.pull(t) ) { ; }

which is discarding all objects except the last one written. Why have a FIFO then? What am I missing?
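
A sketch of what “atomically publish one value” could look like with a plain atomic pointer; Data, guiPublish, and audioTake are made-up names, and disposal of the object the audio thread takes still needs a release pool or similar (glossed over here):

```cpp
#include <atomic>

struct Data { int value = 0; };   // stand-in for the shared object

// Single 'pending' slot instead of a FIFO: the GUI overwrites any update the
// audio thread hasn't consumed yet, which matches the "latest value wins"
// semantics of the while-pull loop above.
std::atomic<Data*> pending { nullptr };

// GUI thread: build a fresh object, swap it in, and delete any unconsumed
// previous update. Ownership transfers through exchange(), so neither side
// ever deletes an object the other still holds.
void guiPublish (int v)
{
    auto* fresh = new Data { v };
    delete pending.exchange (fresh, std::memory_order_acq_rel);
}

// Audio thread: take the latest update, if any. The caller owns the result
// and must hand it off for deletion on a non-realtime thread.
Data* audioTake()
{
    return pending.exchange (nullptr, std::memory_order_acq_rel);
}
```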

1 Like