Timur Doumler Talks on C++ Audio (Sharing data across threads)

Thanks @matkatmusic and @jules for the quick answers. Indeed, several topics have been raised in this thread. I think the first question was rather generic, but then the discussion got wild.

My use case in particular is having some state in a ValueTree containing tracks/clips/notes and needing to share it with the audio thread, which interprets it and puts MIDI messages into a MIDI out buffer. I think @matkatmusic’s answer gives a minimal example of a technique I could use for that, so I’m planning to give it a go (thanks!). Could the DirtyList proposed by @jules also be used for that purpose? (It looks like it might be intended for exchanging messages in the opposite direction?)

In a more recent talk of mine, A lock-free atomic shared_ptr, I mentioned that the original strategy I proposed in 2015 is unworkable because atomic operations on std::shared_ptr are not lock-free.

For publishing an object from the GUI thread to the audio thread in a lock-free fashion, there is actually a whole other programming pattern which I was unaware of at the time which is a much better solution than the refcounted release pool stuff. The pattern is called RCU (Read Copy Update). I am planning to do a talk on RCU at some point soon, in the meantime I encourage you to look it up yourself.

RCU was originally invented for the Linux kernel, and it’s still the context in which it is most often discussed. But RCU can also be adapted to user-space scenarios like the stuff we audio people are doing, and then imho it’s a much better solution to the problem than anything I’ve used before.

If you’re trying to do the reverse direction (audio thread notifying GUI thread of changes), then you should think in the direction of @jules 's dirty list pattern or a lock-free FIFO.

And you can even combine the two approaches to get both directions; I think this is basically what farbot’s RealtimeObject does.


Do you know of (or can you recommend) a solid, fully tested user-space C++ RCU implementation?
IIRC, RCU and hazard pointers are on the road to the next C++ standard (C++26?).
But what about today?

:+1:

What I do is serialize parameter values onto the FIFO, then deserialize them on the audio thread, i.e. no mutable object is shared.
This has the advantage of not needing reference counting, locks, shared pointers, or atomics (except safely hidden within the FIFO implementation).
It also has the advantage that it can be seamlessly extended to work across process boundaries, between the DAW and your ‘GPU Audio’, over a network, or as an interop layer between C++ and some other language (e.g. if your GUI is written in JavaScript or something).
The main advantage, which is harder to quantify, is the lack of headaches. The GUI runs exclusively in one thread, the audio runs in another. All the race conditions evaporate; all the hard-to-debug concurrency weirdness goes away.
Granted, some coders will resist this because you don’t get so many opportunities to write ‘clever’ code.


Thanks @timur and @JeffMcClintock for the latest answers!
RCU looks interesting but I’ll wait for @timur’s talk before diving into it as I’m not adventurous enough to go into that alone.

@JeffMcClintock what you suggest is, I think, also the other option I was proposing: keeping a parallel data structure with the app state on the audio thread and synchronizing it using messages passed over a FIFO. That’s actually what I do for the GUI, because it runs entirely in another process (in JavaScript), so I could consider that for the audio thread as well. I use ValueTree change listeners to trigger sending messages. One question with that strategy, however, is where to find a suitable (lock-free, etc.) FIFO implementation. Can you point me to one? A simple example would be awesome :slight_smile:

Nevertheless, if I still want to try the “passing pointers” approach, there’s a fundamental difference between @timur’s original idea (first post in this thread) and the one described 7 posts above by @matkatmusic, because the latter uses a FIFO to pass the pointers (if I understand correctly). Do you think this is still a suitable strategy, @timur? (Also, @matkatmusic, what FIFO implementation are you using for that? Any pointers?)

Thanks everyone a lot!!

I’m just using the standard juce::AbstractFifo-based one, which is very easy to write.


Hey @matkatmusic I am trying to understand your very interesting reference counted objects + FIFO + release pool strategy and I have a question.

Why does your dataObjectFifo have a capacity of 50? It seems like in the audio thread you’re only ever interested in the most current object. Can’t you have a FIFO of size 1 then? And instead of using a FIFO where push fails when the FIFO is full, you can use one that just keeps overwriting old data? At which point I don’t really understand why you need a FIFO at all?

Actually, what happens in your code if dataObjectFifo is full when the GUI thread calls processor.dataObjectFifo.push(obj)? Does it fail? What do you do then? How did you choose the size 50 in the first place?

Also, is it possible to see the implementation of the Fifo class template somewhere?


With a FIFO, ‘full’ means all memory pointed to by this FIFO is potentially being read by the other thread right now. The rule is therefore that you can’t write anything new into the FIFO (nor to memory pointed to by entries in the FIFO) until the other thread has advanced its ‘read pointer’ to indicate that some free slots have opened up.
You must NEVER EVER write (non-atomic) data on one thread while it is being read by another, because you will experience ‘tearing’, whereby you end up with half the old data and half the new data.


I agree, but I think there are two different concepts interleaved here.

In my experience, with GUI->Audio communication what you want is essentially 3 chunks of memory:

  1. The current data read by the processor
  2. The pending data that’s about to be read by the processor if a change happened
  3. The data currently written to by the GUI.

You can certainly keep #1 and #2 in a FIFO-style container for storage, but the logic isn’t to push and pull in order; rather, you use some logic to atomically update the processor’s read index after a write, making sure the data the processor is currently reading (or about to read) is never touched.

All of this is assuming that what you want in the processor is just the ‘latest’ data from the GUI, which I would say is the most common model, and not a full stream of data being pushed in order from another thread which is a more specialized use case.

Hey @timur

The FIFO size was arbitrarily decided out of habit, as I am always using the FIFO going in the other direction (Audio → GUI).

I showed the Fifo implementation earlier in this thread:

I never considered what happens when the push fails.
My code snippet was a hypothetical approach.

My usual use case for the Ref-Counted Objects + FIFO + Release Pool involves this strategy:

  • The background thread creates the RefCountedObjectPtrs (RCOPs).
  • The background thread adds them to the releasePool, incrementing the reference count.
  • The RCOPs are added to the FIFO, for consumption on the audio thread.
  • An RCOP’s refCount is therefore always 2+ when the audio thread pulls an instance from the FIFO.

a short code example:

struct Processor : juce::AudioProcessor
{
    DataObject::Ptr data;
    ReleasePool<DataObject::Ptr> releasePool;

    BackgroundObjectCreator backgroundObjectCreator { releasePool };
    Processor() 
    {
        data = new DataObject();
        releasePool.add(data); //bumps up reference count to 2. 
    }

    void setStateInformation(...)
    {
        //restore the APVTS..  then 
        backgroundObjectCreator.request( tree.getChildWithName("dataObjectProperties") );
    }

    void processBlock(...) 
    {
         DataObject::Ptr t; //nullptr by default
         while( backgroundObjectCreator.pull(t) ) { ; } //drain the FIFO, keeping only the newest object
         
         if( t != nullptr ) 
              data = t; //decrements reference count of 'data' to 1.

         data->process(buffer);
    }

};

My guess is the read indexes of the 3 chunks would naturally advance like {0, 1, 2, 0, 1, 2}, which is very much like a 3-slot FIFO.


Yes, exactly, and that’s pretty much how I implemented it for my own code.

I just wanted to put some focus on the fact that it’s not a “FIFO” in the traditional sense where one side keeps pushing in order, and the other side keeps pulling in order. There’s some use case specific logic here on top of the FIFO to ensure the (single) processor’s data isn’t touched.

I call this triple buffering, by analogy with the graphics use case.


I would like some feedback on this design, which I got from the CPPLang Slack workspace:

template<class T>
struct ValueSender
{
    void prepare(int numSamples, int numChannels)
    {
        if constexpr (std::is_same<T, juce::AudioBuffer<float>>::value )
        {
            for( auto& buf : buffer )
                buf.setSize(numChannels, numSamples);
        }
    }
                          
    void push(const T& f)
    {
        buffer[static_cast<size_t>(writer.index)] = f; 
        writer.fresh = true;

        writer = reserve.exchange(writer); //switches 'writer' with whatever was in 'reserve'
    }
    bool pull(T& t)
    {
        reader.fresh = false; //mark our old slot stale before handing it back
        reader = reserve.exchange(reader);  //switches 'reader' with whatever was in 'reserve'

        if (!reader.fresh)
        {
            return false;
        }

        t = buffer[static_cast<size_t>(reader.index)];
        return true;
    }
private:
    std::array<T, 3> buffer;
    struct DataIndex
    {
        int index;
        bool fresh = false;
    };

    DataIndex reader = {0};
    DataIndex writer = {1};
    /*
     the reserve always atomically holds the last reader or writer.
     */
    std::atomic<DataIndex> reserve { DataIndex{2} };
};

The storage for ‘DataIndex’ is more than 4 bytes (on most platforms), so on a lot of platforms it’s NOT atomic on its own. The compiler is then going to insert a mutex to protect it. I am assuming this is meant to be lock-free. It isn’t.

There are a ton of errors in this code.

“Ton of errors” is a fairly aggressive phrase. Thread Sanitizer has not flagged anything when I use this class on my Intel Mac.

Which system are you referring to with regard to the size of DataIndex?

Also (responding via phone), I’m wondering if you could use 3 DataIndex members, and store a pointer to one in the atomic instead, and then cache locally the pointed-to DataIndex in the push & pull functions, and use the cached pointer…

DataIndex read {0}, write {1}, extra {2};
std::atomic<DataIndex*> reserve { &extra };

I have to disagree with this one. On both my M1 Max MacBook and my Intel x86 machine, I can squeeze a struct with four floats (so 4×4 bytes) into a lock-free atomic.

@kamedin you can check that by adding static_assert(std::atomic<DataIndex>::is_always_lock_free); to your code. This fails at compile time when you are trying to build for a platform that does indeed insert a mutex to ensure atomic behaviour.

What you have to keep in mind is that this is a single-producer/single-consumer structure: only one thread at a time is allowed to call pull (consume), and only one thread at a time is allowed to call push (produce). If you already know that these will be the message and audio threads, I would add asserts to protect you from forgetting this when, e.g., adding a thread pool or multi-threading your rendering engine. To incorporate that nicely into your software design, you could subclass ValueSender and “override” both methods with an added assert. This little overhead could also be easily excluded from your release builds by hiding the subclass behind an #if JUCE_DEBUG and forwarding the subclass name to your original ValueSender with a typedef or using statement if JUCE_DEBUG is false.

if constexpr (std::is_same<T, juce::AudioBuffer<float>>::value): this would also be a great place to use the C++20 requires clause: void prepare(int numSamples, int numChannels) requires (std::is_same<T, juce::AudioBuffer<float>>::value). This way, the compiler prevents you from calling prepare when it really doesn’t make any sense and you probably didn’t want to call it.


So you don’t want your code to be portable to 32-bit Intel? Limitations like this should be clearly identified in the comments.

Do you know what ‘false sharing’ is? It’s when both your read index and your write index are allocated within the same cache line. This causes extra unnecessary latency (CPU load) while the processor cores sync up both indexes every time you write to one of them.

segregating the indexes looks like:

#ifdef _WIN32 // Apple doesn't support std::hardware_destructive_interference_size yet
	alignas(std::hardware_destructive_interference_size) std::atomic<int> read_ptr;
	alignas(std::hardware_destructive_interference_size) std::atomic<int> m_committed_write_ptr;
#else
    [[maybe_unused]] char cachelinePad[64 - sizeof(std::atomic<int>)];
    std::atomic<int> read_ptr;
    [[maybe_unused]] char cachelinePad2[64 - sizeof(std::atomic<int>)];
    std::atomic<int> m_committed_write_ptr;
#endif

My point really is not to be critical just for the sake of it, but to emphasise that open-source FIFO examples are just a Google search away. Writing your own is likely to result in incorrect or low-performance code.

Intel 32-bit is the most common one that I have to support.

Note that Thread Sanitizer detects race conditions; it won’t detect that this code is not lock-free on all platforms.

The biggest red flag, though, is the ‘bool fresh’ inside the atomic. A simple FIFO requires only two atomics: the read index and the write index. ‘fresh’ appears to be redundant, and it is also the cause of the 32-bit incompatibility. What does it achieve?


I guess you didn’t intend to tag me. Just in case: my example is indeed SPSC. The ready index is the only atomic (the read and write indexes are each accessed by a single thread), and it includes the ‘changed’ flag in its third bit.