Timur Doumler Talks on C++ Audio (Sharing data across threads)

Has anyone who chimed in seen this talk?

from the description:

The C++ elites have been talking for years about value semantics, immutability and sharing 
by communicating. A better world without mutexes, races, observers, command patterns 
and more lies ahead! When it comes to doing it in practice, it is not so easy. One of the 
main problems lies in our data structures... 

Immutable data structures don't change their values. They are manipulated by producing 
new values. The old values remain there, and can be read safely from multiple threads 
without locks. They provide structural sharing, because new and old values can share 
common data--they are fast to compare and can keep a compact undo-history. As such, 
they are great for concurrent and interactive systems: they simplify the architecture of 
desktop software and allow servers to scale better. They are the secret sauce behind 
the success of Clojure or Scala, and even the JavaScript crowd is loving it via 
Facebook's Immutable.js. 

We are presenting Immer, a C++ library implementing modern and efficient 
immutable data structures. 

In this session, we will talk about the architectural benefits of immutability and show 
how a very efficient yet powerful persistent vector type can be built using 
state-of-the-art structures (Relaxed Radix Balanced Trees). We will also show an example 
application (a text editor) built using the architectural style proposed here... not only is 
its code extremely simple, it outperforms most similar programs. Don't believe it? 
Come and see!

The bit about it being good for concurrency has me wondering whether it would work for audio.
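
For reference, here's roughly what using it looks like, going by the immer README (a minimal sketch; I haven't checked it against the current version):

```cpp
#include <immer/vector.hpp>
#include <cassert>

int main()
{
    const auto v0 = immer::vector<int>{};   // an empty immutable vector
    const auto v1 = v0.push_back (42);      // "modifying" it produces a new value

    // The old value is untouched and could still be read from another
    // thread without any locks -- that's the structural-sharing promise.
    assert (v0.size() == 0);
    assert (v1.size() == 1 && v1[0] == 42);
}
```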

3 Likes

From what I understand, the implementation relies on shared_ptr, which relies on mutexes, so I am not sure it is suited for audio.

Otherwise, it is a very pleasant paradigm to reason about, like immediate mode gui frameworks (nuklear or imgui).

2 Likes

Hi there. First of all, sorry for not explaining this properly in my talk back in 2015 (I didn’t understand it very well myself back then :wink: )

Yes, as many people said here, std::shared_ptr (including the std::atomic_... functions for it) in C++11/14/17 is atomic, but not lock-free, because it has to store a pointer + a control block with two integers (strong count and weak count). Modern Intel CPUs have double-width CAS instructions. You can tell because, for example, std::atomic<std::complex<double>>::is_always_lock_free == true on my MacBook. DWCAS gives you 128-bit atomic variables.
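
If you want to check this on your own machine, something like this will tell you (C++17; whether it prints 1 depends on the compiler, flags such as -mcx16, and the CPU):

```cpp
#include <atomic>
#include <complex>
#include <iostream>

int main()
{
    // A 16-byte trivially copyable type: only lock-free if the target has a
    // double-width CAS (e.g. cmpxchg16b on x86-64) and the compiler uses it.
    std::cout << std::atomic<std::complex<double>>::is_always_lock_free << '\n';

    // Pointer-sized atomics are lock-free on essentially all mainstream targets.
    std::cout << std::atomic<void*>::is_always_lock_free << '\n';
}
```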

So if you cram the control block into a single 64-bit word, and then the pointer into the other word, and do some implementation heroics (I do not know how to do this correctly, but I know it’s horrendously complicated), you can implement a lock-free shared pointer. This was an active area of research back in 2015, but it seems that now people have figured out how to do this.
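
To illustrate just the layout idea (this is only the packing and the CAS loop, with names I've made up, and emphatically not a correct lock-free shared pointer):

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical 128-bit reference: the object pointer in one word, both
// reference counts packed into the other word, so the whole thing can be
// updated with a single double-width CAS.
struct PackedRef
{
    void*         object;   // pointer to the managed object
    std::uint64_t counts;   // strong count in the low 32 bits, weak count in the high 32
};

std::atomic<PackedRef> ref { PackedRef { nullptr, 0 } };  // lock-free only if DWCAS is available

// Try to take another strong reference; fails if the object is already gone.
bool tryAddStrongRef()
{
    PackedRef expected = ref.load();
    PackedRef desired;

    do
    {
        if ((expected.counts & 0xffffffffu) == 0)  // strong count already zero
            return false;

        desired = expected;
        desired.counts += 1;                       // bump the strong count
    }
    while (! ref.compare_exchange_weak (expected, desired));

    return true;
}
```

The hard part (the "implementation heroics") is everything this sketch leaves out: decrementing, destroying the object exactly once, weak references, and doing all of that without ever touching freed memory.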

Such a lock-free shared pointer will, however, not be std::shared_ptr, but a different type. This type will be introduced in C++20 as a new specialisation std::atomic<std::shared_ptr> (see https://en.cppreference.com/w/cpp/memory/shared_ptr/atomic2) and I expect that the major standard library implementations will ship a lock-free implementation of it (i.e. std::atomic<std::shared_ptr>::is_always_lock_free will hopefully be true on Intel CPUs on libc++/libstdc++/MSVC a year from now).
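
For the curious, the intended usage pattern would be something along these lines (a sketch with a made-up State type, assuming a C++20 library that actually ships the specialisation, and see the caveats further down this thread about whether this is a good idea at all):

```cpp
#include <atomic>
#include <memory>

// Hypothetical immutable state shared between the message thread and the audio thread.
struct State { float gain = 1.0f; };

std::atomic<std::shared_ptr<const State>> currentState { std::make_shared<const State>() };

// Message thread: allocate a new value, then publish it atomically.
void setGain (float newGain)
{
    currentState.store (std::make_shared<const State> (State { newGain }));
}

// Audio thread: grab a snapshot of the current state.
// Caveat: if this snapshot turns out to hold the last reference to an old
// State, that State gets freed here -- on the audio thread.
float getGain()
{
    const std::shared_ptr<const State> snapshot = currentState.load();
    return snapshot->gain;
}
```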

Until then, you’re stuck with what we have – a non-lock-free shared ptr in the standard library (with an atomic, but non-lock-free API) – and various 3rd-party implementations of non-standard lock-free shared ptrs. I know that Just::Thread has a great implementation of one, but it’s commercial software. I’m not aware of any high-quality free implementations (please let me know if they exist!)

Basically, hold tight, everything will be great after C++20 comes out :slight_smile:

8 Likes

Juanpe Bolivar worked at Ableton, so it’s likely his work is relevant to us.

1 Like

I used the library in a JUCE / audio project for some time, and watched the talk. The developer is a nice guy and was very responsive. I was mainly interested in immer::vector and maybe flex_vector.

When I used it, it was still LGPL-licensed and didn’t compile on VS2017 (issue is still open, may be fixable with minor changes). I don’t think it’ll compile in older VS versions, either. I ended up writing my own data structure using chunks of std::vector with the operations I really needed, to avoid an additional dependency and because I need to support Visual Studio.
Bottom line: I recommend trying it out. Check whether its data structures suit your use cases. The immer::vector uses chunks, so cache locality is “slightly worse” than std::vector (measure!), but allows faster push_back etc. You could even use it for your “message-thread-only” state, but then again JUCE provides ValueTree for that, which is well integrated and convenient. Sidenote: I worked a lot with Haskell for ~2 years and am a fan of immutability and map/filter/fold :slight_smile:

Yesterday, after posting in this thread, I had a delightful conversation with standard library implementers on Twitter:

Bottom line:

  • contrary to what I said earlier, we probably won’t get lock-free atomic<shared_ptr> in the standard library;
  • contrary to what I said earlier, we don’t actually know yet whether lock-free shared pointers are implementable at all. Two people claim to have an implementation, but they’re not open source and there are doubts about whether those implementations are correct and conforming;
  • regardless of any of that, and contrary to what I said in my 2015 talk, std::shared_ptr seems to be fundamentally the wrong tool for the use case of sharing objects across threads, of which one is a realtime thread, and the pattern I described is actually fundamentally broken and unsafe.

I need to go off and rethink all this. Expect a blog post from me when I figured it out :slight_smile:

How do you do this in your codebases? How do you share objects across threads, of which one is a realtime processing thread?

I think what we really need in C++ is a [[nosuspend]] attribute we can add to a block to stop the thread scheduler suspending a thread whilst it increments some counters or swaps some atomic pointers. If we had that, I’d be fine with a spin lock guarding this.

I’ve no idea if that’s even implementable though or if we have any control over the thread scheduler in C++.

Something like this would be ripe for abuse as well (but so is a mutex, to be fair).
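
For concreteness, the kind of spin lock I have in mind is only a few lines (a bare-bones test-and-set lock; the point stands that, without something like [[nosuspend]], the thread holding it can still be suspended by the scheduler):

```cpp
#include <atomic>

class SpinLock
{
public:
    void lock() noexcept
    {
        // Busy-wait until the previous holder clears the flag.
        while (locked.exchange (true, std::memory_order_acquire))
            ;
    }

    bool try_lock() noexcept
    {
        return ! locked.exchange (true, std::memory_order_acquire);
    }

    void unlock() noexcept
    {
        locked.store (false, std::memory_order_release);
    }

private:
    std::atomic<bool> locked { false };
};
```

Since it satisfies the usual Lockable requirements, it can be dropped in wherever a mutex would go, e.g. with std::lock_guard or std::scoped_lock.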

No, we don’t. Standard C++ does not have any concept of thread priorities or thread scheduling.

Actually, I think for what you’re describing a good old mutex is better than a spinlock.

Why is that?

The problem with a mutex is that it can result in a system call, which makes it harder to reason about how long it will take.

Spin locks busy-wait, so they'll waste a few cycles spinning, but they don't take any time to resume once unblocked. The problem is if your message thread gets suspended whilst the audio thread is waiting on the spin lock.

I have heard of optimisers being pessimistic about spin-locks for this reason, and hence mutexes outperforming them.

It’s all a bit voodoo here though, and the bottom line with audio is: if you can't guarantee how long it will take, don't do it in the audio thread.

I think we should stop speculating about performance, because the only way to find out is to measure.

That being said, afaik busy waiting is bad because the other thread you’re waiting for might yield, and you might spend a lot of cycles spinning while the thing you’re waiting for doesn’t make any progress.

I’ve actually come to the conclusion that mutexes might not be that bad after all. Yes, it’ll take a few hundred instructions to acquire a mutex, but as long as the critical section that the mutex is guarding is only a few instructions long, you won’t pay more than that acquisition cost afaik. And yes, it would show up in the profiler as a hotspot, but so would a spinlock.
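
Just to be concrete, by "a few instructions long" I mean something like this (hypothetical names; the data is four doubles, too big to update in a single atomic operation):

```cpp
#include <array>
#include <mutex>

// Hypothetical shared coefficients: four doubles, too big to update in one
// atomic operation, so they're guarded by a mutex instead.
std::array<double, 4> filterCoefficients {};
std::mutex coefficientsMutex;

// Writer (message thread): the critical section is just four stores.
void setCoefficients (const std::array<double, 4>& newCoefficients)
{
    std::scoped_lock lock (coefficientsMutex);
    filterCoefficients = newCoefficients;
}

// Reader (audio thread): likewise just four loads while the lock is held.
std::array<double, 4> getCoefficients()
{
    std::scoped_lock lock (coefficientsMutex);
    return filterCoefficients;
}
```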

Or do you have evidence that locking a mutex might result in some much slower system call? If so, where can I read more about this?

But again, I should stop speculating. We need to look at some actual measurements.

Yeah, like I said, it’s not about the cost in CPU cycles, it’s about an indeterminate number of cycles.

The other problem with audio is that you can measure all you like: if it runs for days on your test system and nothing glitches, that's fine, but what about when someone is playing at Wembley Stadium…

I do agree with the measurement though, in Tracktion Engine we have an RAII class called RealtimeCheck which logs if a scope takes more than 1.5ms.

I’m thinking of creating one that uses a much shorter time period, to catch thread sleeps inside short blocks that can't be atomic for size reasons (e.g. 4 doubles), without having to use strace etc.

The problem is that these measurements themselves add overhead…
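
To give an idea of the shape of such a thing (a simplified sketch, not the actual Tracktion Engine code):

```cpp
#include <chrono>
#include <cstdio>

// Simplified sketch of an RAII real-time check: log if the enclosing scope
// took longer than a given budget. The two clock reads are themselves the
// measurement overhead mentioned above.
class ScopedRealtimeCheck
{
public:
    explicit ScopedRealtimeCheck (std::chrono::microseconds budgetToUse)
        : budget (budgetToUse), start (std::chrono::steady_clock::now()) {}

    ~ScopedRealtimeCheck()
    {
        const auto elapsed = std::chrono::duration_cast<std::chrono::microseconds>
                                 (std::chrono::steady_clock::now() - start);

        if (elapsed > budget)
            std::fprintf (stderr, "Scope overran its budget: %lld us\n",
                          static_cast<long long> (elapsed.count()));
    }

private:
    std::chrono::microseconds budget;
    std::chrono::steady_clock::time_point start;
};

// Usage inside an audio callback:
//     ScopedRealtimeCheck check { std::chrono::microseconds (1500) };  // 1.5 ms budget
```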

Expect a blog post from me when I figured it out

Where? I'd love to read that.

Why does acquiring a mutex (assuming it’s not currently locked) take an indeterminate number of cycles?

It’ll be on https://timur.audio.

1 Like

Ok, it’s not the acquiring that takes an indeterminate number of cycles; it’s that you don't know who else could be holding it, and whether that other thread is currently suspended or not.

At that point you’re at the mercy of the thread scheduler, and I don’t think anything guarantees that if you request a mutex lock, the current holder will be woken up and release it within X number of cycles. You just don’t know how long this can take, and if it takes more than a millisecond or two you’ll have missed your audio buffer.

That makes sense.

If the code under lock is just a couple of lines (and let's say that code itself is real-time-safe), then the probability of the situation you're describing will be extremely low.

But iiuc, you’re saying that that’s not quite good enough, because even if the probability is low that the scheduler will suspend that code in the middle of those couple of lines, it’s still technically possible and thus the runtime is indeterminate.

And that makes sense.

But don’t you have exactly the same problem if you use a spinlock instead of a mutex? Or, in fact, any other thread synchronisation mechanism in existence?

Taking this thought even further, afaik you don’t have any guarantee that the audio thread itself won’t be suspended. I mean, on many systems there will be some mechanism giving higher priority to the processing thread, but at the end of the day it’s “just” a regular thread. None of this is really real-time anyway. Windows/macOS/iOS/Linux/Android aren’t real-time operating systems. iiuc, on those operating systems there is fundamentally no such thing as a deterministic number of cycles after which any given function will finish. Some OSes do a good job. But it’s not deterministic in the strict sense of the word (i.e. not like a proper real-time system).

Yes, the spin lock comment was a red herring. I was mainly speculating that if you're only guarding a couple of variables, the relative cost of a busy wait could be lower than the cost of a mutex lock, provided neither thread gets suspended whilst they're in contention.

But they both suffer from these problems.

Hence my suggestion of a [[nosuspend]] to indicate that this section really shouldn't be suspended. But that probably isn't even implementable, and it's out of scope for C++ anyway. For that I think you need some kind of real-time kernel trickery.

Exactly right.

We had this discussion briefly in Kona when we discussed std::audio – whether the standard should say anything about "real time" or "deterministic execution time" or those things. The overwhelming consensus was that we don't want to go there. The current systems running audio apps aren't real-time systems, and C++ is not a real-time language.

That’s true. I think the audio thread is some "special" kind of thread though. I think it has a higher priority than any thread you can create, and the OSes might try a lot harder not to suspend it. Obviously, if you just sit there and block it, the OS will eventually suspend it, though.

Again, this is getting a bit beyond my understanding of OSes and thread schedulers though.