AudioParameter thread safety


#1

I am looking at the AudioParameterFloat and AudioParameterInt classes in JUCE and I fail to see how they are thread safe. It appears that setValue sets the value and the audio thread reads it without any sort of FIFO or synchronisation, even though the documentation of their base class AudioProcessorParameter says that getValue and setValue should be “thread aware”. Is it safe to use these classes as-is, setting them on the UI thread and getting them on the processing thread, as is done in the AudioPlugin examples?


#2

The reason they’re done the way they are in the samples is to make the code more readable for beginners. From what I understand it’s technically alright on Intel platforms, because 32-bit reads/writes are always atomic there, so in almost all cases it’s “safe enough” for example code.

How you decide to make them satisfactorily thread safe is up to you - there has been debate recently on the best way to achieve inter-thread communication of parameters, including double buffering with an intermediate audio processing thread, wrapping parameters in lock-free atomics, or just ignoring the issue on Intel platforms. Someone else with more plugin experience can probably give a better/more definitive answer for real-world practices.


#3

No, it’s not even technically correct - it’s undefined behaviour (a race condition) from the C++ compiler’s point of view, and it can and will break code! Only if you’re writing pure assembly are you technically alright on Intel platforms.

Linking this again, in the hope that it will get fixed.

https://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong


#4

There are two aspects to this problem that are often confused here: data synchronisation and data integrity. Roughly, data synchronisation ensures that when one thread writes a value to a shared variable, the other thread will “see” the new value when reading that same variable. @Mayae is absolutely right that without std::atomic the optimiser may move loads/stores around or keep copies of variables, thus destroying the guarantee of data synchronisation. In fact, I even think @Mayae’s statement on pure assembly is wrong:

I’m not a hardware CPU expert, but on a multi-core CPU, I believe that the level-1 cache is not shared with other cores so the exact moment another core “sees” the new values is somewhat undefined.

But then there is the other issue, around data integrity: that data is never in a torn state. And here @jonathonracz is right in saying that on x86 platforms a single-precision floating point value will never be in an in-between state, i.e. half the old value and half the new value, resulting in some funky floating point value. This will also hold under C++ optimisation (although the C++ standard does not guarantee it!): it would make no sense for the C++ optimiser to break a single load/store of 32 bits down into several loads/stores of smaller widths - say 16 bits (unless maybe you are using a union) - and I’ve never seen a compiler do this.

For many simple audio processors, the exact moment in time when the audio thread “sees” the new parameter value from the GUI is somewhat irrelevant as long as the audio thread “sees” the new parameter value soon enough. And again, from a practical standpoint, and for all relevant x86 CPUs, we can be nearly certain that the value will have been synched within the next audio callback [1].

However, if your audio processor relies on a specific order in which audio parameters are changed, then the current JUCE code will not guarantee this order. Also, if you copy an audio parameter to a local variable (for example to avoid audible tearing in your process callback), then you cannot rely on that local variable being constant throughout the process callback: the optimiser may replace the local variable with loads of the actual value, which can change its value during the process callback.

In most cases this is OK for audio processors as the non-audio thread is the GUI thread. The exact moment in time when a parameter was changed is super inaccurate anyway. Parameter automation data will be delivered via the audio thread, so parameter synchronisation is not an issue here.

Also, note that using std::atomic when not really necessary can be a performance hit. Some compilers will generate a sync instruction every time an atomic variable is read from or written to, forcing the cache to be flushed which can take up valuable time.

[1] This is because the C++ optimiser can only move around load and stores within the plug-in executable/DLL that you are generating. Therefore, any changes to variables must have been written to/read from memory once any plug-in entry point (such as the process, setParameter, getParameter, … callbacks) has finished executing.


#5

[quote]There are two aspects to this problem that are often confused here: data synchronisation and data integrity. Roughly, data synchronisation ensures that when one thread writes a value to a shared variable, the other thread will “see” the new value when reading that same variable. [/quote]There is no confusion: data integrity is not really the subject being discussed here - although that is of course also a problem, as you mention, when you don’t use atomics! What we’re discussing here is undefined behaviour from the compiler’s point of view.

[quote=“fabian, post:4, topic:21097, full:true”]@Mayae is absolutely right that without std::atomic the optimiser may move load/stores around or keep copies of variables thus destroying the guarantee of data synchronisation. In fact, I even think @Mayae statement on pure assembly is wrong:

I’m not a hardware CPU expert, but on a multi-core CPU, I believe that the level-1 cache is not shared with other cores so the exact moment another core “sees” the new values is somewhat undefined.[/quote]No, that is wrong! You can never have two valid copies of a value - the technical term for this problem is cache coherency, and every CPU relevant to us guarantees it, even across multiple processors. In fact, I know of no common processor where this is not the case. Here’s a very interesting read on why:

http://acg.cis.upenn.edu/papers/cacm12_why_coherence.pdf

However the C++ standard allows non-cache coherency even for atomics, saying that data should be visible “within a reasonable amount of time”.

Next up is memory ordering - which only describes the order of operations. All x86 hardware is guaranteed strongly ordered, which means all stores and loads happen with acquire/release semantics by default. ARM processors, conversely, are weakly ordered, IIRC, which means the ordering of your operations can now be completely random - a ticking time bomb if you do not account for it! This is the equivalent of std::memory_order_relaxed.

This stuff is important to get right, especially for multithreaded library implementations! Because the C++ standard memory model is additionally also weakly ordered!

[quote]For many simple audio processors, the exact moment in time when the audio thread “sees” the new parameter value from the GUI is somewhat irrelevant as long as the audio thread “sees” the new parameter value soon enough. And again, from a practical standpoint, and for all relevant x86 CPUs, we can be nearly certain that the value will have been synched within the next audio callback[/quote]As I’ve just stated, from the x86 hardware’s point of view you are guaranteed this atomically: the audio thread sees the value the exact moment the UI thread stores it! But from the compiler’s perspective, it is free to never reload the value at all - even for the lifetime of the program (hello, infinite loop!): with C++11 it is allowed to reason that nothing else changes the value, because anything else changing it would be undefined behaviour, since the data type is not declared atomic.

These sorts of things are also why people get surprised about what tricks compilers pull on signed overflow - there are pages upon pages of such questions on Stack Overflow. Unfortunately for us, memory ordering and concurrency bugs are highly nondeterministic and near impossible to trace, and they will often only reproduce on some customer’s specific computer setup.

[quote]However, if your audio processor relies on a specific order in which audio parameters are changed then the current JUCE code will not guarantee this order.[/quote]This should be documented!

[quote]Also, if you copy an audio parameter to a local variable (for example to avoid audible tearing in your process callback) then you cannot rely on that local variable being constant throughout the process callback: the optimiser may replace the local variable with loads to the actual value which can change it’s value during the process callback.[/quote]Yes but that’s true only for non-atomic values, all the more reason for using atomic variables!

[quote]Also, note that using std::atomic when not really necessary can be a performance hit. Some compilers will generate a sync instruction every time an atomic variable is read from or written to, forcing the cache to be flushed which can take up valuable time. [/quote]You may need to update your compilers. As said, acquire/release on x86 is already the default behaviour, so there’s no overhead anyway. On ARM and others, however, there is - if you need acquire/release semantics - otherwise, just use std::memory_order_relaxed which enjoys the natural performance of the processor - but crucially still informs the compiler that the value can be changed from the outside! If this affects your program performance-wise, it was erroneously written to begin with. Even if so, it probably doesn’t compare to instruction cache misses when you traverse multiple virtual function calls to obtain the parameter value in the first place.

I really don’t know why we’re discussing whether or not this undefined behaviour is okay - of course it isn’t - I really think you should fix this! Yes, data tearing and memory ordering are hardware side-effects but that’s not really the important point here: When you’re allowing undefined behaviour (and especially when you’re not giving the compiler vital information about the data types), the compiler can do all sorts of crazy stuff (as mentioned above in the link), and of course eventually format your harddrive.

Note that the compiler and the standard have no notion of ‘entry points’, so even this is not guaranteed - the compiler is free to move loads and stores even over function boundaries, as long as it’s covered under the ‘as-if’ optimization rule.


#6

Is this true now? Last time I looked I got an MFENCE for a default atomic operation which was slow.

However checking on godbolt - the following code does not give me an MFENCE with Clang (but does with GCC):

#include <atomic>
std::atomic<int> q; 
void f(int r)
{
  q.store(r);
}

Is load/store using atomics with relaxed memory ordering adequate to ensure that the compiler actually writes/reads a variable - or do we need release/acquire semantics to guarantee correct operation here? I can’t find the right bit in the C++ standard right now - but I note that this:

#include <atomic>
std::atomic<int> q;
int r;

void z(int p)
{
    for (int i = 0; i < 10; ++i)
        q.store(i, std::memory_order_relaxed);

    for (int i = 0; i < 10; ++i)
        r = i;
}

Compiles to this:

z(int):
        xor     eax, eax
.L2:
        mov     DWORD PTR q[rip], eax
        add     eax, 1
        cmp     eax, 10
        jne     .L2
        mov     DWORD PTR r[rip], 9
        ret

In which we can see that even with relaxed memory ordering every single value is written to the atomic variable (L2 loop) whereas the compiler optimises like fuck for the second loop writing to a plain int and just shoves a 9 in there… :slight_smile:


#7

Presumably this is ‘permitted’ but in a reasonably complex program it’s very unlikely that the variable is not written to memory.

However I don’t understand enough to be able to argue why that’s the case…? I’ve seen it happen (and break stuff) in tight loops where variables weren’t being updated but never the disaster scenario you’re talking about here.

Maybe the optimiser just isn’t good enough? :slight_smile:


#8

It strikes me, reading this thread again, and assuming we are targeting Intel - that:

  • It’s safe to use bare floats (i.e. unlikely to crash) on Intel
  • It’s not the right solution - the optimisation problem Mayae points out is a possible issue
  • std::atomic with relaxed ordering is more correct
  • std::atomic with relaxed ordering incurs no unwanted time penalty
  • std::atomic also communicates the intent of the code more clearly

It’s a win all round …?

Maybe there’s some reason why acquire/release is important for these parameter operations - but I can’t see why it is?


#9

[quote=“jimc, post:6, topic:21097”]
Is this true now? Last time I looked I got an MFENCE for a default atomic operation which was slow. [/quote]
Yes (you need optimization on, of course). Here you can see an atomic release store is exactly equivalent to a normal store:
https://godbolt.org/g/jRdeJ9

Even though acquire/release actually provides more guarantees than, say, std::memory_order_relaxed.

[quote]Is load/store using atomics with relaxed memory ordering adequate to ensure that the compiler actually writes/reads a variable - or do we need release/acquire semantics to guarantee correct operation here? I can’t find the right bit in the C++ standard right now[/quote]Reading an atomic variable implies a separate transaction (writing to an atomic variable is also a transaction), so yes, by inference the compiler is required to read the value each time.

[quote]In which we can see that even with relaxed memory ordering every single value is written to the atomic variable (L2 loop) whereas the compiler optimises like fuck for the second loop writing to a plain int and just shoves a 9 in there… :slight_smile:
[/quote]The compiler is required to do so. If another thread relied on the value r switching (a reasonable assumption) your program might now have a deadlock.

[quote=“jimc, post:7, topic:21097”]
Presumably this is ‘permitted’ but in a reasonably complex program it’s very unlikely that the variable is not written to memory.

However I don’t understand enough to be able to argue why that’s the case…? I’ve seen it happen (and break stuff) in tight loops where variables weren’t being updated but never the disaster scenario you’re talking about here.

Maybe the optimiser just isn’t good enough? :slight_smile:
[/quote]It most likely doesn’t happen in practice because of the side effects of register spilling: since there are only so many registers on x86, values are constantly spilled to and reloaded from memory, which saves you, because x86’s strong ordering takes care of everything behind the scenes. It is a totally different scenario on ARM, and especially on Itanium, which has 128 registers!

I summon the apocalypse here because I’ve debugged my fair share of code written by people not considering memory ordering (usually code written on x86 that then gets deployed on all sorts of different processors). These are the hardest, most nondeterministic bugs, almost always impossible to trace to the source, and seeing how low the cost of doing it the right way is, I always advocate loudly for doing so.

The most usual problems are nearly always of this form:

struct a
{
    a * first;
    a * second;
    bool ready;
};

a root;

void do_something(a &);

void thread_a()
{
    while(!root.ready) {}
    do_something(*root.first);
    do_something(*root.second);	
}

void thread_b()
{
    root.first = new a();
    root.second = new a();
    root.ready = true;
}

If any reader is not aware, the compiler is free to do this:

void thread_a()
{
    a &a1 = *root.first, &a2 = *root.second; // << compile-time reordering and/or CPU instruction reordering
    while(!root.ready) {}
    do_something(a1);
    do_something(a2);
}


void thread_b()
{
    root.ready = true; // << compile-time reordering and/or CPU instruction reordering
    root.first = new a();
    root.second = new a();
}

You will need atomics with at least acquire/release semantics for the above not to happen. Using only relaxed memory ordering, you will only ensure the program is not actually undefined behaviour.

edit: See response further below for explanation of how to fix above code.

That code snippet is also why the double-checked locking pattern was broken in C++, C and Java until recently. For an example relating to audio parameters? Just consider a delay line where a combination of parameters can alter the delay index. You may say people should “not program like that” - well then, you should at least document it, seeing as the language allows the behaviour and precisely defines what can be reasoned about in multithreaded programs. But all bets are off in the case of undefined behaviour.

[quote=“jimc, post:8, topic:21097, full:true”]It’s safe to use bare floats (i.e. unlikely to crash) on Intel[/quote]Unfortunately not - for IEEE floats it’s even worse as they have trap-representations that may occur through spilling or tearing!

[quote]It’s a win all round …?[/quote]Not only that; with the aggressive compilers of today it is so important to avoid undefined behaviour. I recommend watching this fascinating talk that steps through the code passes and transformations LLVM does when facing undefined behaviour:

It’s especially interesting to note the opening line: any sort of advanced optimization is indistinguishable from magic. It’s not that compiler writers like to abuse undefined behaviour; it’s just that it’s near impossible to distinguish when you’re deep down in optimization passes and have done a dozen transformation passes.

[quote]Maybe there’s some reason why acquire/release is important for these parameter operations - but I can’t see why it is?
[/quote]See beforementioned example.


#10

Hi @Mayae,

Thanks for your insights. Great post! I just learned a lot about atomics and memory ordering!

I am not arguing at all against std::atomics. That’s why I mentioned some examples of how things go wrong when you don’t use std::atomics. We also discussed yesterday in the JUCE team that we will be adding atomics to places in JUCE where necessary - also to AudioParameter. I’m also glad to get some clarification on performance of std::atomics in state-of-the-art compilers - last I checked compilers did create the sync instruction (even when optimisation was enabled) but that was a few years back.

I’m just also making the point that no-one needs to panic just yet when AudioParameter is not using atomics, as in practice, it’s very unlikely that things will go wrong (apart from the examples that I mentioned) and I’d even argue that it’s impossible when you are building an audio plug-in.

Let me clarify that last point again: an audio plug-in is a shared library and the compiler is technically unable to optimise beyond the entry points of the plug-in (how should it? - it doesn’t have access to the host program’s source code or binary). Therefore, any load/stores that have been optimised away or moved around will need to be inserted somewhere before any entry-point finishes. I know that the C++ standard does not mention this - so technically the world could explode, but I’m thinking of how all compilers currently work in practice.


#11

I thought you were talking about the default operation which is the sequential consistency ordering, which for a store introduces the super-slow MFENCE: https://godbolt.org/g/nl8ipl - as I think does the JUCE::Atomic<> class which is a shame, because most of the time it’s probably not what’s needed.

Explain this one? I’m surprised it’s possible for an IEEE 32-bit float, on Intel, to tear? And I need to know what spilling is - because I don’t :slight_smile:

Going to read the rest of your post later - need to get some stuff completed this morning before I get further into this.


#12

I don’t think the problem is what happens beyond the entry points of the plugin?

There’s a chain of function calls when the parameter is modified resulting in a function call to the host API.

I think the ambiguity is just between the threads in the plugin code?


#13

Yes, but I’m trying to argue that the compiler, for example, would not replace a load to a parameter with some kind of endless loop because it cannot reason on how the process callback is invoked (for example that it is invoked in a loop). It therefore must reload the parameter value somewhere in the audio process callback.


#14

Ok, agreed. Sounds like it must be true :slight_smile:


#15

No problem and I’m happy as long as it’s being fixed :slight_smile: I agree we don’t need to panic, but we also should not neglect it.

[quote=“fabian, post:10, topic:21097”]
I’d even argue that it’s impossible when you are building an audio plug-in.
[/quote]Let’s not go there :wink: as said I’ve fixed and maintained such bugs in audio plugins.

And for the last part, remember that the CPU can reorder and defer instructions as well. If we go into technicalities, the compiler is of course also allowed to never reload any such value on a thread where the mentioned value is not annotated atomic (if the platform is suitable, it may for instance keep a copy of the variable in TLS, which may be faster). Obviously this doesn’t happen now for x86, but given how cores and concurrency are moving ahead, it probably is the future.

[quote=“jimc, post:11, topic:21097”]
I thought you were talking about the default operation which is the sequential consistency ordering, which for a store introduces the super-slow MFENCE: https://godbolt.org/g/nl8ipl - as I think does the JUCE::Atomic<> class which is a shame, because most of the time it’s probably not what’s needed.
[/quote]Ah, the default operation does enforce sequential consistency (as it should) yes :slight_smile: That’s why you have the ability to define the exact ordering of your operations. I wouldn’t want it to work any other way - the default should always be the safest option (principle of least surprise)… ?

[quote=“jimc, post:11, topic:21097”]
Explain this one? I’m surprised it’s possible for an IEEE 32-bit float, on Intel, to tear? And I need to know what spilling is - because I don’t :slight_smile:
[/quote]x86 only guarantees atomicity for correctly aligned stores & reads, so that’s one way. Another way is exactly through spilling, as demonstrated with code examples in my previous link, Benign data races: What could possibly go wrong? Spilling is the act of dumping register values back to memory to make space for new temporary computations. The memory spilled to is often something recently used (i.e. in the cache), for instance the stack or some recently read memory location.

To put it simply, the compiler may choose to temporarily store some byte/integer value from a register into the memory location of your audio parameter float. This is a reasonable optimization. Because you didn’t annotate your parameter as atomic, the compiler relies on the fact that no other thread will ever read that value concurrently. In fact, it is completely allowed to never restore the contents of the parameter memory location under the as-if rule.

[quote=“fabian, post:13, topic:21097, full:true”]
Yes, but I’m trying to argue that the compiler, for example, would not replace a load to a parameter with some kind of endless loop because it cannot reason on how the process callback is invoked (for example that it is invoked in a loop). It therefore must reload the parameter value somewhere in the audio process callback.
[/quote]It is completely allowed to do this. The compiler doesn’t reason about DLL invocations etc. (something like this is not covered in the standard), however just think about it: If the variable isn’t atomic, and the current thread doesn’t change the variable, the compiler can and will assume that the value doesn’t change. It is easy to see how this can constitute an endless loop.


#16

So what’s the easiest way to fix that particular problem? Make every variable an atomic? Could you please “fix” your example?


#17

The sentence immediately following the code snippet says:

[quote]However, since there are actually more optimizations possible I will quickly walk through it…[/quote]

So to make the above code work, you only need to do this, and it compiles and works out of the box:

struct a
{
    std::atomic<a *> first;
    std::atomic<a *> second;
    std::atomic<bool> ready;
};

However, this actually enforces a much stricter memory ordering (sequential consistency) than what we need. We only care that the first and second fields are committed before we set the ready field; whether first and second are reordered relative to each other doesn’t matter. Conversely, when we read, we are guaranteed that the first and second fields are valid once we have read the ready flag, so again we don’t care that they may be reordered among themselves, because we have established a happens-before relationship (this keyword may assist in googling). The fixed and optimized code thus looks like this:

void thread_a()
{
    while(!root.ready.load(std::memory_order_acquire)) {} // <- check whether transaction is done
    do_something(*root.first.load(std::memory_order_relaxed));
    do_something(*root.second.load(std::memory_order_relaxed));	
}

void thread_b()
{
    root.first.store(new a(), std::memory_order_relaxed);
    root.second.store(new a(), std::memory_order_relaxed);
    root.ready.store(true, std::memory_order_release); // <- signal the transaction is done, and everything else is valid
}

We can see that the release and acquire operations define the boundaries of the transaction, ensuring everything that happens before and after is synchronized/committed at that point. This is called a memory fence or a memory barrier. It is a valuable optimization to control this, because relaxed memory orderings have no overhead for the CPU, compared to acquire/release/seq_cst which may have overhead - although, as said, even acquire/release has none on x86.

As we see, the proportion of strong synchronization to weak synchronization is often small, so this can give huge gains while writing lock-free code.


#18

Wouldn’t it be enough to only make the ready flag atomic?

Because of the “Release-Acquire ordering”, “all memory writes (non-atomic and relaxed atomic) that happened-before the atomic store from the point of view of thread A, become visible side-effects in thread B, that is, once the atomic load is completed.”
Example: http://en.cppreference.com/w/cpp/atomic/memory_order#Release-Acquire_ordering
(in the example “int data” is a non-atomic variable, and the assert “never fires” - off-topic: can somebody enlighten me why in this example the ptr is not initialized with a nullptr?)

Also, the JUCE library uses the “volatile” keyword a few times (I think for the same reason), which is obviously unsuitable for modern threading techniques (for example in juce_win32_Midi.cpp:144: bool volatile isStarted;), so just replacing these volatiles with std::atomic should fully prevent potential threading issues, I guess…?


#19

I wasn’t actually aware the standard guarantees that, so yes, you’re correct! Of course, it only works if you consistently follow this model, i.e. you may never access these fields without the memory barriers anyway. Since the relaxed atomic operation is free for all intents and purposes (as long as std::atomic::is_lock_free() is true), there will be no difference in the generated code.