If you disable priority inheritance (PI), you’ll get a deadlock on a realtime kernel.
The basic issue PI is trying to solve is this one:
Thread A with priority Low takes a mutex.
Thread B with priority High becomes ready and preempts Thread A (on an RT kernel).
Thread B tries to take the same mutex and fails to do so (since Thread A holds it).
The kernel scheduler kicks in, but since Thread B has priority High, it gets re-elected, and the computer is dead, stuck in an infinite loop.
PI solves this because when Thread B tries to lock the mutex, Thread B’s priority is transferred to the mutex, which in turn transfers it to the thread holding the mutex (Thread A). So when the scheduler runs, Thread A has a temporary “High” priority, gets elected, and can release the mutex.
On a non-RT kernel, a high-priority thread can be interrupted by a low-priority thread (I think the default rule is that 95% of the time is spent on the high-priority thread), so the issue above will resolve itself, given enough time.
PI adds some overhead to mutexes and condition variables, yet it works on both RT and non-RT kernels, so it should definitely be enabled.
It’s not possible to design working code that uses mutexes with realtime-priority threads on Linux without PI enabled, so if you are about to remove it, you need to forbid changing the priority of a thread (and this means a lot of changes in the code, especially in the ALSA code, which runs with the highest priority).
I’m sorry to be rude, but if you guys want to work with an RT kernel, you must know enough to be able to debug RT-specific issues.
RT means that a lot of the hacks in a typical non-RT kernel are disabled, and you’ll hit the issues that those hacks typically hide (like the PI issue, which is hidden by the hack of not giving 100% of the CPU to the highest-priority thread).
Debugging RT software means having a kdb console attached to the RT computer so that when it locks up, you can get in remotely and figure out which thread is doing what.
If you don’t want to deal with such hassle, you can run on a non-RT kernel, but disable some of the non-RT hacks.
Here’s a list of the settings to change on a Linux kernel to make it act like a real RT scheduler:
# Don't let CPU clock scaling break your timing routines (run as root)
- for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo performance > "$g"; done
# Don't reserve any CPU time for non-RT tasks (set the RT runtime equal to the RT period)
- cat /proc/sys/kernel/sched_rt_period_us > /proc/sys/kernel/sched_rt_runtime_us
I’m not sure I understand. “Normal” juce applications, such as Introjucer, do not make use of any realtime-priority threads, correct? In fact, nowhere in the juce code are SCHED_RR or SCHED_FIFO threads used, as far as I know. So why are there all these issues with rt-kernels that end up in a system freeze? I thought the “-rt” kernels were almost the same as the non “-rt” kernels, and that if there was a difference, it only showed when realtime-priority threads were involved, which is not the case with Introjucer.
No, you’re missing the part in setThreadPriority.
If you use the maximum priority, the thread is set to SCHED_RR (see juce_posix_SharedCode.h).
The ALSA thread is using such a priority, and since the error reported here is about sound code and PI, I wonder whether this is somehow linked.
I don’t know the sound code well enough to locate the issue, but I’m 100% sure the (1.53) audio code runs correctly on an RT kernel, since I’m using it in a product I’ve written.
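To make the SCHED_RR point concrete, here is a rough sketch of how a thread can be moved to the round-robin realtime class at maximum priority with plain pthreads. This is not the code from juce_posix_SharedCode.h; the helper name make_thread_realtime is hypothetical, and the call normally fails with EPERM unless the process has realtime privileges.

```c
#include <errno.h>
#include <pthread.h>
#include <sched.h>

/* Hypothetical helper (not JUCE's implementation): switch `thread` to the
   SCHED_RR realtime policy at the highest priority the system allows.
   Returns 0 on success, or an error code (typically EPERM when the
   process lacks realtime privileges). */
int make_thread_realtime(pthread_t thread)
{
    struct sched_param param;
    int max_prio = sched_get_priority_max(SCHED_RR);
    if (max_prio == -1)
        return errno;

    param.sched_priority = max_prio;
    return pthread_setschedparam(thread, SCHED_RR, &param);
}
```

Once a thread runs under SCHED_RR, any mutex it shares with normal-priority threads is exposed to the inversion scenario described earlier in the thread, which is why the mutex protocol matters.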
There is nothing in the post stating that the Introjucer is having any issues.
I’m working with the developer of Xenomai at my office, and he knows a lot about RT kernels since he wrote one. The advice I’m giving above comes from him, and it proved useful for locating RT issues when I developed my product.
Just to clarify, this condition exists with ALL juce applications, Introjucer, example code, everything.
Also, I don’t think it is a requirement to be able to debug kernel issues for running a realtime kernel. If you are using any serious music software under Linux, Pianoteq, jack, ffado etc, the realtime kernel is the recommended option. Most apps run just fine with this kernel. To use ffado and jack with a firewire audio device reliably, the rt kernel is the only solution.
I can confirm that Introjucer definitely also crashes the system. My small test App also has no setPriority calls and also crashes.
The fix suggested by falkTX earlier does work though (Thanks!), so if you intend to research it further that would be a good place to start looking. For now I’m just going to go with that suggestion, and will suggest Pianoteq adopt that too.
Can you post the crash stack trace so we can try to guess where it crashes?
What is your exact kernel version?
Did you try the suggestion from Jules?
What’s running on the computer at the same time?
Can you try to run the same software in “init 1” mode (minimal single-user mode: run “init 1”, then “Xorg &”, then “Introjucer”) to check whether it still crashes?
As it is written “kernel bug”, I think one can hardly accuse juce of being buggy here. What I notice is that the patch suggested by falkTX reverts the PTHREAD_PRIO_INHERIT attribute on the mutexes used with pthread condition variables (in juce::WaitableEvent), but it does not revert it for the “normal” mutexes of juce::CriticalSection. Maybe it is a bug specific to condition variables.
Anyway, since that patch seems to fix all issues related to this kernel bug, I would suggest that Jules applies it.
UPDATE: well… I have one user saying that he still gets some freezes of the application UI (not the whole OS), so maybe that patch is not the silver bullet.
No, PI is required. If you remove it, you MUST forbid RT threads (because as soon as an RT thread takes a mutex, the whole computer is dead).
And if you forbid RT threads, the audio code suddenly drops samples depending on the CPU usage (and I’m not even speaking about video here, where it’s even worse).
A kernel bug is likely not due to the application (well, sort of), so there is nothing you can do in the application. You should debug your kernel instead.
I can only help you debug the kernel. So, if you have the source of your kernel (you likely do), take a look at kernel/rtmutex.c:724.
You’ll find a line “BUG_ON(some condition)”.
Then search for this line on Google; it’s likely other users have hit that bug, and there is probably already a fix for it.
Also, if you have debug information in your Juce software, use addr2line to find out the file & line source code where the kernel crashed:
However, to test, I changed this to a WARN_ON call instead (recompiled everything, rebooted, etc.), and now it crashes at line 472 (coincidentally a transposition of the same digits; I had to double-check that). This is now the result of another BUG_ON statement:
BUG_ON(rt_mutex_real_waiter(task->pi_blocked_on)); in the task_blocks_on_rt_mutex function.
I remain convinced that this is a fault in Juce somewhere, but am at a loss on how to progress.
Yes, that’s the case.
This behaviour is forbidden unless the mutex is set to be recursive.
But in the Juce code, you have “pthread_mutexattr_settype (&atts, PTHREAD_MUTEX_RECURSIVE);” (in juce_posix_SharedCode.h).
Usually, people using PI don’t use recursive mutexes at the same time, but Juce does.
So, this code path is probably not tested that much.
Anyway, have you tried the addr2line call I’ve written above, so we can figure out the position in the Juce code that’s causing the issue?