Juce plugins cause realtime kernel lockup

Me and Colin (from Loomer plugins) have been discussing this for a while, and he found the problematic juce commit:
http://juce.git.sourceforge.net/git/gitweb.cgi?p=juce/juce;a=blobdiff;f=src/native/common/juce_posix_SharedCode.h;h=7ca97a16ab09169b14ae07c59873af66dc500cd3;hp=73124dd9ac69d54f51779bbfe0ffc84ab6e9b9e5;hb=cc45ec88f5b9c56d18081707ee191b476b44ff68;hpb=1f21a9475399fd8954be714560c83a36c06d309e

In my juce code I’m using this workaround for now:
http://distrho.git.sourceforge.net/git/gitweb.cgi?p=distrho/distrho;a=commitdiff;h=d8829e6dd0621b9361ed624afba965cd7f53fc2d#patch1
It’s not very pretty (reverting to old code if linux), but it makes the kernel hangups go away.

Looks like this:
http://lists.freebsd.org/pipermail/freebsd-current/2009-March/004301.html

Seems to be a kernel bug, though?

If you disable priority inheritance (PI), you’ll get a deadlock on a realtime kernel.
The basic issue PI is trying to solve is this one:
Thread A, with Low priority, takes a mutex.
Thread B, with High priority, becomes ready and preempts Thread A (on an RT kernel).
Thread B tries to take the same mutex and fails (since Thread A holds it).
The kernel scheduler kicks in, but since Thread B has High priority, it gets re-elected, and the computer is dead, running an infinite loop.

PI solves this because when Thread B tries to lock the mutex, Thread B’s priority is transferred to the mutex, which in turn is transferred to the thread holding the mutex (Thread A). So when scheduling happens, Thread A has a temporary “High” priority, so it’s elected and can release the mutex.
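
For anyone following along, here is a minimal sketch of how a mutex is asked to use priority inheritance through the plain pthread API. The helper name initPriorityInheritMutex is made up for illustration; this is not the Juce code, just the standard calls it would be built on.

```cpp
#include <pthread.h>

// Hypothetical helper (not from Juce): initialise `mutex` so that a thread
// holding it temporarily inherits the priority of the highest-priority waiter.
// Returns 0 on success, or the pthread error code of the first call that failed.
int initPriorityInheritMutex (pthread_mutex_t* mutex)
{
    pthread_mutexattr_t attr;
    int err = pthread_mutexattr_init (&attr);

    if (err != 0)
        return err;

    // Ask for the priority-inheritance protocol on this mutex.
    err = pthread_mutexattr_setprotocol (&attr, PTHREAD_PRIO_INHERIT);

    if (err == 0)
        err = pthread_mutex_init (mutex, &attr);

    // The attribute object is no longer needed once the mutex exists.
    pthread_mutexattr_destroy (&attr);
    return err;
}
```

A caller would check the return value and fall back to a default mutex (or report the error) if the platform refuses the protocol.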

On a non-RT kernel, a high-priority thread can be interrupted by a low-priority thread (I think the default rule is that 95% of the time is spent on the high-priority thread), so the issue above will resolve itself, given enough time.

So… should my events not be using the PTHREAD_PRIO_INHERIT flag?

TBH I don’t remember why I added it - I think it must have been a suggestion from somebody else, though I can’t find any emails or posts that mention it (?)

No, I think it’s the opposite.

PI adds some overhead to the mutexes & conditions, yet it works on both RT and non-RT kernels, so it should definitely be enabled.
It’s not possible to design working code that uses mutexes with realtime-priority threads without PI enabled on Linux, so if you are about to remove it, you need to forbid changing the priority of a thread (and this means a lot of changes in the code, especially in the ALSA code, which runs with the highest priority).

I’m sorry to be rude, but if you guys want to work with a RT kernel, you must know enough to be able to debug RT specific issues.
RT means that a lot of the RT hacks in a typical non-RT kernel are disabled, and you’ll hit the issues that those hacks typically hide (like the PI issue, which is hidden by the hack of not giving 100% of the CPU to the highest-priority thread).
Debugging RT software means having a kdb console attached to the RT computer so when it locks, you can get in remotely and figure out what thread is doing what.
If you don’t want to deal with such hassle, you can run on a non RT kernel, but disable some non RT hacks.
Here’s a list of the options to change in a Linux kernel to let it act like a real RT scheduler:

# Run as root. Don't let the CPU clock scaling break your timing routines
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo performance > "$g"; done
# Give RT threads the whole scheduling period (no CPU time reserved for non-RT tasks)
cat /proc/sys/kernel/sched_rt_period_us > /proc/sys/kernel/sched_rt_runtime_us

Thanks Cyril! (Linux has always been one of my weaker subjects!)

I’m not sure I understand. “Normal” juce applications, such as Introjucer, do not make use of any real-time priority threads, correct? In fact, nowhere in juce code are SCHED_RR or SCHED_FIFO threads used, as far as I know. So why are there all these issues with rt-kernels that end up with a system freeze? I thought the “-rt” kernels were almost the same as the non “-rt” kernels, and that anyway, if there was a difference, it was only when realtime priority threads were involved, which is not the case with Introjucer.

No, you’re missing the part in the setThreadPriority.
If you use the maximum priority, it’s set to SCHED_RR (see juce_posix_SharedCode.h )
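
To make that concrete, here is a rough sketch of what such a mapping could look like. It assumes a 0..10 priority scale where only the maximum value is promoted to SCHED_RR, as described above; both function names are hypothetical, and this is not a copy of what juce_posix_SharedCode.h actually does.

```cpp
#include <pthread.h>
#include <sched.h>

// Hypothetical helper: pick a scheduling policy for a 0..10 priority,
// assuming only the maximum (10) is mapped to realtime round-robin.
int choosePolicyForPriority (int priority)
{
    return priority >= 10 ? SCHED_RR : SCHED_OTHER;
}

// Hypothetical helper: apply that policy to the calling thread.
// Returns 0 on success, or the pthread error code (typically EPERM when the
// process is not allowed to use realtime scheduling).
int applyPriorityToCurrentThread (int priority)
{
    const int policy = choosePolicyForPriority (priority);

    sched_param param {};
    // SCHED_OTHER only accepts a static priority of 0;
    // for SCHED_RR this sketch simply uses the top of the allowed range.
    param.sched_priority = policy == SCHED_RR ? sched_get_priority_max (policy) : 0;

    return pthread_setschedparam (pthread_self(), policy, &param);
}
```

The point is only that a thread asking for the top priority ends up as a realtime thread, which is exactly the kind of thread for which the mutex protocol matters.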

The ALSA thread is using such a priority, and since the error that’s reported here is about sound code and PI, I wonder if this is somehow linked.
I don’t know the sound code well enough to be able to locate an issue, but I’m 100% sure the (1.53) audio code is running correctly on an RT kernel, since I’m using that in a product I’ve written.
There is nothing in the post stating that the Introjucer is having any issues.

I’m working with the developer of Xenomai at my office, and he knows a lot about RT kernels since he wrote one. The advice I’m giving above comes from him, and it proved useful for locating RT issues when I developed my product.

Just to clarify, this condition exists with ALL juce applications, Introjucer, example code, everything.

Also, I don’t think it is a requirement to be able to debug kernel issues for running a realtime kernel. If you are using any serious music software under Linux, Pianoteq, jack, ffado etc, the realtime kernel is the recommended option. Most apps run just fine with this kernel. To use ffado and jack with a firewire audio device reliably, the rt kernel is the only solution.

Umm… seems unlikely that Introjucer would be running a high-priority process!

AFAICT the fix for this is just to remove line 108 of MainHostWindow.cpp, so that the app doesn’t run at realtime priority?

I can confirm that Introjucer definitely also crashes the system. My small test App also has no setPriority calls and also crashes.

The fix suggested by falkTX earlier does work though (Thanks!), so if you intend to research it further that would be a good place to start looking. For now I’m just going to go with that suggestion, and will suggest Pianoteq adopt that too.

Can you post the crash stack so we can try to guess where it crashes?
What is your exact kernel version?
Did you try the suggestion from Jules?
What’s running on the computer at the same time?
Can you try to run the same software in an “init 1” mode (minimal single-user mode: run “init 1”, then “Xorg &”, then “Introjucer”) to check if it’s still crashing?

$ uname -a
Linux Euan_AMD64 3.0.9-rt25-rt25 #13 SMP PREEMPT RT Sat Mar 10 18:09:58 WST 2012 x86_64 AMD Phenom™ II X6 1055T Processor AuthenticAMD GNU/Linux

My machine is running Gentoo Linux.

There is very little running on the computer at the time - just some typical utility apps - dropbox etc. I use XFCE to keep my system as light as possible.

top reports about 600 MB usage out of a total RAM of 12 GB (Firefox is responsible for at least half of that).

Running from init 1 (+ /home manually mounted to enable normal user login) still causes a crash.

I’ve uploaded a screen photo showing the relevant details of the crash. Unfortunately I can’t do anything with the computer once it crashes, and even with “ulimit -c” I don’t get a core dump.

Let me know if you have any other ideas. Thanks for the effort.

As it is written “kernel bug”, I think one can hardly accuse juce of being buggy here. What I notice is that the suggested patch of falkTX reverts the PTHREAD_PRIO_INHERIT attribute on the mutexes for pthread condition variables (of juce::WaitableEvent), but it does not revert it for the “normal” mutexes of juce::CriticalSection. Maybe it is a bug specific to condition variables.

Anyway, since that patch seems to fix all issues related to this kernel bug, I would suggest that Jules applies it.

UPDATE: well… I have one user saying that he still has some freezes of the application ui (not the whole OS) so maybe that patch is not the silver bullet

No, the PI is required. If you remove it, you MUST forbid RT threads (because as soon as an RT thread takes a mutex, the whole computer is dead).
And if you forbid RT threads, the audio code suddenly drops samples depending on the CPU usage (I’m not speaking about video here, that’s even worse).

A kernel bug is likely not due to the application (well, sort of), so there is nothing you can do on the application. You should debug your kernel instead.
I can only help you debug the kernel. So, if you have the source of your kernel (you likely do), take a look at kernel/rtmutex.c:724.
You’ll have a line “BUG_ON(some condition)”.
Then search this line on google, it’s likely other users have hit that bug, and probably there is already a fix for it.

Also, if you have debug information in your Juce software, use addr2line to find out the file & line source code where the kernel crashed:

addr2line -e /path/to/your/juce/App 0xffffffff8130F3a8

or

addr2line -e /path/to/your/juce/App 0xffffffff81494897

Please post all the data here.

Did anyone manage to reproduce this freezing in a virtual machine? I tried running the linux-rt kernel of ubuntu lts 10.04 in virtualbox and it did not freeze.

(I also tried to set /proc/sys/kernel/sched_rt_runtime_us equal to /proc/sys/kernel/sched_rt_period_us on a regular linux distro, not in a VM, and it did not freeze)

@jpo, me neither, as I’m using the vanilla Juce code on my rt-kernel and it works ok.

OK, my progress so far:

I have had a look at the rtmutex.c code.

The affected line (724) is this…

BUG_ON(rt_mutex_owner(lock) == self); in the rt_spin_lock_slowlock function.

There was an attempt to report this as a bug a while back, but it has been defended as a valid check. It looks like the app is trying to obtain a spin lock twice from the same thread. The trail around this can be found at:
http://lkml.indiana.edu/hypermail/linux/kernel/0706.2/3258.html

However to test, I changed this to a WARN_ON call instead (compiled all, rebooted etc), and now it crashes at line 472 (Coincidentally a transposition of the same numbers - I had to double check that). This is now the result of another BUG_ON statement:
BUG_ON(rt_mutex_real_waiter(task->pi_blocked_on)); in the task_blocks_on_rt_mutex function.

I remain convinced that this is a fault in Juce somewhere, but am at a loss on how to progress.

Yes it’s the case.
This behaviour is forbidden unless the mutex is recursive.
But in the Juce code, you have “pthread_mutexattr_settype (&atts, PTHREAD_MUTEX_RECURSIVE);” (in juce_posix_SharedCode.h).
Usually, people using PI don’t use recursive mutexes at the same time, but Juce does.

So, this code path is probably not tested that much.
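
To illustrate the combination I’m talking about, here is a small sketch that sets both attributes on the same mutex and then locks it twice from one thread. The function names are hypothetical and this is only an approximation of the pattern, not the actual Juce CriticalSection code.

```cpp
#include <pthread.h>

// Hypothetical helper: a mutex that is both recursive and priority-inheriting,
// i.e. the combination discussed above. Returns 0 or a pthread error code.
int initRecursivePiMutex (pthread_mutex_t* mutex)
{
    pthread_mutexattr_t attr;
    int err = pthread_mutexattr_init (&attr);

    if (err != 0)
        return err;

    err = pthread_mutexattr_settype (&attr, PTHREAD_MUTEX_RECURSIVE);

    if (err == 0)
        err = pthread_mutexattr_setprotocol (&attr, PTHREAD_PRIO_INHERIT);

    if (err == 0)
        err = pthread_mutex_init (mutex, &attr);

    pthread_mutexattr_destroy (&attr);
    return err;
}

// Hypothetical helper: take the lock twice from the calling thread and release
// it again. Returns true only if the second (recursive) lock succeeded.
bool lockTwiceFromSameThread (pthread_mutex_t* mutex)
{
    if (pthread_mutex_lock (mutex) != 0)
        return false;

    const bool secondLockOk = pthread_mutex_lock (mutex) == 0;

    if (secondLockOk)
        pthread_mutex_unlock (mutex); // undo the recursive lock

    pthread_mutex_unlock (mutex);     // undo the first lock
    return secondLockOk;
}
```

With a recursive mutex the second lock is expected to succeed in user space; whether a given -rt kernel handles every contended case of this combination correctly is the open question in this thread.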
Anyway, have you tried the addr2line call I’ve written above so we can figure out the position in the Juce code that’s causing the issue?

I’m being told that after removing both uses of PTHREAD_PRIO_INHERIT in juce_posix_SharedCode (the one in WaitableEventImpl and the one in CriticalSection), everything seems to work.