Juce plugins cause realtime kernel lockup

Well, it depends on what “everything” means. If you’re testing the Introjucer, which doesn’t use any RT threads, of course it’s going to work.
Can you try with an RT thread that’s taking a mutex, something like this:

CriticalSection mutex;

class RT : public Thread
{
public:
     RT() : Thread ("RT") {}

     virtual void run()
     {
           setCurrentThreadPriority (10);
           while (! threadShouldExit())
           {
                 for (volatile int i = 0; i < 500000; i++);

                 {
                      ScopedLock scope (mutex);
                      for (volatile int i = 0; i < 1000000; i++);
                 }
           }
     }
};


class LowP : public Thread
{
public:
     LowP() : Thread ("LowP") {}

     virtual void run()
     {
           while (! threadShouldExit())
           {
                 {
                      ScopedLock scope (mutex);
                      for (volatile int i = 0; i < 1000000; i++);
                 }
           }
     }
};

Also, can you try disabling the recursive mutex attribute (the pthread_mutexattr_settype (&atts, PTHREAD_MUTEX_RECURSIVE) call) instead of disabling PI?
In POSIX, the behaviour of using a recursive mutex with a condition variable is undefined, so I wonder if the bug is due to this (instead of PI).
see: Reentrant mutex - Wikipedia
From there, it reads:
It is advised that an application should not use a PTHREAD_MUTEX_RECURSIVE mutex with condition variables because the implicit unlock performed for a pthread_cond_timedwait() or pthread_cond_wait() may not actually release the mutex (if it had been locked multiple times). If this happens, no other thread can satisfy the condition of the predicate.

This would be a major flaw that needs fixing in that case.
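To make the quoted failure mode concrete, here’s a minimal hypothetical sketch (plain POSIX, nothing to do with Juce itself): the mutex is locked twice, so the implicit unlock inside pthread_cond_wait() only releases one level, and the signalling thread can never satisfy the predicate:

[code]
// Hypothetical sketch of the undefined pattern: a recursive mutex,
// locked twice, used with a condition variable.
#include <pthread.h>

static pthread_mutex_t mutex;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static bool ready = false;

static void initMutex()
{
    pthread_mutexattr_t atts;
    pthread_mutexattr_init (&atts);
    pthread_mutexattr_settype (&atts, PTHREAD_MUTEX_RECURSIVE);
    pthread_mutex_init (&mutex, &atts);
    pthread_mutexattr_destroy (&atts);
}

static void* waiter (void*)
{
    pthread_mutex_lock (&mutex);            // lock count: 1
    pthread_mutex_lock (&mutex);            // lock count: 2 (recursive)

    while (! ready)                         // cond_wait only drops ONE level,
        pthread_cond_wait (&cond, &mutex);  // so the mutex stays held: we wait forever

    pthread_mutex_unlock (&mutex);
    pthread_mutex_unlock (&mutex);
    return nullptr;
}

static void* signaller (void*)
{
    pthread_mutex_lock (&mutex);   // never succeeds: the waiter still owns the mutex
    ready = true;
    pthread_cond_signal (&cond);
    pthread_mutex_unlock (&mutex);
    return nullptr;
}
[/code]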

No, I was not referring to the Introjucer, but a synth with SCHED_RR threads for MIDI and audio. The thing is, I’m trying to be pragmatic here: without the fix I’m told that the application immediately freezes the computer; with the fix I’m told that it works. It has also been working for a few years with an older version of JUCE that did not have this priority inheritance attribute. Maybe it is pure luck, or maybe the application just avoids the dangerous situations, I don’t know.

I can’t reproduce anything on my computers, so I can’t really try your suggestions; however, I don’t think JUCE uses recursive mutexes for condition variables – recursive is only used for its CriticalSections.

Yes, I’ve managed to reliably get this to lock in Arch RT using the JuceDemo app. In fact, on my VM, it’s as easy as just running multiple instances of the JuceDemo simultaneously. 100% of the time, this will lock immediately with the aforementioned rtmutex.c assertion:

[ 683.649543] ------------[ cut here ]------------
[ 683.649561] kernel BUG at kernel/rtmutex.c:472!
[ 683.649574] invalid opcode: 0000 [#1] PREEMPT SMP
[1072] exited with preempt_count 2

I’m admittedly way out of my comfort zone when it comes to RT kernel debugging, but I will take a look at X-Ryl669’s suggestions regarding PTHREAD_MUTEX_RECURSIVE and see what I can come up with…

I’m also way out of my depth when it comes to RT linux, so am eagerly waiting for you guys to come to a consensus on this one!

Nothing to add, but I can confirm what edekock has been saying. The only slight difference being that, even without a kernel patch to change the original assert from a BUG_ON to a WARN_ON, I get both kernel.log errors: first at rtmutex.c:472 and then rtmutex.c:724. These correspond to the following lines, with a little preceding code for context:

rtmutex.c:472 = BUG_ON(rt_mutex_real_waiter(task->pi_blocked_on));

[code]
/*
 * Task blocks on lock.
 *
 * Prepare waiter and propagate pi chain
 *
 * This must be called with lock->wait_lock held.
 */
static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
                                   struct rt_mutex_waiter *waiter,
                                   struct task_struct *task,
                                   int detect_deadlock)
{
        struct task_struct *owner = rt_mutex_owner(lock);
        struct rt_mutex_waiter *top_waiter = waiter;
        unsigned long flags;
        int chain_walk = 0, res;

        raw_spin_lock_irqsave(&task->pi_lock, flags);

        /*
         * In the case of futex requeue PI, this will be a proxy
         * lock. The task will wake unaware that it is enqueueed on
         * this lock. Avoid blocking on two locks and corrupting
         * pi_blocked_on via the PI_WAKEUP_INPROGRESS
         * flag. futex_wait_requeue_pi() sets this when it wakes up
         * before requeue (due to a signal or timeout). Do not enqueue
         * the task if PI_WAKEUP_INPROGRESS is set.
         */
        if (task != current && task->pi_blocked_on == PI_WAKEUP_INPROGRESS) {
                raw_spin_unlock_irqrestore(&task->pi_lock, flags);
                return -EAGAIN;
        }

        BUG_ON(rt_mutex_real_waiter(task->pi_blocked_on));
[/code]

rtmutex.c:724 = BUG_ON(rt_mutex_owner(lock) == self);

[code]
/*
 * Slow path lock function spin_lock style: this variant is very
 * careful not to miss any non-lock wakeups.
 *
 * We store the current state under p->pi_lock in p->saved_state and
 * the try_to_wake_up() code handles this accordingly.
 */
static void noinline __sched rt_spin_lock_slowlock(struct rt_mutex *lock)
{
        struct task_struct *lock_owner, *self = current;
        struct rt_mutex_waiter waiter, *top_waiter;
        int ret;

        rt_mutex_init_waiter(&waiter, true);

        raw_spin_lock(&lock->wait_lock);
        init_lists(lock);

        if (__try_to_take_rt_mutex(lock, self, NULL, STEAL_LATERAL)) {
                raw_spin_unlock(&lock->wait_lock);
                return;
        }

        BUG_ON(rt_mutex_owner(lock) == self);
[/code]

As suggested, I’ve also tried changing the CriticalSection mutex to remove the PTHREAD_MUTEX_RECURSIVE attribute. With this gone, the JuceDemo doesn’t even get as far as displaying its GUI, which I assume means Juce does currently rely on recursive mutexes. No kernel lock though, which could suggest either that it is an incompatibility between PI-aware and recursive mutexes, or (perhaps more likely?) that the offending code no longer gets a chance to run due to an early deadlock.

Any suggestions on how I should proceed? Does anyone know for sure if the recursive and priority-inheritance attributes are mutually exclusive? Unless anyone has a better suggestion, perhaps it might be a sensible next step for me to try and replicate this (perhaps forbidden) mutex use in a stand-alone (Juce-less) app, as sketched below.
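For reference, creating a mutex with both attributes in a stand-alone test (a hypothetical sketch, not taken from the Juce source) would look something like:

[code]
// Hypothetical: a mutex that is both recursive and PI-aware.
#include <pthread.h>

pthread_mutex_t mutex;

void createRecursivePIMutex()
{
    pthread_mutexattr_t atts;
    pthread_mutexattr_init (&atts);
    pthread_mutexattr_settype (&atts, PTHREAD_MUTEX_RECURSIVE);
    pthread_mutexattr_setprotocol (&atts, PTHREAD_PRIO_INHERIT);
    pthread_mutex_init (&mutex, &atts);
    pthread_mutexattr_destroy (&atts);
}
[/code]

Hammering a mutex created like this from several SCHED_RR threads (as in the earlier RT/LowP example) would, in theory, reproduce the assertion without any Juce involvement.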

Actually, I don’t think it is a problem with combining the recursive and priority inheritance attributes for a mutex. PulseAudio uses this configuration (see http://cgit.freedesktop.org/pulseaudio/pulseaudio/tree/src/pulsecore/mutex-posix.c) and behaves itself under an RT kernel. I really thought we were onto something here too: darn it!

So, … any other thoughts, ideas, or suggestions from anyone?

The code from PulseAudio doesn’t show both attributes used at the same time. I guess that’s the case elsewhere in the code, but I don’t want to spend time searching for it.

Now, I’ve 2 questions:

  • Why are recursive mutexes required in Juce?
  • What approach is there in PA that’s not in Juce (or vice versa)?

From the kernel code you’ve posted, it seems that the first bug means that the mutex waiter must be the low-priority task that the high-priority task is blocked upon.
In the second case, it’s much the same issue.
To me, from a bird’s-eye view, it seems that a thread is doing some weird operation with the mutex (like maybe changing priorities while a mutex is locked).
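To make that concrete, the sort of (hypothetical) pattern I have in mind would be something like:

[code]
// Hypothetical pattern under suspicion: changing a thread's priority
// while it already holds a PI-enabled mutex, forcing the kernel to
// re-walk the PI boosting chain at an awkward moment.
{
    ScopedLock scope (mutex);                 // PI mutex is now held
    Thread::setCurrentThreadPriority (10);    // priority change mid-hold
    // ... critical section work ...
}
[/code]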

I wonder if you’ve had time to spot the part of the code in Juce that’s triggering this issue?
You can find it with the addr2line tip I wrote a few posts before. You might then be able to spot the code/pattern in Juce that’s causing a deadlock.
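(Assuming a kernel image built with debug symbols, resolving an address from the trace looks something like this:)

[code]
addr2line -f -e vmlinux c01767e1
# prints the function name and source file:line for that address
[/code]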

Took me a little searching too, but PulseAudio does use recursive PI mutexes, in thread-mainloop.c.

Bearing in mind that it seems this combination isn’t necessarily such a bad thing, I think perhaps (for the time being, at least), and in the interest of not performing too much unneeded Juce surgery, I’d rather leave the mutexes recursive and look elsewhere. With the Windows CriticalSection being immutably recursive, I’m not sure having Juce’s CriticalSection object behave differently on different systems is such a hot idea. If all else fails, it’s definitely a possibility for investigation, but I think perhaps as a last resort.

I’m afraid my kernel crash stack revealed nothing: not a single address corresponded to any Juce code. I do also have a kernel call trace, but I can’t see anything useful here:

Apr 30 11:22:26 (none) kernel: [ 1440.549114] Call Trace:
Apr 30 11:22:26 (none) kernel: [ 1440.549130] [<c01767e1>] ? lookup_pi_state+0x171/0x230
Apr 30 11:22:26 (none) kernel: [ 1440.549152] [<c01799e6>] rt_mutex_start_proxy_lock+0x46/0xa0
Apr 30 11:22:26 (none) kernel: [ 1440.549171] [<c017782a>] futex_requeue+0x44a/0x760
Apr 30 11:22:26 (none) kernel: [ 1440.549192] [<c0178728>] do_futex+0x88/0x8e0
Apr 30 11:22:26 (none) kernel: [ 1440.549212] [<c0179049>] sys_futex+0xc9/0x130
Apr 30 11:22:26 (none) kernel: [ 1440.549256] [<c02184d9>] ? sys_read+0x59/0x70
Apr 30 11:22:26 (none) kernel: [ 1440.549299] [<c0474100>] syscall_call+0x7/0xb

But I think your suggestion is sensible: I’ll approach this from the perspective of the Juce code and see if I can narrow it down to a specific set of Juce code or classes, or try and get a call stack from the Juce app. Changing priorities on a locked mutex? Hmm, you could be right: that’s certainly something else for me to look into.

I’ll keep this updated when I’ve delved a little deeper…

I hope I’m not being premature or doing anything silly, but I suspect the latest rtmutex.c patch fixes this. If anyone else is up for some kernel building this weekend, it would be great to have a second source confirm or deny this.

On my Arch Linux test machine, which I could previously lock up, always and instantly, by simply running several instances of JuceDemo, I built the latest RT kernel (specifically, linux-rt 3.2.16_rt27-1 from https://aur.archlinux.org/packages/li/linux-rt/linux-rt.tar.gz), which includes this patch:

[code]
--- lock_rtmutex.c	2012-04-30 11:35:45.000000000 +0100
+++ rtmutex.c	2012-05-11 18:36:17.000000000 +0100
@@ -75,7 +75,8 @@
 
 static int rt_mutex_real_waiter(struct rt_mutex_waiter *waiter)
 {
-	return waiter && waiter != PI_WAKEUP_INPROGRESS;
+	return waiter && waiter != PI_WAKEUP_INPROGRESS &&
+		waiter != PI_REQUEUE_INPROGRESS;
 }
 
 /*
@@ -1289,7 +1290,7 @@
 
 	debug_rt_mutex_init(lock, name);
 }
-EXPORT_SYMBOL_GPL(__rt_mutex_init);
+EXPORT_SYMBOL(__rt_mutex_init);
 
 /**
  * rt_mutex_init_proxy_locked - initialize and lock a rt_mutex on behalf of a
@@ -1353,6 +1354,35 @@
 	return 1;
 }
 
+#ifdef CONFIG_PREEMPT_RT_FULL
+	/*
+	 * In PREEMPT_RT there's an added race.
+	 * If the task, that we are about to requeue, times out,
+	 * it can set the PI_WAKEUP_INPROGRESS. This tells the requeue
+	 * to skip this task. But right after the task sets
+	 * its pi_blocked_on to PI_WAKEUP_INPROGRESS it can then
+	 * block on the spin_lock(&hb->lock), which in RT is an rtmutex.
+	 * This will replace the PI_WAKEUP_INPROGRESS with the actual
+	 * lock that it blocks on. We must not place this task
+	 * on this proxy lock in that case.
+	 *
+	 * To prevent this race, we first take the task's pi_lock
+	 * and check if it has updated its pi_blocked_on. If it has,
+	 * we assume that it woke up and we return -EAGAIN.
+	 * Otherwise, we set the task's pi_blocked_on to
+	 * PI_REQUEUE_INPROGRESS, so that if the task is waking up
+	 * it will know that we are in the process of requeuing it.
+	 */
+	raw_spin_lock_irq(&task->pi_lock);
+	if (task->pi_blocked_on) {
+		raw_spin_unlock_irq(&task->pi_lock);
+		raw_spin_unlock(&lock->wait_lock);
+		return -EAGAIN;
+	}
+	task->pi_blocked_on = PI_REQUEUE_INPROGRESS;
+	raw_spin_unlock_irq(&task->pi_lock);
+#endif
+
 	ret = task_blocks_on_rt_mutex(lock, waiter, task, detect_deadlock);
 
 	if (ret && !rt_mutex_owner(lock)) {
[/code]
This machine has now been running for over an hour, solid as a rock, with JuceDemo instances being intensely tested, and with no sign of any kernel lock-ups. And I’ve double checked that I’ve not accidentally changed anything else: I’m using an unpatched Juce branch and still have PI mutexes enabled. Uname confirms that I’m definitely still running an RT kernel, specifically “Linux (none) 3.2.16-rt27-1-rt #1 SMP PREEMPT RT”.

Unless I’m missing something obvious, or kernel changes have made an easily hit race condition more obscure, I’d say that this was a kernel bug all along.

I have upgraded my RT kernel based on the comments above from CPB, and can confirm that I too do not see the kernel crashes anymore. Looks like we’ve got past this problem.

FYI, I upgraded to 3.2.17:
AMD64 3.2.17-rt28 #1 SMP PREEMPT RT Sat May 19 00:40:04 WST 2012 x86_64 AMD Phenom™ II X6 1055T Processor AuthenticAMD GNU/Linux

Thanks to everyone for looking at this!

Great, thanks for the update! I’ve had another tester confirm that they too no longer experience this freeze since they installed the new RT kernel. Three independent confirmations gives me confidence that this is, indeed, fixed.

Great. I feared you’d end up removing the PI feature, which I absolutely need.

for the record, there is a lot of serious misinformation in this thread, and for posterity’s sake, i’d like to correct it.

  1. the bug here was a kernel bug, not anything in JUCE or specific plugins. This became clear by the end of the thread, but that should have been clear from the outset: user space code CANNOT cause kernel crashes without the presence of a kernel bug. This is just a matter of definition.

  2. it is absolutely NOT true that it is necessary to enable priority inheritance on pthread mutexes in order to avoid deadlocks. code that would deadlock in this fashion without priority inheritance is almost always incorrectly written. there is a tool named “lockdep” which can be used to analyse locking patterns and will identify lock dependencies like the one described by XRyl669. when they are discovered, they should be fixed. if you are writing code and find yourself considering lock dependencies like this, go back to the drawing board, because something has gone wrong.

  3. recursive mutexes are generally considered extremely poor design. if you don’t believe me, start here: http://zaval.org/resources/library/butenhof1.html TL;DR: "Recursive mutexes can be a great tool for prototyping thread support in an existing library, exactly because it lets you defer the hard part: the call path and data dependency analysis of the library. But for that same reason, always remember that you’re not DONE until they’re all gone, so you can produce a library you’re proud of, that won’t unnecessarily constrain the concurrency of the entire application."

  4. taking locks in a realtime thread is nearly always a sign of the wrong design. from a quick scan of the JUCE code in question, it is not clear why JUCE is not using a ringbuffer (a.k.a. FIFO), which is lock-free for single-reader/single-writer use, rather than a mutex. this may be too superficial a scan of the code, so i apologize if i’ve missed something. (see the sketch below.)
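for the curious, here’s a minimal sketch of such a single-reader/single-writer lock-free FIFO. it’s illustrative only, not JUCE’s actual implementation (JUCE’s AbstractFifo serves a similar purpose); the class name and layout are hypothetical:

[code]
// Illustrative single-reader/single-writer lock-free FIFO.
// The writer thread only calls push(), the reader only calls pop();
// under that contract no mutex is needed.
#include <atomic>
#include <cstddef>

template <typename T, size_t Capacity>    // Capacity must be a power of two
class SpscFifo
{
public:
    bool push (const T& value)            // writer thread only
    {
        const size_t w = writePos.load (std::memory_order_relaxed);
        if (w - readPos.load (std::memory_order_acquire) == Capacity)
            return false;                 // full
        buffer[w & (Capacity - 1)] = value;
        writePos.store (w + 1, std::memory_order_release);
        return true;
    }

    bool pop (T& value)                   // reader thread only
    {
        const size_t r = readPos.load (std::memory_order_relaxed);
        if (r == writePos.load (std::memory_order_acquire))
            return false;                 // empty
        value = buffer[r & (Capacity - 1)];
        readPos.store (r + 1, std::memory_order_release);
        return true;
    }

private:
    T buffer[Capacity];
    std::atomic<size_t> writePos { 0 }, readPos { 0 };
};
[/code]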

btw, huge fan of JUCE. wish we had used it for Ardour from the start (which would have benefitted both Ardour and JUCE, I think)

Thanks @dawhead for your clarifications. It is a really interesting thread to read, but mostly barking up the wrong tree.

This is crucial and important. If you in any way manage to crash the kernel from userland, the problem exists in the kernel and that is where you should debug. Never blame user applications for causing kernel problems.

Very interested in a follow up on this from Jules and others.

Never Too Late To Do The Right Thing? :wink:

[quote]Very interested in a follow up on this from Jules and others.[/quote]

I’m not actually sure which locks you’re talking about there…?

It is of course very good advice in general. But there are places where I’d say it’s OK - e.g. I’ve used a lock in the device manager, to lock access to the source that’s currently being played. In that case, other threads will only grab the lock for a few microseconds while they change a pointer, and they’ll only do this very rarely. Under those conditions, the chances of the realtime thread actually getting interrupted are so infinitesimal that it just wouldn’t justify a more complicated solution. (And in the case of the device manager, even if there was a tiny glitch when playback starts or stops, nobody would hear it.) But if you find places where you do think I’ve misused a lock, do let me know!
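(For illustration, the pattern being described is roughly the following; this is a simplified sketch, not the actual AudioDeviceManager code, and the function names here are hypothetical:)

[code]
// Simplified sketch of the pattern described above: the audio callback
// holds the lock for the duration of a block, while other threads only
// take it for the handful of instructions needed to swap a pointer.
CriticalSection sourceLock;
AudioSource* currentSource = nullptr;

void renderNextBlock (const AudioSourceChannelInfo& info)  // realtime thread
{
    const ScopedLock sl (sourceLock);
    if (currentSource != nullptr)
        currentSource->getNextAudioBlock (info);
}

void setSource (AudioSource* newSource)   // called rarely, from other threads
{
    const ScopedLock sl (sourceLock);     // held for only a few microseconds
    currentSource = newSource;
}
[/code]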

Well, read this:

and why PI solves it (with the usual no-warranty sign):

You can’t fix other software/library code so you have to deal with it.