SIGSEGV in postMessageToSystemQueue callstack

By suspending/resuming the app a number of times I can always generate a crash:



I/DEBUG   (  294): *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
I/DEBUG   (  294): Build fingerprint: 'lge/geehrc_open_eu/geehrc:4.4.2/KOT49I.E97520a/E97520a.1403273885:user/release-keys'
I/DEBUG   (  294): Revision: '11'
I/DEBUG   (  294): pid: 3543, tid: 3919, name: Thread-25129  >>> se.dirac.myappl <<<
I/DEBUG   (  294): signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr fffffff0
I/DEBUG   (  294):     r0 650fd460  r1 1d2004ae  r2 57b4a908  r3 62d3da40
I/DEBUG   (  294):     r4 61396b5c  r5 65586940  r6 00000000  r7 61b8df44
I/DEBUG   (  294):     r8 62d3db10  r9 61b8df3c  sl 650fd018  fp 62d3da1c
I/DEBUG   (  294):     ip fffffff0  sp 62d3da08  lr 60a2e6ec  pc fffffff0  cpsr 20070010
I/DEBUG   (  294):     d0  4419c00043bfc000  d1  c419c000c419c000
I/DEBUG   (  294):     d2  43c0000043c00000  d3  4128000041a2aaab
I/DEBUG   (  294):     d4  000000004145200d  d5  457c000043c00000
I/DEBUG   (  294):     d6  43c0000000000001  d7  3f8000004419a000
I/DEBUG   (  294):     d8  4410800000000242  d9  0000000043873e00
I/DEBUG   (  294):     d10 0000000000000000  d11 0000000000000000
I/DEBUG   (  294):     d12 0000000000000000  d13 0000000000000000
I/DEBUG   (  294):     d14 0000000000000000  d15 0000000000000000
I/DEBUG   (  294):     d16 0000030000000000  d17 000004ce00000000
I/DEBUG   (  294):     d18 725f64656c706d61  d19 6f745f7265646e65
I/DEBUG   (  294):     d20 657275747865745f  d21 5f5458455f4c4720
I/DEBUG   (  294):     d22 75625f726f6c6f63  d23 6f6c665f72656666
I/DEBUG   (  294):     d24 36c3e9434058af91  d25 3fb10d8904a2188e
I/DEBUG   (  294):     d26 bfadde5fcf28f884  d27 bf56c087e80f1e27
I/DEBUG   (  294):     d28 0100010001000100  d29 0100010001000100
I/DEBUG   (  294):     d30 ffffffffffffffff  d31 ffffffffffffffff
I/DEBUG   (  294):     scr 20000013
I/DEBUG   (  294):
I/DEBUG   (  294): backtrace:
I/DEBUG   (  294):     #00  pc fffffff0  <unknown>
I/DEBUG   (  294):     #01  pc 0067f6e8  /data/app-lib/se.dirac.myappl-1/libjuce_jni.so (_JNIEnv::CallVoidMethodV(_jobject*, _jmethodID*, std::__va_list)+56)
I/DEBUG   (  294):     #02  pc 0067f9a4  /data/app-lib/se.dirac.myappl-1/libjuce_jni.so (juce::GlobalRef::callVoidMethod(_jmethodID*, ...) const+60)
I/DEBUG   (  294):     #03  pc 00b64e14  /data/app-lib/se.dirac.myappl-1/libjuce_jni.so (juce::MessageManager::postMessageToSystemQueue(juce::MessageManager::MessageBase*)+76)
I/DEBUG   (  294):     #04  pc 00b5e748  /data/app-lib/se.dirac.myappl-1/libjuce_jni.so (juce::MessageManager::MessageBase::post()+72)
I/DEBUG   (  294):     #05  pc 00b5f08c  /data/app-lib/se.dirac.myappl-1/libjuce_jni.so (juce::MessageManagerLock::attemptLock(juce::Thread*, juce::ThreadPoolJob*)+332)
I/DEBUG   (  294):     #06  pc 00b5eea4  /data/app-lib/se.dirac.myappl-1/libjuce_jni.so (juce::MessageManagerLock::MessageManagerLock(juce::Thread*)+44)
 

Any ideas why this happens ? I'm using OpenGL and in JUCEApplication::suspend I detach and deactivate the OpenGL context, and reactivate and attach it on JUCEApplication::resume...

I am banging my head against this one as well (as it is often difficult to reproduce on the hardware I have, but not apparently for others).  I've been experimenting with detaching and reattaching the GL context on suspend and resume, sometimes after some timer delays to help mitigate things.

 

The  message manager lock is taken during every opengl context render() call, and I wonder if the thread context it is called from affects this, especially since it is calling out to the Java end during the crash.  I can't say I fully understand what the message manager stuff is doing with the conditionals related to thread context, and how the java runtime needs to be involved here.

Out of curiosity robiwan, what code are you using to detach/attach and deactivate/reactivate in suspend and resume?

Also, perhaps this post has a clue as well:

  http://www.juce.com/comment/309350#comment-309350

Perhaps the fact that the openGL thread context from the java appears to be a different thread than the thread associated with the main JNIEnv is the issue?  Hinted at here at the beginning of the dumps that I've been getting:

 

JNI DETECTED ERROR IN APPLICATION: thread Thread[28,tid=9702,Native,Thread*=0xb4e0b800,peer=0x12c8f5b0,"GLThread 13562"] using JNIEnv* from thread Thread[22,tid=9699,Runnable,Thread*=0xaee0f800,peer=0x12ef0140,"Thread-13561"]

F/art     ( 9329): art/runtime/check_jni.cc:65]     in call to CallVoidMethodV^M

F/art     ( 9329): art/runtime/check_jni.cc:65]     from void com.sonosaurus.tonalenergytunertest.TETActivity$OpenGLView.render()^M

F/art     ( 9329): art/runtime/check_jni.cc:65] "GLThread 13562" prio=5 tid=28 Runnable^M

F/art     ( 9329): art/runtime/check_jni.cc:65]   | group="main" sCount=0 dsCount=0 obj=0x12c8f5b0 self=0xb4e0b800^M

F/art     ( 9329): art/runtime/check_jni.cc:65]   | sysTid=9702 nice=0 cgrp=apps sched=0/0 handle=0xafe01580^M

F/art     ( 9329): art/runtime/check_jni.cc:65]   | state=R schedstat=( 25892970 20870624 65 ) utm=0 stm=2 core=3 HZ=100^M

F/art     ( 9329): art/runtime/check_jni.cc:65]   | stack=0x9d2c2000-0x9d2c4000 stackSize=1036KB^M

F/art     ( 9329): art/runtime/check_jni.cc:65]   | held mutexes= "mutator lock"(shared held)^M

F/art     ( 9329): art/runtime/check_jni.cc:65]   native: #00 pc 00004640  /system/lib/libbacktrace_libc++.so (UnwindCurrent::Unwind(unsigned int, ucontext*)+23)^M

F/art     ( 9329): art/runtime/check_jni.cc:65]   native: #01 pc 00002e8d  /system/lib/libbacktrace_libc++.so (Backtrace::Unwind(unsigned int, ucontext*)+8)^M

F/art     ( 9329): art/runtime/check_jni.cc:65]   native: #02 pc 0023f45d  /system/lib/libart.so (art::DumpNativeStack(std::__1::basic_ostream<char, std::__1::char_traits<char> >&, int, char const*, art::mirror::ArtMethod*)+68)^M

F/art     ( 9329): art/runtime/check_jni.cc:65]   native: #03 pc 0022461f  /system/lib/libart.so (art::Thread::Dump(std::__1::basic_ostream<char, std::__1::char_traits<char> >&) const+146)^M

F/art     ( 9329): art/runtime/check_jni.cc:65]   native: #04 pc 000af913  /system/lib/libart.so (art::JniAbort(char const*, char const*)+582)^M

F/art     ( 9329): art/runtime/check_jni.cc:65]   native: #05 pc 000b004d  /system/lib/libart.so (art::JniAbortF(char const*, char const*, ...)+60)^M

F/art     ( 9329): art/runtime/check_jni.cc:65]   native: #06 pc 000b30c5  /system/lib/libart.so (art::ScopedCheck::ScopedCheck(_JNIEnv*, int, char const*)+1132)^M

F/art     ( 9329): art/runtime/check_jni.cc:65]   native: #07 pc 000ba617  /system/lib/libart.so (art::CheckJNI::CallVoidMethodV(_JNIEnv*, _jobject*, _jmethodID*, std::__va_list)+30)^M

F/art     ( 9329): art/runtime/check_jni.cc:65]   native: #08 pc 004442c4  /data/app/com.sonosaurus.tonalenergytunertest-1/lib/arm/libjuce_jni.so (_JNIEnv::CallVoidMethodV(_jobject*, _jmethodID*, std::__va_list)+56)^M

F/art     ( 9329): art/runtime/check_jni.cc:65]   native: #09 pc 0044430c  /data/app/com.sonosaurus.tonalenergytunertest-1/lib/arm/libjuce_jni.so (juce::GlobalRef::callVoidMethod(_jmethodID*, ...) const+60)^M

F/art     ( 9329): art/runtime/check_jni.cc:65]   native: #10 pc 006bcfa0  /data/app/com.sonosaurus.tonalenergytunertest-1/lib/arm/libjuce_jni.so (juce::MessageManager::postMessageToSystemQueue(juce::MessageManager::MessageBase*)+68)^M

F/art     ( 9329): art/runtime/check_jni.cc:65]   native: #11 pc 006b63b0  /data/app/com.sonosaurus.tonalenergytunertest-1/lib/arm/libjuce_jni.so (juce::MessageManager::MessageBase::post()+72)^M

F/art     ( 9329): art/runtime/check_jni.cc:65]   native: #12 pc 006b6d5c  /data/app/com.sonosaurus.tonalenergytunertest-1/lib/arm/libjuce_jni.so (juce::MessageManagerLock::attemptLock(juce::Thread*, juce::ThreadPoolJob*)+340)^M

F/art     ( 9329): art/runtime/check_jni.cc:65]   native: #13 pc 006b6b74  /data/app/com.sonosaurus.tonalenergytunertest-1/lib/arm/libjuce_jni.so (juce::MessageManagerLock::MessageManagerLock(juce::Thread*)+44)^M

F/art     ( 9329): art/runtime/check_jni.cc:65]   native: #14 pc 0093ca78  /data/app/com.sonosaurus.tonalenergytunertest-1/lib/arm/libjuce_jni.so (juce::OpenGLContext::CachedImage::renderFrame()+180)^M

F/art     ( 9329): art/runtime/check_jni.cc:65]   native: #15 pc 0093f194  /data/app/com.sonosaurus.tonalenergytunertest-1/lib/arm/libjuce_jni.so (juce::OpenGLContext::NativeContext::renderCallback()+100)^M

F/art     ( 9329): art/runtime/check_jni.cc:65]   native: #16 pc 0093c074  /data/app/com.sonosaurus.tonalenergytunertest-1/lib/arm/libjuce_jni.so (Java_com_sonosaurus_tonalenergytunertest_TETActivity_00024OpenGLView_render+56)^M

F/art     ( 9329): art/runtime/check_jni.cc:65]   native: #17 pc 00001b1b  /data/dalvik-cache/arm/data@app@com.sonosaurus.tonalenergytunertest-1@base.apk@classes.dex (Java_com_sonosaurus_tonalenergytunertest_TETActivity_00024OpenGLView_render__+82)^M

F/art     ( 9329): art/runtime/check_jni.cc:65]   at com.sonosaurus.tonalenergytunertest.TETActivity$OpenGLView.render(Native method)^M

F/art     ( 9329): art/runtime/check_jni.cc:65]   at com.sonosaurus.tonalenergytunertest.TETActivity$OpenGLView.onDrawFrame(TETActivity.java:639)^M

F/art     ( 9329): art/runtime/check_jni.cc:65]   at android.opengl.GLSurfaceView$GLThread.guardedRun(GLSurfaceView.java:1531)^M

F/art     ( 9329): art/runtime/check_jni.cc:65]   at android.opengl.GLSurfaceView$GLThread.run(GLSurfaceView.java:1248)^M

F/art     ( 9329): art/runtime/check_jni.cc:65] ^M


(this is on Android 5.0 on a samsung Note4 dev device) 

Ah, and the thread 13651 referred to in the error message which the JNIEnv was attached was the audio thread (OpenSL-based).  Somehow the calls to the JVM from the message manager within the opengl thread was trying to use the wrong JNIEnv.  It looks like android::getEnv() (used in callVoidMethod, etc) should be returning an appropriate cached or created JNIEnv based on the calling thread, but something weird is clearly going on here.

 

Just rambling, in case it helps others in tracking this issue down...

Atm I use:


        m_context.detach();
        m_context.deactivateCurrentContext();

and for resume:


        m_context.makeActive();
        m_context.attachTo(*this);

where m_context is the OpenGLContext and this is the main app window.

 

OK, I think I've solved it!

The problem is that the JVM is sometimes re-using pthread contexts for the openGL thread used for the callbacks, but with a different JNI environment.  JUCE's ThreadLocalJNIEnvHolder (used by android::getEnv()) is caching JNIEnv's based on pthread_self() and so sometimes (often) the cached value will be invalid in the opengl case.  I added the following public method to ThreadLocalJNIEnvHolder in juce_core/native/juce_android_JNIHelpers.h:

void uncacheCurrentEnv() noexcept
{
    const pthread_t thisThread = pthread_self();

    SpinLock::ScopedLockType sl (addRemoveLock);

    for (int i = 0; i < maxThreads; ++i)
    {
        if (threads[i] == thisThread)
        {
            envs[i] = nullptr;
            threads[i] = 0;
        }
    }
}

And in juce_opengl/native/juce_OpenGL_android.h in the contextCreated JNI callback (line 162) now looks like this:

JUCE_JNI_CALLBACK (GL_VIEW_CLASS_NAME, contextCreated, void, (JNIEnv* env, jobject view))
{
    // need to clear any existing cached in case this thread was reused by the JVM
    threadLocalJNIEnvHolder.uncacheCurrentEnv();
    threadLocalJNIEnvHolder.getOrAttach();

    if (OpenGLContext::NativeContext* const context = OpenGLContext::NativeContext::findContextFor (env, view))
        context->contextCreatedCallback();
    else
        jassertfalse;
}

 

I wonder if this caching could be inviting issues in other parts of the system, if the Android JVM is behaving this way?

Interesting, thanks! I've committed some changes that should do the same job - hopefully it'll work for you, let me know if it doesn't!

Thanks Jules, your commit should work just fine!