OpenGL deactivateCurrentContext Windows CPU

Hello,

Using OpenGL to render components causes a lot of calls to deactivateCurrentContext(), which on Windows runs wglMakeCurrent(0, 0). This is an expensive call.

It looks like this call is made unconditionally after rendering each frame of each component, and then the context is made current again before the next. It should only be necessary when setting up a render, not when tearing it down, because the context never needs to be explicitly cleared first: wglMakeCurrent function (wingdi.h) - Win32 apps | Microsoft Docs.

Also, there is no check to see whether this call is even necessary; it seems JUCE is always sharing the context, but the thread is changing. If the last thread/context made active is the same as the current one, there would be no need to call it. A quick check seems like it would be worthwhile. Would it be better to use a single thread? That’s too much of a patch for me to try to put together without devoting major time, sadly.
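To illustrate the kind of quick check I mean, something roughly like this (untested; makeActiveIfNeeded is just a name I made up, not anything in JUCE):

// Untested sketch: skip the switch when the right context is already current
// on the calling thread. wglGetCurrentContext() and wglGetCurrentDC() are
// per-thread queries, so the check itself is cheap.
static bool makeActiveIfNeeded (HDC dc, HGLRC renderContext) noexcept
{
    if (wglGetCurrentContext() == renderContext && wglGetCurrentDC() == dc)
        return true;  // already current on this thread, nothing to do

    return wglMakeCurrent (dc, renderContext) != FALSE;
}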

Has anybody come up with a patch or solution to this problem? I have searched a lot and see a few other people mention this, but no public solutions.

-Sam

1 Like

So perhaps it is necessary to deactivate in the current design, but according to this at least, lots of calls to wglMakeCurrent are unnecessary and indicate the design could be improved. Salient quote:


If you use multiple threads but a single GL context, you must push the context around from thread to thread by making it uncurrent in one thread, and making it current again in the new thread, and so on. But that scheme should almost never be necessary. For multi-threaded GL, you should create multiple shared contexts, and then you usually need one wglMakeCurrent per thread.

I’ll look into https://www.khronos.org/registry/OpenGL/extensions/KHR/KHR_context_flush_control.txt to see if it can be used to improve the situation.
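For reference, the WGL side of that extension would look roughly like this; a sketch assuming WGL_ARB_create_context is available (the enum values are from the extension spec; dc is the window's device context and contextToShareWith is just a placeholder):

// Sketch: ask for "no implicit flush on release" via KHR_context_flush_control.
// wglCreateContextAttribsARB comes from WGL_ARB_create_context and has to be
// fetched with wglGetProcAddress before use.
#define WGL_CONTEXT_RELEASE_BEHAVIOR_ARB       0x2097
#define WGL_CONTEXT_RELEASE_BEHAVIOR_NONE_ARB  0x0000

const int attribs[] =
{
    WGL_CONTEXT_RELEASE_BEHAVIOR_ARB, WGL_CONTEXT_RELEASE_BEHAVIOR_NONE_ARB,
    0  // attribute list terminator
};

HGLRC rc = wglCreateContextAttribsARB (dc, contextToShareWith, attribs);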

Are you sure that the context is actually used from multiple threads?

First of all, some background:

The device context HDC and the render context HGLRC are created on the message thread.
juce_opengl\native\juce_OpenGL_win32.h : OpenGLContext::NativeContext

Then a threadpool with one job acquires the MessageManager lock, per frame, and renders the components. → juce_opengl\opengl\juce_OpenGLContext.cpp : OpenGLContext::CachedImage::renderFrame()

So at least two threads are involved. It was abstracted this way because Android (and iOS?) require an extra render thread.

Now, I could have sworn that, last time I checked, it was necessary to call both wglMakeCurrent(dc, renderContext) and wglMakeCurrent(nullptr, nullptr) per frame.
But looking at it now with some debug messages:

static void deactivateCurrentContext()
{
    DBG ("Thread ID " + String ((intptr_t) Thread::getCurrentThreadId()) + " : wglMakeCurrent (nullptr, nullptr)");
    wglMakeCurrent (nullptr, nullptr);
}

bool makeActive() const noexcept
{
    DBG ("Thread ID " + String ((intptr_t) Thread::getCurrentThreadId()) + " : wglMakeCurrent (dc, renderContext)");
    return isActive() || wglMakeCurrent (dc, renderContext) != FALSE;
}

The output is this:

25448 = Message Thread
29580 = GL Render Thread

Thread ID 25448 :  wglMakeCurrent (dc, renderContext)
Thread ID 25448 : wglMakeCurrent (nullptr, nullptr)

Thread ID 29580 :  wglMakeCurrent (dc, renderContext)
Thread ID 29580 : wglMakeCurrent (nullptr, nullptr)
Thread ID 29580 :  wglMakeCurrent (dc, renderContext)
Thread ID 29580 : wglMakeCurrent (nullptr, nullptr)
Thread ID 29580 :  wglMakeCurrent (dc, renderContext)
Thread ID 29580 : wglMakeCurrent (nullptr, nullptr)
Thread ID 29580 :  wglMakeCurrent (dc, renderContext)
Thread ID 29580 : wglMakeCurrent (nullptr, nullptr)
...

Even on window resize or close, after the initial setup on 25448, the context is only used by 29580. Which means all the re-acquisition is unnecessary, right? At least it seems that way, or are there cases where the message thread could/must suddenly intervene without waiting?

Now, one could argue: “But what about multiple windows!?”
Well, you can only attach one OpenGLContext to one Component at a time. And it’s mostly the TopLevelWindow. If multiple windows are used, and no sharing is required, each root component can and should use its own OpenGLContext.
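In code, the usual pattern per window is just the standard attach/detach API:

// One context per top-level component: no hand-off of a single HGLRC
// between windows is needed. attachTo() and detach() are the standard
// juce::OpenGLContext API.
class MainComponent : public juce::Component
{
public:
    MainComponent()            { context.attachTo (*this); }
    ~MainComponent() override  { context.detach(); }

private:
    juce::OpenGLContext context;
};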

The question is, when can multiple threads call wglMakeCurrent with the same render context? Is this really possible right now? Is it only done this way for the rare case of sharing context resources?

@reuk

2 Likes

@parawave Since posting this I have been digesting your comments in many other threads, thank you for your research into all things rendering, Vulkan and otherwise!

Most posts agree that one context is the way to go (within a single window), even when you render your own stuff in renderOpenGL(). @fr810 @PluginPenguin @timart have all touched on this in the past, each providing code for classes that handle sharing the OpenGLContext with locking, clipping, and sending things to the right thread. This understanding isn’t easy to grok without a fair amount of digging and doesn’t really come across in the tutorial.

Given your debug output, it seems to me the experiment to do (on Windows, at least) is to skip wglMakeCurrent(nullptr, nullptr), as I can’t see a difference between clear-then-activate and simply activate. Then you can put a conditional around the wglMakeCurrent() in activate based on the last thread ID, which I guess could be stored in an atomic int that starts off at -1. That’s something I may actually be able to patch in reasonable time. If I get a chance to try I will report back (pass/fail/new understanding).
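Roughly what I have in mind (untested; makeActiveChecked and lastActiveThread are my names, nothing in JUCE):

#include <atomic>

// Untested sketch: remember which thread last made the context current,
// and only call wglMakeCurrent when the calling thread differs.
static std::atomic<intptr_t> lastActiveThread { -1 };

bool makeActiveChecked (HDC dc, HGLRC renderContext) noexcept
{
    auto thisThread = (intptr_t) Thread::getCurrentThreadId();

    if (lastActiveThread.load() == thisThread)
        return true;  // assume the context is still current on this thread

    if (wglMakeCurrent (dc, renderContext) == FALSE)
        return false;

    lastActiveThread.store (thisThread);
    return true;
}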

My other performance thoughts come from the message-thread locking issue touched on by @chkn. Their suggestion seems to be to draw paths in software on a dedicated thread and use the cached images from that on the OpenGL thread, with the goal of decoupling OpenGL from the message thread. It’s going to take me some time to work out whether that’s worthwhile before I dive into it.

2 Likes

I actually tested it by commenting out the OpenGLContext::deactivateCurrentContext(); after context.swapBuffers(), and by adding a check to context.makeActive() that skips the makeCurrent if it’s already active.

Sadly, there isn’t much of a difference. The thing is, wglMakeCurrent() is somewhat shrouded in fog.
It’s not clear what exactly happens, since it triggers all kinds of driver-level stuff. It implicitly flushes the previously submitted commands (command buffers), so it’s probably more realistic to glFlush() or glFinish() before measuring. And measuring in a debug build is also not a good idea. It’s complicated.
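Something along these lines for measuring, i.e. drain the command queue on both sides of the region of interest so the implicit flush isn’t attributed to some unrelated later call (renderSomething() is just a placeholder for the work being measured):

#include <chrono>

// Sketch: glFinish() blocks until the GPU has executed everything queued,
// so the measured interval covers the actual work rather than just the
// cost of queueing it up. Logger is used because DBG is stripped in release.
glFinish();  // drain previously queued commands first

auto t0 = std::chrono::steady_clock::now();
renderSomething();   // placeholder for the GL work being measured
glFinish();          // wait for the measured work to actually complete
auto t1 = std::chrono::steady_clock::now();

auto micros = std::chrono::duration_cast<std::chrono::microseconds> (t1 - t0).count();
Logger::outputDebugString ("render: " + String ((int64) micros) + " us");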

I mentioned this in BR: OpenGL is using a lot of CPU - #7 by parawave

If you test this with render code that actually does something, not an empty scene, the CPU used by the context switch will balance itself out. And it’s connected to V-Sync (swap interval) and frame time.

If someone wants to dive into this, I recommend looking at GLFW, SDL or SFML and seeing how they set up their wgl calls. Perhaps there is some magic GL extension that solves this?

Anyway, all of this could improve things ‘a bit’, probably. But in the end, if you really want to boost render performance, it will hardly change anything.

The real bottleneck sits in juce_OpenGLGraphicsContext.cpp, on the master branch of juce-framework/JUCE on GitHub:

L885: struct EdgeTableRenderer and its use:

template <typename IteratorType>
void add (const IteratorType& et, PixelARGB colour)
{
    EdgeTableRenderer<ShaderQuadQueue> etr (*this, colour);
    et.iterate (etr);
}

This is used by path and image drawing, converting scanlines to pixel quads. The actual generation of the edge table is performed by the same code used by the SoftwareRenderer, namely RenderingHelpers::SavedStateBase in juce_RenderingHelpers.h.

The IteratorType can be of type:

using EdgeTableRegionType = typename ClipRegions<SavedStateType>::EdgeTableRegion;

using RectangleListRegionType = typename ClipRegions<SavedStateType>::RectangleListRegion;

Basically, a RectangleList fill will end up as a single quad. But for everything else (paths, transformed images, gradients and the like, or as soon as clipping is involved), the expensive EdgeTables will be used.
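From memory, an EdgeTable iteration client looks roughly like this: one callback per pixel or scanline span, and in the GL renderer each of those ends up as a quad in the queue (the counting body is just mine for illustration; only the method names follow the EdgeTable::iterate contract):

// Rough shape of the EdgeTable::iterate callback contract (from memory).
// Every antialiased span the CPU rasteriser produces triggers one of these
// calls, which is why complex paths generate so many tiny quads.
struct CountingRenderer
{
    void setEdgeTableYPos (int y) noexcept                        { currentY = y; }
    void handleEdgeTablePixel (int x, int alpha) noexcept         { ++spans; }
    void handleEdgeTablePixelFull (int x) noexcept                { ++spans; }
    void handleEdgeTableLine (int x, int w, int alpha) noexcept   { ++spans; }
    void handleEdgeTableLineFull (int x, int w) noexcept          { ++spans; }

    int currentY = 0, spans = 0;
};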

All of this is done on the CPU, single-threaded and without caching. I have to say, it’s a really cool and elegant solution, but it’s not suited to GL. No blame here; it’s obviously not a trivial thing. Even Skia does all kinds of hacky stuff with different implementations to get minor boosts in the worst-case scenarios.

I wonder if it’s possible to create some kind of intermediate representation: skip the scanline edge-table rendering on the CPU and do all of it in a shader. Perhaps a vertex or geometry shader that creates the edge table on the fly. That would give a massive boost.

2 Likes

Hi @parawave, just a reminder that there is another call to wglMakeCurrent() via the checkError function at juce_opengl.cpp line 193. If you comment out its body, you should see improvements.

1 Like

Are you sure? I didn’t see any call to wglMakeCurrent in that function.
But it shouldn’t matter anyway. I printed out the use in OpenGLContext::NativeContext::makeActive and OpenGLContext::NativeContext::deactivateCurrentContext, which are the only functions that call wglMakeCurrent, so it’s definitely not called in that loop. Be sure to test in release, with an actual render load, and compare the GPU/CPU used.

@parawave @Fandusss in my case I’m doing some fairly heavy fragment shading: blur, bloom, particle effects, texture reads. The edge table rendering is taking 7%, but deactivateCurrentContext is showing a whopping 31%. This is in RelWithDebInfo.

So, I made @parawave’s changes, and indeed the time moved to glGetError, which does not call deactivateCurrentContext.

So, I made @Fandusss’s changes by straight up nop-ing clearGLError. It doesn’t do anything that looks time-consuming as far as I can tell, but heck, why not? And… the time moved to bindVertexArray, which doesn’t do much on the C++ side either.

I think it’s likely the flush @parawave mentions that is actually taking up the time, but I can’t see it because it would happen internally in the driver, inside one of the gl* calls. This makes sense:

  • wglMakeCurrent flushes before switching. We know this because of the context_flush_control extension that disables this behavior.

  • glGetError likely flushes as well; it’s got to run all commands on the GPU up to the point in the code where you call it, otherwise it would give you a stale result.

  • bindVertexArray likely flushes because, well, the driver has to finish up before changing the data underneath.

My understanding is limited, but it seems like this is hard to debug without looking into a graphics driver that has no debug symbols for us, and the result is going to be really specific to the machine. So, if @parawave is right, I’m guessing it’s not time spent in the C++ edge table code per se (mine still only shows about 7%, even with all these changes). If JUCE ultimately uses the edge tables to pass tons of scanlines as vertices, it’s not surprising that that’s the call that shows the heavy load. It’s also not surprising, then, that @chkn suggests that for high framerates you just render components in software on a different thread and use the resulting images directly in OpenGL as a texture.

I still wonder why changing a downstream line would cause the performance load to move to the “right place”; perhaps it’s flushing the operations that rely on the previous bind before loading the next, and removing all the other obstacles makes this the only call in the control flow that requires a flush?

While I’m not 100% convinced I know what’s going on, I’m temporarily taking my mind off of deactivateCurrentContext() because I’ve measured now that it’s not impacting overall performance.

Same issue here.

Currently, the swap interval isn’t respected and the plugin always hogs one CPU core at 100%, even with nothing to display. The profiler says it’s the deactivateCurrentContext issue. MSVC/Windows/x86_64.
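For anyone experimenting: a rough sketch of forcing the swap interval directly through WGL_EXT_swap_control, in case OpenGLContext::setSwapInterval() is being ignored by the driver (this has to run while the context is current on the calling thread):

// Sketch: set vsync via WGL_EXT_swap_control. The typedef matches the
// extension spec; wglGetProcAddress only returns a usable pointer once
// a GL context is current on this thread.
typedef BOOL (WINAPI* PFNWGLSWAPINTERVALEXTPROC) (int interval);

if (auto setInterval = (PFNWGLSWAPINTERVALEXTPROC) wglGetProcAddress ("wglSwapIntervalEXT"))
    setInterval (1);  // 1 = wait for one vertical blank per swap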