Interesting OpenGL discovery

I’ve discovered something interesting which could be useful for those using OpenGL (at least on macOS).

To query the time spent within my render callback I placed glBeginQuery(GL_TIME_ELAPSED, queryID) at the beginning of the callback and glEndQuery(GL_TIME_ELAPSED) followed by glGetQueryObjectui64v with GL_QUERY_RESULT at the end. This apparently minor change had a massively positive effect!
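For anyone who wants to try this, the pattern described above might look roughly like the following sketch. To keep it free of platform headers, the GL calls sit behind a hypothetical wrapper type `Gl`; its three member names, `renderTimed` and `toMilliseconds` are invented here for illustration, and the comments note which real OpenGL call each member is assumed to forward to. The query object is assumed to have been created once beforehand with glGenQueries.

```cpp
#include <cassert>
#include <cstdint>

// Sketch only. `Gl` is a hypothetical wrapper (not part of JUCE or OpenGL)
// whose members are assumed to forward to the real calls:
//   beginTimeElapsed(id) -> glBeginQuery(GL_TIME_ELAPSED, id)
//   endTimeElapsed()     -> glEndQuery(GL_TIME_ELAPSED)
//   blockingResult(id)   -> glGetQueryObjectui64v(id, GL_QUERY_RESULT, &ns)
template <typename Gl, typename RenderFn>
std::uint64_t renderTimed(Gl& gl, unsigned queryId, RenderFn&& render)
{
    assert(queryId != 0);              // 0 is never a valid query object name

    gl.beginTimeElapsed(queryId);      // start of the render callback
    render();                          // the callback's normal draw calls
    gl.endTimeElapsed();               // end of the measured region

    // Reading GL_QUERY_RESULT right away waits until the GPU has finished
    // the measured commands, so the CPU cannot run ahead of the GPU here.
    return gl.blockingResult(queryId); // elapsed GPU time in nanoseconds
}

// Convert the nanosecond result to milliseconds, e.g. to compare with a frame budget.
inline double toMilliseconds(std::uint64_t ns)
{
    return static_cast<double>(ns) / 1.0e6;
}
```

The immediate read of GL_QUERY_RESULT is the part that makes the CPU wait for the GPU once per frame.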

Before this change, as soon as the load on the GPU got too large, the framerate would suddenly drop to about one fps and the mouse would become almost unresponsive. Now the framerate decreases gradually, and the mouse and the system in general remain as responsive as ever, no matter the load on the GPU.

Does anybody know what could be going on or if this has some negative side effects which I haven’t noticed yet?

As an old OpenGL hacker, I find this an interesting finding. It could have a number of different causes. The first one that comes to mind is that the OpenGL driver optimises itself when it detects that you’re attempting to measure performance! :) A dark pattern? Perhaps.

The performance is probably smoother because the driver, having detected this query object, is not buffering GPU commands as aggressively, so the GPU and driver end up pacing themselves. Without the query object, you may be overwhelming the CPU/GPU synchronization pipeline with commands. Things get laggy because the GPU is struggling to work through its pipeline and isn’t yielding to the CPU in time to handle UI events.

Leave it in? Probably not wise: it’ll cover up a real performance issue in your code. Take it out, and work out what you’re doing that is causing a huge pileup of GPU commands somewhere …

One thing I’m curious about: does the behaviour change when you are on mains power versus on battery? Are you testing on a desktop system or a laptop? What other power-management events may be affecting CPU/GPU synchronisation in suboptimal ways …

Please note that these kinds of metrics are not broadly applicable to all hardware configurations, so it is dangerous to make an assessment on the basis of a single instance. It would be interesting if you could do the same performance measurements across a variety of hardware …

Thanks for the reply!

Yes, some sync issue seems to be at the heart of this. This explanation seems plausible (AI-generated):

"Before adding the query, your rendering loop might have been implicitly synchronizing with the GPU in a way that caused sudden stalls.

When the GPU gets overloaded, certain OpenGL calls (like glBufferSubData, glMapBuffer, or even glDrawElements) can block the CPU until the GPU catches up.

GL_TIME_ELAPSED queries introduce a form of implicit pipelining by allowing the GPU to finish rendering asynchronously before retrieving the result in a later frame.

This reduces the chance of a hard sync point where the CPU waits for the GPU, leading to a more gradual degradation of performance instead of sudden hitches."

Perhaps this points to something which could be improved within JUCE’s OpenGL implementation?

I’m working on a general-purpose GPU class, which I want to be able to handle anything the user throws at it. I know exactly what is causing the huge pileup of GPU commands – I am causing it intentionally!

I’m impressed by how much of a positive difference the function makes. OpenGL went from potentially slowing the entire system to being completely robust, and I can’t notice any negative effects under normal circumstances. Also, the function lets me measure the GPU load quite precisely, which is the reason I put it in there to begin with.

Without it, not only might the system suddenly get clogged up, I also don’t have a reliable way to know when that is about to happen.

I will try to test on more hardware. For now I have only tested on an (old) Intel MacBook Pro. Whether it is on battery or plugged in makes no difference. GL_TIME_ELAPSED is not supported on OpenGL ES, so I can’t test on mobile devices.

Have you used something like NVidia Nsight to get fine granularity on the timing differences with and without those function calls? It might help identify an inefficiency in your pipeline which can be fixed so it works without those query calls.

No, thanks for the suggestion!

It sounds like an implicit flush of the pipeline, which could create the wrong impression of performing better when in reality it’s just a coincidence that it appears to be smoother on your platform.
What is odd here is that, from your description, you seem to be calling only glBeginQuery and glEndQuery, which on their own shouldn’t trigger any flush or other change, since the result stays on the GPU side until you call glGetQueryObject with GL_QUERY_RESULT.

My suggestion is to actually read them and confirm the GPU times.

Ideally, first check GL_QUERY_RESULT_AVAILABLE and, once it is true, read with GL_QUERY_RESULT_NO_WAIT, to avoid this flushing and waiting. That means you should keep some queue of query objects to read asynchronously when they are ready. (glGetQueryObject - OpenGL 4 Reference Pages)
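A minimal sketch of such a queue, under a few assumptions: `QueryHooks` and `GpuTimerQueue` are hypothetical names, and the four hooks stand in for the real GL calls named in the comments so that the example compiles without GL headers. Since GL_QUERY_RESULT_NO_WAIT needs OpenGL 4.4 and macOS stops at 4.1, the sketch assumes a plain GL_QUERY_RESULT read once GL_QUERY_RESULT_AVAILABLE reports true, which should return without waiting.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical hooks; a real renderer would forward each to the GL call noted.
struct QueryHooks
{
    std::function<void(unsigned)> begin;           // glBeginQuery(GL_TIME_ELAPSED, id)
    std::function<void()> end;                     // glEndQuery(GL_TIME_ELAPSED)
    std::function<bool(unsigned)> available;       // glGetQueryObjectiv(id, GL_QUERY_RESULT_AVAILABLE, &flag)
    std::function<std::uint64_t(unsigned)> result; // glGetQueryObjectui64v(id, GL_QUERY_RESULT, &ns)
};

// Ring of query objects: one is written per frame, and the oldest is read
// only once the driver reports that its result is ready.
class GpuTimerQueue
{
public:
    // queryIds are assumed to come from glGenQueries.
    GpuTimerQueue(QueryHooks hooks, std::vector<unsigned> queryIds)
        : gl(std::move(hooks)), ids(std::move(queryIds))
    {
        assert(!ids.empty());
    }

    void beginFrame()
    {
        // If every query is still in flight, skip timing this frame
        // rather than reuse a query whose result has not been read.
        recording = pending < ids.size();
        if (recording)
            gl.begin(ids[head]);
    }

    void endFrame()
    {
        if (!recording)
            return;
        gl.end();
        head = (head + 1) % ids.size();
        ++pending;
        recording = false;
    }

    // Fetches the oldest finished measurement, if there is one, without waiting.
    bool poll(std::uint64_t& nanoseconds)
    {
        if (pending == 0 || !gl.available(ids[tail]))
            return false;
        nanoseconds = gl.result(ids[tail]);
        tail = (tail + 1) % ids.size();
        --pending;
        return true;
    }

private:
    QueryHooks gl;
    std::vector<unsigned> ids;
    std::size_t head = 0, tail = 0, pending = 0;
    bool recording = false;
};
```

In a render callback this would be used as beginFrame(), draw, endFrame(), then a loop over poll() to collect whatever timings have become ready. The reported times lag a frame or more behind, which is the price of not stalling.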

In my experience the actual render time, PCIe transfer and number of draw calls are so small that the only way to explain the occasionally sluggish rendering of the JUCE OpenGL implementation is many implicit flushes caused by some specific command. It is CPU bound.

The cause is either a massive sync because the big path vertex buffer is not orphaned when its data is set (Buffer Object Streaming - OpenGL Wiki), or frequent pixel transfers (CPU image to GPU upload), or any glGet function that queries state which is only known AFTER rendering a frame. Such a call forces an implicit flush at that point and can wait up to multiple frames, which would explain the high framerate fluctuation.
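For reference, orphaning as described on that wiki page amounts to re-specifying the buffer's storage with a null pointer before uploading the new data, so the driver can hand out fresh memory instead of waiting for draws that still read the old contents. A minimal sketch follows, assuming a hypothetical wrapper type `Gl` (the names `allocate`, `upload` and `streamVertices` are invented here) and a vertex buffer that is already bound. Whether JUCE's renderer actually needs this is not established in this thread.

```cpp
#include <cassert>
#include <cstddef>

// Sketch only. `Gl` is a hypothetical wrapper; its members are assumed to
// forward to the real calls:
//   allocate(bytes)            -> glBufferData(GL_ARRAY_BUFFER, bytes, nullptr, GL_STREAM_DRAW)
//   upload(offset, bytes, src) -> glBufferSubData(GL_ARRAY_BUFFER, offset, bytes, src)
// The vertex buffer is assumed to be bound to GL_ARRAY_BUFFER already.
template <typename Gl>
void streamVertices(Gl& gl, const void* vertices, std::size_t bytes)
{
    assert(vertices != nullptr && bytes > 0);

    // Orphan: request new storage of the same size with no data. The old
    // storage stays alive for any draw calls that are still using it.
    gl.allocate(bytes);

    // Fill the fresh storage; this no longer has to wait for the GPU.
    gl.upload(0, bytes, vertices);
}
```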

The fatal thing is that a single command at the wrong time really can be enough to degrade performance, causing entire frames to miss their “present”. That would explain the FPS drop, even though the actual render cost should be laughably small.

I also advocate using GL debuggers: RenderDoc, NVIDIA Nsight or even NVIDIA Nsight Systems. Nsight Systems is especially interesting, because you see VBlank, the time at which commands are queued and the present in relation to each other.

Yes, I am using that too, well spotted. I will edit my post above and include it.

No luck with those, I’m on macOS.

It may or may not be related, but by calling DwmFlush at the right times on Windows you get a bizarrely dramatic drop in CPU usage. It traces back to SwapBuffers doing some sort of busy-wait, or otherwise having a wait falsely attributed to it as CPU usage. The flush just replaces that with wait time, thus “removing” the CPU consumption.

My guess is this could be a similar situation on a different platform.

Interesting. But that just affects how the CPU usage is being reported, right?

In my tests, in both cases (with and without the time query), the CPU usage always remains low.

In your experience how does Windows respond if one does some very heavy GPU (but not CPU) processing in an OpenGL component? Does the responsiveness of the entire system (resizing windows, moving the mouse around, etc) usually remain unaffected?

I don’t recall it freezing the entire system, it only impacted the OpenGL thread which calls swap buffers.
