Direct2D Part Deux : 2 Direct 2 Furious

Hey everyone-

I revisited the Windows Direct2D renderer recently and I think it’s in pretty good shape now; see the previous thread for more context.

I was stuck on the window resizing before, but I believe I’ve finally found a good solution. The trick was to use Direct2D on top of DirectComposition. I found this forum posting on gamedev.net:

DXGI Flip Model Flickering During Live Resize

My thanks to “jbatez”, the original poster; that was very helpful.

If you’d like to check it out, I made a music visualizer VST plugin that can switch between software rendering and Direct2D. Here’s the repository with build instructions and examples:

Direct2D Demo Plugin

How well does it work?

Pretty well. CPU usage is much lower in Direct2D mode and text looks very nice. Plus, you can paint on-demand from a timer callback, VBlankAttachment callback, or a dedicated thread, which makes a big difference for animation.

What’s different now?

The Direct2D renderer now disables the redirection surface used for old-school GDI painting and instead renders to a DirectComposition visual that fills the entire window.

The Direct2D renderer requires Direct2D 1.2 (Windows 8.1 or later).

Other changes:

  • Implemented partial window repainting using dirty rectangles
  • DPI scaling is now done by the Direct2D device context
  • Fixed restoring the Direct2D color brush opacity when restoring the saved state
  • Added support for variable refresh rate displays (https://learn.microsoft.com/en-us/windows/win32/direct3ddxgi/variable-refresh-rate-displays)
  • Fixed drawGlyph with gradient brush
  • Each window now has its own DirectX factory to avoid the global Direct2D lock (cleaned up all sorts of lock contention delays)

Is it a big improvement over the software renderer?

The software renderer is great; you can run at 60 FPS in a 1000x1000 window and it keeps up just fine. Here’s a screenshot of the plugin doing just that:

However, the software renderer starts to struggle with a larger window; for example, here’s a 3440x1350 window at 60 FPS:

Zooming in on the statistics in the lower left corners shows:

Software-renderer-big-stats

It’s hitting around 50 FPS; rendering each frame is taking about 90% of the frame time on average. That means the message thread is mostly busy painting.

Now here’s the same window in Direct2D mode:

Direct2D-big-60FPS-stats

It’s easily keeping up at 60 FPS; rendering is now taking about 11% of each frame.
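To put those percentages in milliseconds, here’s a quick sketch. It assumes the percentages are measured against the 60 Hz frame interval, which the stats display doesn’t state explicitly, and the function names are made up for illustration. Under that assumption, 90% of a 60 Hz frame is about 15 ms of painting and 11% is about 1.8 ms.

```cpp
// Length of one frame, in milliseconds, at the given frame rate.
constexpr double frameBudgetMs (double framesPerSecond)
{
    return 1000.0 / framesPerSecond;
}

// Time spent painting, given the fraction of the frame budget that was used.
// Assumption: the fraction is relative to the target frame interval.
constexpr double paintTimeMs (double framesPerSecond, double fractionOfFrame)
{
    return frameBudgetMs (framesPerSecond) * fractionOfFrame;
}
```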

And here’s Direct2D at 120 FPS:

Direct2D-big-120FPS

Performance will, of course, be highly dependent on your CPU & GPU.

That’s just a single window. What if you have lots of windows at once?

Good question. Here’s 20 instances of the plugin rendering butter-smooth Direct2D at 100 FPS in the JUCE plugin host:

Zooming in on the stats:

The frame rate could probably go higher; in this case, the threaded renderer is driven by the WASAPI 10 ms block size, so 100 Hz is the practical limit.

The software renderer in this case will handle about 30 FPS and cannot keep up at 60 FPS.

You can also get 60 FPS with Direct2D by painting on the message thread from the VBlankAttachment callback, but it’s entirely possible to clog the message thread and make the app unresponsive.

Still seems like it’s using a lot of CPU even in Direct2D mode - is that the best we can do?

Even in Direct2D mode, the JUCE Graphics class still does significant work in software before calling the LowLevelGraphicsContext. For example, stroking a complex Path often involves the Graphics class breaking the original Path into small piecewise segments, then creating a second Path, which is in turn passed to the low-level renderer (check out Path::addCentredArc and Graphics::strokePath). Ideally all of that work would be done on the GPU.
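To give a feel for the kind of CPU-side work involved, here’s a minimal sketch of flattening an arc into straight segments. This is not the JUCE implementation; flattenArc, Point, the fixed angular step, and the clockwise-from-twelve-o’clock angle convention are all assumptions for illustration.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

struct Point { float x, y; };

// Hypothetical helper: approximates an elliptical arc with straight line
// segments. Angles are in radians, measured clockwise from the top (an
// assumed convention). Returns the segment end points, including both ends.
std::vector<Point> flattenArc (float centreX, float centreY,
                               float radiusX, float radiusY,
                               float fromRadians, float toRadians,
                               float stepRadians = 0.05f)
{
    std::vector<Point> points;

    if (stepRadians <= 0.0f)
        return points;

    // Use a whole number of equal steps so the last point lands exactly on the end angle.
    const float range = toRadians - fromRadians;
    const int steps = std::max (1, static_cast<int> (std::ceil (std::fabs (range) / stepRadians)));

    for (int i = 0; i <= steps; ++i)
    {
        const float angle = fromRadians + range * (static_cast<float> (i) / static_cast<float> (steps));
        points.push_back ({ centreX + radiusX * std::sin (angle),
                            centreY - radiusY * std::cos (angle) });
    }

    return points;
}
```

With the default step, a quarter circle becomes 32 segments, every one of which has to be generated on the CPU before the renderer sees anything.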

Also - allocating GPU resources is expensive; it’s much better to set up and reuse the same Direct2D objects (Improving the performance of Direct2D apps - Win32 apps | Microsoft Learn). Say you call Graphics::fillRoundedRectangle; the Graphics class will allocate a Path object, add a rounded rectangle to the Path, tell the renderer to draw a filled path, then free the original Path. In Direct2D mode, that Path is converted to a Geometry, which is a Direct2D GPU resource. The Geometry is then rendered and freed. That all makes path rendering much more costly than it needs to be; creating the Path takes time, the Path to Geometry conversion takes more time, and pushing the Geometry into the GPU even longer. Pre-creating, retaining, and reusing the Geometry would be much more efficient.
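The retain-and-reuse idea could look roughly like this. It’s a portable sketch, not code from the renderer: Geometry is a stand-in for an expensive-to-create resource, and RoundedRectGeometryCache and its methods are hypothetical names.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <tuple>

// Stand-in for a resource that is expensive to create (for example, a Direct2D geometry).
struct Geometry
{
    float width, height, cornerSize;
};

// Hypothetical cache: creates each rounded-rectangle geometry once and reuses it afterwards.
class RoundedRectGeometryCache
{
    using Key = std::tuple<float, float, float>;

public:
    std::shared_ptr<Geometry> get (float width, float height, float cornerSize)
    {
        const Key key { width, height, cornerSize };
        const auto existing = cache.find (key);

        if (existing != cache.end())
            return existing->second;    // cache hit: no allocation or conversion

        ++creationCount;                // cache miss: pay the creation cost once
        auto geometry = std::make_shared<Geometry> (Geometry { width, height, cornerSize });
        cache.emplace (key, geometry);
        return geometry;
    }

    std::size_t getCreationCount() const { return creationCount; }

private:
    std::map<Key, std::shared_ptr<Geometry>> cache;
    std::size_t creationCount = 0;
};
```

A real version would also need some eviction policy so the cache doesn’t grow without bound; that’s left out here.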

A JUCE LowLevelGraphicsContext really only does a few things; it can fill rectangles, fill a Path, draw an Image, draw a Line, or draw text. The rest of the methods handle clipping and transparency and such. To get better Direct2D performance, the LowLevelGraphicsContext will need to be extended to handle more drawing operations and support retained GPU resources.

These changes would make a big difference with Direct2D, but of course there are the questions of portability and breaking existing code. More on this topic to come…

What’s all this about painting on another thread?

Direct2D allows you to paint from any thread on demand. So instead of waiting for a timer or VBlank notification, calling repaint, and then waiting for Windows to tell you it’s time to paint, you just do it right away. This alone makes a big difference in reducing frame jitter and cuts down the load on the message loop.

Mozilla Firefox switched to off-message-thread painting a few years ago for similar reasons: Off-Main-Thread Painting – Mozilla Gfx Team Blog

But, of course, the JUCE component hierarchy is definitely not thread safe and painting off the message thread is a huge change. The demo plugin does support painting on a dedicated thread, but it’s largely experimental to show that it can be done.

Known issues

  • If Nvidia G-Sync is enabled for windowed apps, the mouse will stutter and lag when moving over a Direct2D-enabled window (not a JUCE-specific issue). I recommend turning G-Sync off for windowed mode; there may be some way to register an app with the Nvidia driver to disable G-Sync for that app.
  • There may still be bugs lurking with creating and destroying windows and the DPI not matching; need to investigate further.
  • Colors look slightly different in Direct2D mode; it may just require a gamma adjustment.

Let me know what you all think. I could definitely use more testers! I'll have more to show soon.

Matt

Hi Matt, fantastic!! Great that D2D is receiving some love in the JUCE universe; I’ve tried the half-baked state of things in the official releases every now and then, but of course without success. I downloaded your demo project and the fork and projuced the solution file as suggested. Unfortunately, I get a handful of compile errors; for example, it complains that ComponentPeer has no member ‘measuredPaintIntervalSeconds’, plus a few more serious ones. Could you check, maybe? Anyway, great work!!

Apologies; I’ve been trying to track down performance issues and it’s sort of in bits all over the garage floor. I’ll put together a clean build.

Matt

OK, it should be in reasonable shape now. I’ll post an update shortly.

https://github.com/mattgonzalez/JUCE/tree/direct2d

Matt

Hi everyone-

I’ve made quite a few changes since the May 23rd update; I’ve added more drawing primitives, cleaned up the clipping, and moved the swap chain presentation into a separate thread. I’m focusing on getting rid of bottlenecks and stalls; the JUCE TableListBox demo has proven especially vexing since it can draw a lot of text all at once.

Here’s a list of all the changes:

  • The renderer now builds a Direct2D command list from the message thread and then passes the command list to a dedicated presentation thread
  • Reworked the window update region code so that Direct2D works more like the software renderer
  • Fixed window resizing (again!)
  • Added Direct2D support for various graphics primitives
    • Drawing and filling rounded rectangles
    • Drawing and filling ellipses
    • Path stroking
    • Drawing rectangles
  • Added support for drawing an entire glyph run at once
  • Added support for horizontal text justification
  • Removed unnecessary repaints while the window is only moving and not resizing

Be sure to #define these flags for your project:

#define JUCE_DIRECT2D 1
#define JUCE_WAIT_FOR_VBLANK 0

Why add a dedicated render thread?

The simplest form of drawing with Direct2D works like this:


direct2D->BeginDraw();          // start queuing up GPU commands
direct2D->FillRectangle(...);   // add a GPU command
// ...more drawing...           // add more GPU commands
direct2D->EndDraw();            // render the GPU commands on the back buffer
swapChain->Present(1, 0);       // display the back buffer on the next vblank (swap the back & front buffers)

The call to swapChain->Present will block until the next vertical blank interval, which can be many milliseconds. You can pass a flag telling the swap chain to swap right now without waiting, but that resulted in a lot of ugly flickering. So I think presenting the swap chain will have to be done in a separate thread to avoid stalling the message thread.

This is more or less the inverse of the existing JUCE VSyncThread, which waits for the vblank and then triggers painting. Instead, with Direct2D, the renderer paints, and then tells the swap chain to present the painted window at the next vblank. I’m not sure if the two approaches are compatible. For now, I’ve added a JUCE_WAIT_FOR_VBLANK compilation flag that disables the JUCE VSyncThread and have been testing with that flag disabled.

I tried quite a few different approaches; ultimately, I ended up building a Direct2D command list (https://learn.microsoft.com/en-us/windows/win32/api/d2d1_1/nn-d2d1_1-id2d1commandlist) from the message thread and then passing the command list to the presentation thread to be rendered and then presented.

As soon as the call to Present returns, the renderer checks to see if there are any areas to repaint and starts working on the next presentation. So if everything keeps up, the window can paint at the monitor refresh rate.
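The handoff between the two threads can be sketched in portable C++ like this. It is a simplified illustration, not the renderer’s code: CommandList here is just a vector of callables standing in for a recorded Direct2D command list, PresentationThread is a hypothetical name, and the spot where a real renderer would draw and call Present is only marked with a comment.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Stand-in for a recorded command list: a batch of drawing operations.
using CommandList = std::vector<std::function<void()>>;

class PresentationThread
{
public:
    PresentationThread() : worker ([this] { run(); }) {}

    ~PresentationThread()
    {
        {
            std::lock_guard<std::mutex> lock (mutex);
            shouldExit = true;
        }

        condition.notify_one();
        worker.join();
    }

    // Called from the message thread. Returns immediately; a list that has
    // not been picked up yet is replaced by the newer one.
    void submit (CommandList commands)
    {
        {
            std::lock_guard<std::mutex> lock (mutex);
            pending = std::move (commands);
            hasPending = true;
        }

        condition.notify_one();
    }

private:
    void run()
    {
        for (;;)
        {
            CommandList commands;

            {
                std::unique_lock<std::mutex> lock (mutex);
                condition.wait (lock, [this] { return hasPending || shouldExit; });

                if (! hasPending)
                    return;             // exit only once nothing is left to present

                commands = std::move (pending);
                pending.clear();
                hasPending = false;
            }

            // Runs outside the lock, so the message thread is never stalled here.
            for (auto& command : commands)
                command();

            // A real renderer would present the swap chain at this point,
            // which may block until the next vblank.
        }
    }

    std::mutex mutex;
    std::condition_variable condition;
    CommandList pending;
    bool hasPending = false;
    bool shouldExit = false;
    std::thread worker;                 // declared last so it starts after the other members exist
};
```

The point of the design is that the only work done under the lock is swapping the list, so the blocking part happens entirely on the worker thread.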

What’s this about the update region?

Turned out the Direct2D renderer was not properly checking the window update region and was painting a lot more than it needed to. The Direct2D renderer should now work more like the software renderer and only repaint the areas of the window that have changed. That made a big performance difference!
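As a rough illustration of why this matters, here’s a sketch that clips a set of dirty rectangles to the window and adds up the pixels that actually need repainting. Rect, intersect, clipUpdateRegion and totalArea are hypothetical names for this example, not the renderer’s own types.

```cpp
#include <algorithm>
#include <vector>

struct Rect
{
    int x, y, width, height;

    int right() const    { return x + width; }
    int bottom() const   { return y + height; }
    bool isEmpty() const { return width <= 0 || height <= 0; }
};

// Intersection of two rectangles; the result has zero size if they do not overlap.
Rect intersect (const Rect& a, const Rect& b)
{
    const int left   = std::max (a.x, b.x);
    const int top    = std::max (a.y, b.y);
    const int right  = std::min (a.right(), b.right());
    const int bottom = std::min (a.bottom(), b.bottom());

    return { left, top, std::max (0, right - left), std::max (0, bottom - top) };
}

// Keeps only the parts of the dirty rectangles that lie inside the window.
std::vector<Rect> clipUpdateRegion (const std::vector<Rect>& dirty, const Rect& window)
{
    std::vector<Rect> result;

    for (const auto& rect : dirty)
    {
        const Rect clipped = intersect (rect, window);

        if (! clipped.isEmpty())
            result.push_back (clipped);
    }

    return result;
}

// Total number of pixels covered, assuming the rectangles do not overlap each other.
long long totalArea (const std::vector<Rect>& rects)
{
    long long area = 0;

    for (const auto& rect : rects)
        area += static_cast<long long> (rect.width) * rect.height;

    return area;
}
```

For a 1000x1000 window with a couple of small dirty areas, the region to repaint is a few thousand pixels instead of a million.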

Any known issues?

  • Various Direct2D calls that normally take microseconds will sometimes take several milliseconds; I think they are stalling in the GPU, but need to figure out why.
  • juce::VBlankAttachment doesn’t work at the moment
  • Drawing text could still be faster

Matt

I’m very interested in this, as drawing performance on Windows is one of the biggest complaints we get. It would be great if this gets to the point where it can be merged into JUCE.

Just wanted to say keep up the good work :+1:

Thanks, Dave!

Matt

In that same spirit, your 2019 ADC session with Fabian Renn-Giles on real-time techniques and the farbot library proved invaluable for the new D2D presentation thread. I ended up using a hybrid of the techniques shown; your flowchart was really helpful.

https://www.youtube.com/watch?v=PoZAo2Vikbo

Much appreciated!

Matt

The JUCE team is also keeping a keen eye on this.

I’ve just done some very quick tests to see how the renderer performs vs the software renderer. This absolutely isn’t a proper benchmark as setting those up requires a lot of effort, but it’s instructive nevertheless.

The test machine is an old Intel i7-8750H Windows laptop with a GeForce GTX 1060 graphics card attached to a 3840 x 2160 external display.

The test itself is opening DemoRunner full screen on the external display on the landing page (no text updating, drawing at 30Hz) and constantly wiggling the leftmost slider on the WidgetsDemo → Sliders page. CPU usage taken from the process listing in Task Manager.

The results are: CPU usage, time spent in each handlePaintMessage call, the rate that handlePaintMessage is called.

Landing page:

Software: 17%, 34 ms, 30 Hz
Direct2D: 25%, 37 ms, 30 Hz

Wiggling slider:

Software: 1%, 19 ms, 50 Hz
Direct2D: 8%, 24 ms, 40 Hz

For a new renderer to be included in JUCE it needs to show significant performance gains in typical use cases like these. At the moment, unfortunately, it looks like we’re still some way off. Is there something I’m missing? Do you see similar numbers with a similar test?

There also appears to be a memory leak in its current state, but I kept the testing sessions short so it shouldn’t be a factor.

My two cents:

a) the speed of the new renderer should also be evaluated in a broader range of use cases, e.g. painting (rescaled) bitmaps, not just vector drawing primitives.

b) it’s not just the speed that counts, but also the quality of the rendering: the native macOS renderer does a much better job at drawing rescaled bitmaps than the Windows software renderer. If Direct2D has a comparable speed but with a much nicer looking result than the current software renderer, I’d say it’s totally worth it!

Hi Tom-

Thanks for taking the time to check this out. Looks like we are seeing significantly different results, so let’s see if we can dig deeper.

I’m running Windows 10 on an Intel Core i9-9900K CPU @ 3.60GHz with a GeForce RTX 2080 Ti into a 3440x1440 120 Hz monitor.

I have another branch of JUCE called direct2d-develop that timestamps each paint call. I rebuilt DemoRunner with that branch. I tried to reproduce your test; here’s what I’m seeing:

Landing page:

Software: 3.1% CPU, handlePaintMessage 9.2 ms average, 12.2 ms max
Direct2D: 2.1% CPU, handlePaintMessage 0.6 ms average, 3.1 ms max

Wiggling slider:

Software: 1% CPU, handlePaintMessage 2.3 ms average, 12.3 ms max
Direct2D: 1.5% CPU, handlePaintMessage 0.2 ms average, 0.9 ms max
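For reference, average and maximum figures like these can be collected with a small accumulator wrapped around each paint call. This is a portable sketch using std::chrono::steady_clock, not the actual instrumentation in the direct2d-develop branch; PaintStats and timePaint are hypothetical names.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>

// Accumulates the average and maximum duration of repeated paint calls.
class PaintStats
{
public:
    void addSample (double milliseconds)
    {
        total += milliseconds;
        maximum = std::max (maximum, milliseconds);
        ++count;
    }

    double getAverageMs() const       { return count > 0 ? total / static_cast<double> (count) : 0.0; }
    double getMaximumMs() const       { return maximum; }
    std::size_t getSampleCount() const { return count; }

private:
    double total = 0.0;
    double maximum = 0.0;
    std::size_t count = 0;
};

// Times one call of the given function with a monotonic clock and records the result.
template <typename PaintFunction>
void timePaint (PaintStats& stats, PaintFunction&& paint)
{
    const auto start = std::chrono::steady_clock::now();
    paint();
    const auto elapsed = std::chrono::steady_clock::now() - start;

    stats.addSample (std::chrono::duration<double, std::milli> (elapsed).count());
}
```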

So - let’s figure out the source of the disparity.

What Windows version are you running on your laptop?

How are you measuring the paint time? QueryPerformanceCounter in the code, or some external profiling or tracing tool?

Can you please elaborate on the memory leak you’re seeing?

What commit are you building?

What preprocessor flags are you setting?

I built the DemoRunner with these flags:

JUCE_DIRECT2D=1
JUCE_WAIT_FOR_VBLANK=0
JUCE_DIRECT2D_METRICS=1
JUCE_STRING_UTF_TYPE=16

Here’s the executable:
https://www.dropbox.com/s/kdmjbvn3a5wbywy/JUCEDemoRunner_D2D_2023-June-29.zip?dl=0

There’s plenty of room for performance improvements still. I’ll try some lower-spec machines and see what I get. We could also try running a build that’s been instrumented with ETW.

Matt

I think we can have both better image quality and better performance.

As far as that goes, try the GraphicsDemo in the DemoRunner I just posted. Check out the Images: RGB Tiled and ARGB Tiled. Try both the software and Direct2D modes.

Matt

1 million times this!

I would love to get rid of the workaround I’ve added using AVIR from Gin in order to make my Windows images look comparable to the macOS version (although I’d still have to keep it for the Linux version, I think… :person_shrugging:).

The direct2d-develop branch is much more promising! I’m getting numbers similar to yours and the memory leak is gone :tada:

Our next couple of months are pretty busy, but we’ll then look into this much more thoroughly.

Excellent! I was anticipating a more difficult debugging process. Sounds like I need to do some merging.

Thanks again for trying this out.

Matt

Is this required to get the performance gains when rendering text? We prefer to stick with UTF-8 if possible.

DirectWrite will only display UTF-16, so UTF-8 strings will need to be converted first. But that’s not a big CPU hit. It may not be an issue in practice; let’s see how the profiling looks.
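To show how little work the conversion involves, here’s a simplified, portable sketch of a UTF-8 to UTF-16 decoder. In practice you’d use an existing routine such as juce::String::toUTF16() or the Win32 MultiByteToWideChar function; utf8ToUtf16 below is a hypothetical helper written only for illustration, and it does not reject overlong encodings.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Converts UTF-8 to UTF-16. Malformed or truncated sequences are replaced
// with U+FFFD. Simplified: overlong encodings are not detected.
std::u16string utf8ToUtf16 (const std::string& utf8)
{
    std::u16string out;
    std::size_t i = 0;

    while (i < utf8.size())
    {
        const auto lead = static_cast<unsigned char> (utf8[i]);
        std::uint32_t codePoint = 0xFFFD;   // default: replacement character
        std::size_t length = 1;

        if (lead < 0x80)                { codePoint = lead;        length = 1; }
        else if ((lead & 0xE0) == 0xC0) { codePoint = lead & 0x1F; length = 2; }
        else if ((lead & 0xF0) == 0xE0) { codePoint = lead & 0x0F; length = 3; }
        else if ((lead & 0xF8) == 0xF0) { codePoint = lead & 0x07; length = 4; }

        bool valid = (i + length <= utf8.size());

        // Fold in the continuation bytes, each carrying six bits.
        for (std::size_t k = 1; valid && k < length; ++k)
        {
            const auto next = static_cast<unsigned char> (utf8[i + k]);

            if ((next & 0xC0) != 0x80)
                valid = false;
            else
                codePoint = (codePoint << 6) | (next & 0x3Fu);
        }

        // Reject values outside Unicode and encoded surrogates.
        if (! valid || codePoint > 0x10FFFF || (codePoint >= 0xD800 && codePoint <= 0xDFFF))
        {
            codePoint = 0xFFFD;
            length = 1;                     // skip one byte and resynchronise
        }

        if (codePoint < 0x10000)
        {
            out.push_back (static_cast<char16_t> (codePoint));
        }
        else
        {
            // Code points above the BMP become a surrogate pair.
            codePoint -= 0x10000;
            out.push_back (static_cast<char16_t> (0xD800 + (codePoint >> 10)));
            out.push_back (static_cast<char16_t> (0xDC00 + (codePoint & 0x3FF)));
        }

        i += length;
    }

    return out;
}
```

It’s a single linear pass over the bytes, which is consistent with the conversion not being a big CPU hit.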

If you have a string to display that can be reused, ideally we’d convert that into a DirectWrite text layout in the GPU. Then you’d just have the initial setup cost and it’s very efficient for both the GPU & the CPU.

Matt

Actually UTF-8 vs UTF-16 might be a nonissue depending on which Graphics method you are calling to render text. Let me dig into this a little more.

Matt

I’ve updated the direct2d branch to match the direct2d-develop branch.

Changes since June 24th:

  • Fixed drawing open-ended Path
  • Fixed painting bitmaps
  • The internal saved state stack now uses a std::stack
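For anyone curious what a std::stack-based saved state looks like in general, here’s a minimal sketch. It is not the renderer’s actual class; RenderState, its fields, and SavedStateStack are hypothetical and only show the save/restore round trip.

```cpp
#include <stack>

// Hypothetical subset of renderer state that save/restore must round-trip.
struct RenderState
{
    float brushOpacity = 1.0f;
    float translateX = 0.0f;
    float translateY = 0.0f;
};

class SavedStateStack
{
public:
    RenderState& current() { return currentState; }

    // Pushes a copy of the current state.
    void save() { saved.push (currentState); }

    // Restores the most recently saved state; returns false if nothing was saved.
    bool restore()
    {
        if (saved.empty())
            return false;

        currentState = saved.top();
        saved.pop();
        return true;
    }

private:
    RenderState currentState;
    std::stack<RenderState> saved;
};
```

Because the whole state is copied on save, anything changed afterwards (brush opacity included) goes back to its saved value on restore.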

Matt

I tried building the JUCE TableListBox demo with both UTF-8 and UTF-16 and didn’t really see any performance difference.

Matt
