Graphics rendering performance and optimization


We are facing a significant performance regression in graphics rendering since Juce v2 (approximately since the big refactor last summer).

Because it is really impacting our products, I did some profiling and analysis of the current code in order to locate where we are losing time now.

My first finding is a very simple yet very effective tiny change in two internal methods of EdgeTableFillers::ImageFill. Basically, the current code branches on the same condition for each pixel. I just hoisted the test to the top of the method, and the result is a significant performance improvement. I looked at the generated assembly, and it was no contest...
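To show the pattern concretely, here is a minimal sketch (simplified stand-in code, not the actual EdgeTableFillers::ImageFill source): the condition is loop-invariant, so testing it once outside the loop lets the compiler generate a tight, branch-free inner loop for each case.

```cpp
#include <cstdint>
#include <cstddef>

// Before: the branch is evaluated once per pixel, even though its result
// never changes within the span. This defeats vectorisation.
static void fillSpanBranchy (uint8_t* dest, const uint8_t* src,
                             size_t n, bool replaceExisting)
{
    for (size_t i = 0; i < n; ++i)
    {
        if (replaceExisting)                          // same outcome every iteration
            dest[i] = src[i];
        else
            dest[i] = (uint8_t) ((dest[i] + src[i]) / 2);
    }
}

// After: test once, then run a branch-free loop in each arm.
static void fillSpanHoisted (uint8_t* dest, const uint8_t* src,
                             size_t n, bool replaceExisting)
{
    if (replaceExisting)
    {
        for (size_t i = 0; i < n; ++i)
            dest[i] = src[i];
    }
    else
    {
        for (size_t i = 0; i < n; ++i)
            dest[i] = (uint8_t) ((dest[i] + src[i]) / 2);
    }
}
```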

With that simple modification, I get huge performance improvements, according to my profiler...


Here is a patch for this simple change, I think we could all benefit from this.


I am also investigating some other simple yet effective changes in all that rendering code. I can send you a patch for each small useful modification I find, if you're interested; just let me know!



Thanks! Very surprising that just swapping those conditions around would make much of a difference, but much appreciated! I'll give it a whirl and see what happens.

And sure, small + useful modifications are always welcome!

I saw you've integrated the patch, thanks!

Now my profiler tells me we spend a lot of time in the core PixelXXX::blend() ops. Looking at the generated assembly, I see a lot of move, mask and shift operations. This is the result of all the template stuff in the PixelXXX classes.

I'm sure something can be done to improve the various blend() functions. I'll investigate, but there might be a way of writing a couple of specializations (like PixelARGB::blend(PixelRGB&), then PixelARGB::blend(PixelAlpha&), etc.) that would drastically decrease the number of generated instructions, because it seems to me that a lot of operations can be combined.

The template-based code is a great way to get everything inlined and avoid inheritance and vtables in that area, but a couple of added specializations could improve all that even further.
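As a rough illustration of the kind of specialization I mean (hypothetical standalone types here, not the real PixelARGB internals): when the destination is known to be an opaque RGB pixel, all the destination-alpha reads, updates and clamps can simply be dropped.

```cpp
#include <cstdint>

// Generic over-blend of one 8-bit component; src is assumed premultiplied
// by srcAlpha, as in typical software compositing.
inline uint8_t blendComponent (uint8_t dst, uint8_t src, uint8_t srcAlpha) noexcept
{
    return (uint8_t) (src + ((dst * (255 - srcAlpha)) / 255));
}

struct RGB  { uint8_t r, g, b; };      // opaque destination pixel
struct ARGB { uint8_t a, r, g, b; };   // premultiplied source pixel

// Specialised blend: ARGB source over an opaque RGB destination.
// No destination alpha to read, update or clamp.
inline void blendOver (RGB& dst, ARGB src) noexcept
{
    dst.r = blendComponent (dst.r, src.r, src.a);
    dst.g = blendComponent (dst.g, src.g, src.a);
    dst.b = blendComponent (dst.b, src.b, src.a);
}
```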

By doing so, I think we can target at least a factor of 2 in software pixel blending, which would bring a massive improvement to all your rendering code.

I'll test some things, try a few specializations and see how much performance I can gain. I'll keep you posted.


Also, in our products, we use a lot of images. We try to keep everything as efficient as we can, by making sure that we only translate our images, that we don't duplicate them, etc.

Lately, to get Retina-ready, we made some changes so that we now cache all our images in both low and high resolution. I also wrote some helper code that draws the right image depending on the scale of the display that will contain it (I check the screen coordinates of the centre of the image). This does the trick, but it looks like a complicated way of gathering a simple piece of information.
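For reference, the selection part of that helper boils down to something like this (a self-contained sketch with made-up types; in our real code the scale value comes from querying which display contains the image's centre):

```cpp
// Hypothetical stand-ins for our cached image pair; the real code holds
// actual Image objects at 1x and 2x resolution.
struct CachedImage { int width; /* ...pixel data... */ };

struct ImagePair { CachedImage lowRes, hiRes; };

// Pick the hi-res version on any display scaled beyond 1x (e.g. Retina at 2x).
inline const CachedImage& pickImageForScale (const ImagePair& images,
                                             double displayScale) noexcept
{
    return displayScale > 1.0 ? images.hiRes : images.lowRes;
}
```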

So, do you think you could add an easy way of retrieving the display scale factor of a GraphicsContext? Something like Graphics::getDisplayScaleFactor()... I guess it's just a matter of retrieving the scale of the current low-level context transform, but I might be missing something.


Thank you!


Any optimisations you can contribute are extremely welcome of course!

But... the reason I've not spent a lot of time optimising the software renderer myself is that no matter how much you optimise it, it's not a long-term solution for rendering. Now that we're driving huge high-DPI displays, even if the render functions had 4x their current performance, the CPU just doesn't have the memory bandwidth to scale to that amount of data. The only realistic future for decent 2D rendering is to shift it onto the GPU.

So really, if you want fast rendering and want to help optimise it, the smart place to look would be the GL rendering engine. And ultimately, when OpenCL is pervasive enough, that'd be the ideal platform for a renderer, which would allow entire Path objects and high-level operations to be offloaded entirely onto the GPU.

So, do you think you could add an easy way of retrieving the display scale factor of a GraphicsContext? Something like Graphics::getDisplayScaleFactor()... I guess it's just a matter of retrieving the scale of the current low-level context transform, but I might be missing something.

You can use Graphics::getInternalContext().getPhysicalPixelScaleFactor()

I missed that one! Thanks, it's way better now!


Now, about the rendering performance, two things:

1. First, regarding the "should we use accelerated rendering or not" debate: for now, I would say that the CoreGraphics and Direct2D implementations are either not performant or not robust enough. That leaves us with the OpenGL option. But as plugin developers, our philosophy really is: the simpler and lighter, the better. We use OpenGL in one plugin for which we definitely needed high graphics performance, because of the realtime analysis and the advanced nature of the graphics we needed to display.

But for all our other more 'regular' products, which only display images and draw basic stuff on small portions of the screen, a decent software renderer should be enough. Besides, one OpenGL context per plugin view means one thread per opened GUI. It also means that each of the GL context threads will be synced to the main thread in order to perform the rendering while the MessageManager is locked. Basically, we see performance loss starting from 4 open GUIs, depending on the host, of course. So I would say that Juce's model for handling OpenGL rendering is not well suited to plugin development, because of that strong dependency between the message thread and the paint() callbacks, which could otherwise be handled asynchronously.

So for now, we're kind of happy with the software renderer: it's lightweight, it's completely cross-platform, and it offers decent performance (or did until last summer, which brings me to my next point...).

About OpenCL and other promising future stuff, we would of course be glad to see Juce using those, especially for hi-res display handling, as you mention.


2. So then, I spent a few hours today analyzing and tracking the software renderer's performance, and I found the main bottleneck. First, I found a couple of small improvements to make to the RenderingHelpers stuff; I'll clean them up and send you a patch. Basically, it's just a few refactors of some core functions (loops over lines/pixels, etc.) that allow the compiler to generate better and shorter assembly code.

But then I also found something else in the PixelFormats code. Last summer, after updating Juce at some point, we had to deal with a big performance loss. In Debug, some of our products that had worked fine just before the update completely failed to run properly with the new Juce version; the message thread was just stuck handling endless paints. To address that, we reworked and optimized all our drawing code to be as efficient as possible.

And today, on top of that, we are facing some issues with Retina displays. So I ran some tests, and I found the cause of that performance loss. Here it is:

Basically, you made a fix to add clamping inside each Pixel::blend() call. The thing is, in my basic tests with and without that changeset (just calling blend() 100,000,000 times and seeing how long it takes), the unclamped version of PixelARGB::blend(PixelARGB&) is approximately 2.5x faster than the newest version.

As this single operation is basically the most frequently performed operation in any piece of drawing, we can reasonably say that after that commit, the software renderer was roughly 2.5x slower than before...

I completely understand why you needed to add that additional clamping, but I'm sure we can find a faster way of dealing with it.

I will test a few things and see how it goes. I'll let you know about my findings, but at first sight, if something needs to be done, it's here...

The unclamped version is about 10-15 instructions, the clamped version about 40-50. I'm sure there's something in between...
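To make the discussion concrete, here is a paraphrased sketch of the lane-based blend in question (an illustration in the same spirit, not the exact Juce source): R and B live in the 16-bit lanes of one word, A and G in another, so two components are blended per operation, and the clamp is the extra per-pixel saturation step at the end.

```cpp
#include <cstdint>

inline uint32_t evenBytes (uint32_t argb) noexcept { return argb & 0x00ff00ff; }         // 0x00RR00BB
inline uint32_t oddBytes  (uint32_t argb) noexcept { return (argb >> 8) & 0x00ff00ff; }  // 0x00AA00GG

// Divides each 16-bit lane by 256 (also exposes a lane's overflow bit).
inline uint32_t divideBy256PerLane (uint32_t x) noexcept { return (x >> 8) & 0x00ff00ff; }

// Saturates each 16-bit lane to 0xff: an overflowed lane has its bit 8 set,
// which turns the subtraction result into 0x00ff, forcing all low bits on.
inline uint32_t clampPerLane (uint32_t x) noexcept
{
    return (x | (0x01000100 - divideBy256PerLane (x))) & 0x00ff00ff;
}

// Premultiplied src over dst, both packed as 0xAARRGGBB.
inline uint32_t blendARGB (uint32_t dst, uint32_t src) noexcept
{
    const uint32_t invAlpha = 0x100 - (src >> 24);  // 1..256
    uint32_t rb = evenBytes (src) + divideBy256PerLane (evenBytes (dst) * invAlpha);
    uint32_t ag = oddBytes  (src) + divideBy256PerLane (oddBytes  (dst) * invAlpha);
    return clampPerLane (rb) | (clampPerLane (ag) << 8);   // the clamp is the added cost
}
```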

I'll keep you posted.



EDIT: I've attached some basic test code, you can run it here: . It's a very dumb test, but it shows the point...

Well, it's very shaky to base your numbers on that website, where the code runs without any optimisation, and on a test program which repeatedly uses the same memory location rather than a large block of data... But yes, the clamping will clearly cause some kind of performance hit.

Unfortunately I can't see any easy way of removing it without allowing overflows.. There are probably some SSE instructions that would do the job even faster than the original code, but that'd take a bit of research to figure out.

I attached this example just to prove the point and to isolate that specific piece of code. I get the same kind of results in our own environment, when profiling Juce's actual code in a real setting and looking at the generated assembly. The example is just an illustration of the problem.

I'll get back to you if I find anything that could improve the blend operations. I had a few ideas I haven't had the time to implement, test and measure yet.


You could use lookups to speed up the process; the clamp would then already be baked into the lookup.
And indeed, having a few specializations for the additive or alpha blends does make a lot of sense.
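A minimal sketch of that lookup idea (hypothetical code, just to illustrate): since the sum of two 8-bit components never exceeds 510, a 511-entry table precomputed once at startup turns the per-pixel clamp into a single table read.

```cpp
#include <cstdint>

// Precomputes min(i, 255) for every value an 8-bit component sum can take.
struct ClampTable
{
    uint8_t table[511];

    ClampTable() noexcept
    {
        for (int i = 0; i < 511; ++i)
            table[i] = (uint8_t) (i < 255 ? i : 255);
    }

    uint8_t operator[] (uint32_t componentSum) const noexcept
    {
        return table[componentSum];
    }
};

static const ClampTable clampLookup;  // built once, before any blending
```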


This works in most apps that don't blend images over things. Give it a try:


inline uint32 clampPixelComponents (uint32 x) noexcept
{
    //return (x | (0x01000100 - maskPixelComponents (x))) & 0x00ff00ff;
    // XXX when we know we don't need clamping!
    return x;
}