Software Renderer needs optimized classes


#1

When iterating over pixels in the source and/or destination image, the software renderer increments by a variable. It does the equivalent of

dest += image.pixelStride;

This has turned out to cause a performance hit, because previously it would increment by a constant. For example:

PixelRGB* dest;
//...
dest++;

What we need is to take classes like EdgeTableFillers::SolidColour and make specialized versions of those classes which are specific to a given pixelStride (3 or 4). This can be done with inheritance to minimize duplicated code.

In order to call these new classes we would want to introduce a conditional at the higher level:

template <class Iterator, class DestPixelType>
void renderSolidFill (Iterator& iter, const Image::BitmapData& destData, const PixelARGB& fillColour, const bool replaceContents, DestPixelType*)
{
    if (replaceContents)
    {
        if (destData.pixelStride == 3)
        {
            EdgeTableFillers::SolidColour3 <DestPixelType, true> r (destData, fillColour);
            iter.iterate (r);
        }
        else if (destData.pixelStride == 4)
        {
            EdgeTableFillers::SolidColour4 <DestPixelType, true> r (destData, fillColour);
            iter.iterate (r);
        }
        else
        {
            EdgeTableFillers::SolidColour <DestPixelType, true> r (destData, fillColour);
            iter.iterate (r);
        }
    }
    //...

#2

You know, I’m really struggling to see why that would make such a big difference… Adding a constant or a variable should be almost identical in performance if the compiler keeps the stride in a handy register, so all I can think is that the compiler must be failing to optimise properly because it’s being asked to fetch the pixelStride value from a data structure. That should be easily avoidable by using a local copy of the stride for the duration of the loop.

To test this out, I’ve just checked in a quick clean-up… All the line-rendering loops actually followed the same pattern, so I’ve DRYed them with a macro, and made sure that they always hold the stride in a local variable. I’d be interested to know what difference this makes. (Doing it this way also means that instead of only focusing on the function that zamrate is obsessing over, it should help all the other code-paths that use the same kind of loop too.)


#3

Intel instruction set has these three types of adds:

immediate: add a small integer constant to a register
direct: add one integer register to another integer register
indirect: add an integer at a memory location (pointed to by an address register) to another integer register

The fastest is of course immediate, while the slowest is indirect.

Visual Studio does not perform the optimization; instead it falls back to indirect mode (I tried it with all possible optimization settings). Using a local variable causes Visual Studio to emit the instruction in direct mode, which is still not as fast as immediate.

Here’s Visual Studio optimized assembly output for the three cases. We can see three functions:

fill1() : indirect increment mode (slowest)
fill2() : immediate increment mode (fastest)
fill3() : direct increment mode (middle of the road) <-- this is what you get when you use a cached local.

[attachment=0]AsmTest.png[/attachment]

TL;DR version:

fill1() : add eax, dword ptr [esi+4]; // What JUCE uses now, slowest
fill2() : add ecx, 4; // Fastest, using a constant
fill3() : add edx, esi; // What you get with a local variable

There is really no substitute for profiling and analyzing the assembly language output. As I stated earlier, to fix this it will be necessary to hand-roll fill and copy functions for each typical value of pixelStride (3 and 4); see my previous code snippet. The reason I did not take the time to do this myself is that I know how picky Jules is, and this especially requires some care because it’s loaded with templates.

Jules, I can offer to fix this, but it will look like the code snippet in my earlier post. That is, there will be a conditional at the high level to check for pixelStride == 3 or 4, dispatching to a separate set of hand-rolled classes so that we get the best assembly output. But I need to know that you’re on board with this kind of fix.


#4

[quote]fill1() : add eax, dword ptr [esi+4]; // What JUCE uses now, slowest
fill2() : add ecx, 4; // Fastest, using a constant
fill3() : add edx, esi; // What you get with a local variable[/quote]

Indeed - and obviously the top one would be slow, because it needs to access memory each time around the loop, which will interfere with the pre-fetching and pipelining etc, but I’d be very surprised if the other two are significantly different in this context, where the real bottleneck shouldn’t be the addition operation itself, but the memory accesses being done (and whatever copy or blend operation is involved).

My gut feeling is that either of the last two versions would end up being pipelined by the CPU and would have equivalent performance. I could be wrong about that, but would want to see some hard figures before I’m convinced otherwise!

Anyway, that’s why I made my last changes, and I’d be keen to hear what difference they made, because it may have actually sorted it all out.

Of course! But check out my changes, because even if it still needs further fine-tuning, the macro I added would be a good place to do that kind of thing.


#5

Did you even bother checking out the changes I posted here http://rawmaterialsoftware.com/viewtopic.php?f=2&t=10526&start=30 (full modified latest JUCE tip, at the end of that thread) and looking at the rendering times in the pre-compiled binaries for OSX and Windows?


#6

This is getting ridiculous. I checked Jules’ new updates. They are WAY behind the optimizations I proposed. Please take a closer look at what I did.

Rendering performance (JUCE DEMO, Jules’ latest tip vs. my modified JUCE version, as posted here http://rawmaterialsoftware.com/viewtopic.php?f=2&t=10526&start=30, VC2008 Release build, 32-bit graphics card):

  • Paths - Linear gradient: 1.25 ms (Jules) vs. 0.90 ms (me)
  • Paths - Solid: 1.10 ms (Jules) vs. 0.75 ms (me)
  • Images RGB: 1.50 ms (Jules) vs. 0.60 ms (me)
  • Images ARGB: 0.80 ms (Jules) vs. 0.50 ms (me)
  • Tiled Images RGB: 1.50 ms (Jules) vs. 1.00 ms (me)

#7

No, sorry, but I stopped reading your post at the point where you say you’ve added new pixel formats.

Vinnie’s got the right idea here - there are a couple of hot paths in RenderingHelpers that need a bit of special-case optimisation. That’s all. Not a big deal. No need to touch any code outside of the renderer. Not sure if you’re English or not, but we have a phrase “getting your knickers in a twist” which describes the increasingly crazed tone of your posts… Relax!

But are they better than the previous code? I’m doing this a step at a time - first, I refactored the code so that these problematic loops are localised and easier to change. Then, I optimised it to keep the increment in a local variable. I’m interested in how much that helped, but that’s not the end of the process. It’s very easy to now make it specialise the loops for 3 or 4 byte increments, which I would have done by now if I hadn’t been replying to your freak-out post.


#8

Are you surprised? First you say there is no performance problem; then, after I spent a good couple of hours investigating why there is one, and over 10 hours of my weekend (cleanly) modifying JUCE so that it runs more than a hundred percent faster in some cases, you don’t even have a look at it, and finally you write “Anyway, that’s why I made my last changes, and I’d be keen to hear what difference they made, because it may have actually sorted it all out.” Why don’t you just test against the binaries I pre-built especially for you?
Not even having tested my code is what I mainly criticise here; it doesn’t take more than a few minutes to check it out, while it took me an entire Saturday to make the changes. It doesn’t cost you anything, you’re getting it for free, and still you’re dismissing it without having had a look at it.


#9

There’s no point in me looking at it, because if your changes involve anything beyond the RenderingHelpers code, then I’m not going to use them!


#10

Well, even if I understand and side with you, jules, regarding the integrity of your code and the principle that the less code changed, the better, this last sentence sounds a little bit despotic and insensitive to other people’s work and efforts.

Even if you already know that zamrate’s code will not make it into the branch, having a look at it may yield some good ideas for this fix, or even turn up something that happens to be useful in the future.


#11

Then just change your code to always use pixelStride = 4 for RGB instead of pixelStride = 3, and check out my changes in the RenderingHelpers code too (especially getting rid of the % operator for each pixel in the tiles code, and the --width you use nearly everywhere).


#12

No, that wouldn’t be right. First of all, it would break WindowsBitmapImage for non-32-bit video cards, and second, it would break the Image API.

What you are suggesting is to remove the pixelStride field from Image::BitmapData, that would break a great many things (including a bunch of code I wrote).


#13

I never suggested removing pixelStride!? In fact my proposed code, as downloadable, wouldn’t break anything at all. It’s 100% backwards compatible (it even lets Image::RGB still be RGB24). You know, I really don’t wanna argue about this anymore, but I guess it’s not a problem to BitBlt a 32-bit RGB image to a 24-bit graphics card (even if the conversion from 32 to 24 bits costs a little in the BitBlt itself, all the rendering in JUCE would be faster at 32 bits for both RGB and ARGB). Furthermore, even if there were a decrease in performance in the BitBlt or StretchBlt itself because of the conversion, this would only affect a very small percentage of users, as 24-bit graphics cards are rare these days. Right now, JUCE is internally using 24-bit RGB by default, which is weird enough on its own!

TBH, Vinn: I don’t care about how Jules fixes this, but it should be fixed. I’ve done enough for my part, and the code I proposed doesn’t break anything at all; if you’re not convinced, please try it. Now, if you think it’s not good enough or not properly done, come up with something better, just make it as performant. The latest version of JUCE is still 200% slower in some cases.


#14

Except that it isn’t, as I have told you several times. FFS, have you looked at juce_win32_Windowing.cpp? I guess I have to quote it for you, since you haven’t looked:

juce_win32_Windowing.cpp

pixelStride = (alwaysUse32Bits || format == Image::ARGB) ? 4 : 3;

JUCE uses 32-bit RGB Images internally to render to the screen (via Component::paint). Why is this so difficult to understand?


#15

You don’t have to tell me this all the time. I know that it uses a 32-bit rendering surface when there’s a 32-bit graphics card.
But when you create an RGB image and do a drawImageAt() to that very 32-bit surface, then it is slow! Because you’re NOT drawing a 24-bit image to a 24-bit surface!
That image will be 24-bit! That’s what I mean.


#16

[quote=“zamrate”]You don’t have to tell me this all the time. I know that it uses a 32-bit rendering surface when there’s a 32-bit graphics card.
But when you create an RGB image and do a drawImageAt() to that very 32-bit surface, then it is slow! Because you’re NOT drawing a 24-bit image to a 24-bit surface!
That image will be 24-bit! That’s what I mean.[/quote]

Okay, this is legit. Still, the fix is much simpler than what you have done by introducing a new image type. I’ve outlined the steps for making this happen, and I think Jules would be okay with it:

http://www.rawmaterialsoftware.com/viewtopic.php?f=2&t=10542

This to me makes a lot more sense than creating a new image type.