Massive performance decrease in Software Rendering


#1

from JUCE_VERSION 0x20008 to 0x2001c (which is the tip), I have a massive performance decrease in my rendering routine. It goes from 160fps down to 60fps. Has anything been changed in the Renderer? Where could this enormous drop come from?

I didn’t profile yet. Already worked 10h to move my project to the latest JUCE.


#2

Here JUCE old vs JUCE new profiles, which I did in Debug mode to get the symbols, but it’s more or less the same ratio between new and old in debug mode and in release mode.


#3

I should add that I’m exclusively drawing images.


#4

I thought you said you’d been using your own renderer?

FWIW I don’t remember changing anything significant, and haven’t seen any performance problems myself.


#5

I’m only using my own renderer which inherits from the JUCE software renderer, for adding certain functions. This is definitely JUCE stuff, as you can see in the pics I sent.


#6

According to your stats, the render functions are using about 2% of your CPU. Hardly what I’d call a problem.

And it’s meaningless to post it with the debug symbols in there. Many of those functions will be inlined so they’ll skew the results. Try it with a release build and you might get something more useful.


#7

I just did a test with Graphics::fillAll() and it is 3x slower in the new JUCE.

Tracing the call stack to the bottom, I have:

template <class Pixel> forcedinline void set (const Pixel& src) noexcept { b = src.getBlue(); g = src.getGreen(); r = src.getRed(); }

whereas in the old JUCE this was:

[code]/** Copies another pixel colour over this one.

    This doesn't blend it - this colour is simply replaced by the other one.
*/
template <class Pixel>
forcedinline void set (const Pixel& src) noexcept
{
    argb = src.getARGB();
}[/code]

Now, I don’t know why the call stack differs, … It seems that in one JUCE version a ARGB version is used where in the other a RGB version is used.


#8

I’ve now changed, in juce_win32_Windowing.cpp,

to

to force your code into ARGB usage. This helps already, but filling an entire screen with the new JUCE goes at max 90fps, whereas with the old JUCE it was over 400fps (I’m just reading the time before g.fillAll(Colours::black) and afterwards).

Here are the 2 call stacks, to compare. They look pretty much similar. It just seems that your recent code is more high-level C++ - could this be the cause for the slow rendering?


#9

Ah, seems like I found something that makes a difference of +/-500%:

(FYI, I’m compiling with VS2008 in Release build with optimizations turned on).

In juce_RenderingHelpers.h

changing this:

[code]forcedinline void replaceLine (PixelARGB* dest, const PixelARGB& colour, int width) const noexcept
{
do
{
dest->set (colour);
incDestPixelPointer (dest);

        } while (--width > 0);
    }[/code]

to this:

[code]forcedinline void replaceLine (PixelARGB* dest, const PixelARGB& colour, int width) const noexcept
{
do
{
dest->set (colour);
dest++;

        } while (--width > 0);
    }[/code]

will increase the performance from 90fps to 450fps. No wonder, this is me basically changing back the code to the old status where it wasn’t using incDestPixelPointer ().


#10

And this will increase the performance to nearly 600 fps:

forcedinline void replaceLine (PixelARGB* dest, const PixelARGB& colour, int width) const noexcept { uint32 *p=(uint32*)dest; uint32 *p2=p+width; uint32 val=colour.argb; // <- had to make PixelARGB.argb public for this to work while (p<p2) { *p=val; p++; } }


#11

Great work, thanks! Please, Jules, listen to these kinds of reports… I know that 2% CPU seems insignificant, but on iOS for instance I constantly battle with sluggish animations/redrawings using JUCE. Not dramatically sluggish, but certainly not up to par with the usual iOS standard.


#12

This is going to break stuff, because if we just use “dest++” then we are not respecting the pixelStride field of the underlying Image object.

I agree that performance is a problem but putting it back to “dest++” is not the fix. Instead, you need to go to a higher level function on the call stack and compare pixelStride with sizeof (PixeARGB) and if they are equal, then call the optimized version that uses dest++, else call the generic version that uses incDestPixelPointer().


#13

Up to Jules to find a fix. It’s not my work to fix it and it’s not my work either to spend hours to find problems in JUCE, report them and then even get slightly offensive answers like “I thought you said you’d been using your own renderer? FWIW I don’t remember changing anything significant, and haven’t seen any performance problems myself.” . I’m doing my best to make JUCE a better framework, just as everybody in here.

Let me finally add that the initial problem, namely filling a RGB image with a colour (or was it copying from one image to another?), ended up in these lines:

template <class Pixel> forcedinline void set (const Pixel& src) noexcept { b = src.getBlue(); g = src.getGreen(); r = src.getRed(); }
Now, IMHO it is conceptually impossible for this code not to be slow. We have 3 copy operations here instead of one. If both pixelstride are the same (which should be the case for software ARGB and RGB images), it should be possible to just copy from one image to the other in the ARGB to RGB case. If byte order is an issue, software images should have the same order as the bitmap needed for the BitBlt, etc… (but I suppose this is already the case).

And one final word about the 2% - I have a quadcore i7 (8 CPUs shown in task manager). Now as this is running one 1 core, it’s already 16% on one processor in the task manager, since the rendering happens in 1 thread. Now, try the same code not with an i7, but with a P4, and then you really notice how much this is.


#14

The problem is that the image rendering code is making assumptions on sizeof (PixelRGB), sizeof (PixelARGB), and the corresponding values of pixelStride. It needs to be looked at function by function, from bottom to top and rewritten correctly respecting pixelStride.

Then, we need to go through these routines one by one again and hand-roll template specializations for the magic values of pixelStride: 3 byte RGB, 4 byte RGB, and 4 byte ARGB. Once we have those hand-written specializations we can put a conditional test at the high level to dispatch to the appropriate routine based on the Image parameters. It’s not a huge undertaking but it will require some time and thoughtfulness. I don’t think Jules was trying to be snappy I think this is just a knee jerk reaction on account of a heavy workload.

Would you be interested in working with me to make the necessary changes to respect pixelStride and optimize the specializations? Jules would you be okay overseeing these changes?


#15

[quote=“zamrate”]Let me finally add that the initial problem, namely filling a RGB image with a colour (or was it copying from one image to another?), ended up in these lines:

template <class Pixel> forcedinline void set (const Pixel& src) noexcept { b = src.getBlue(); g = src.getGreen(); r = src.getRed(); }[/quote]

If its going through here then I believe it is the case for filling an RGB image with a colour that is not gray (all three components equal). When JUCE copies an image at 100% opacity it usually does it line by line with memcpy, for speed (well, that’s what it did last time I looked).

For the case of filling an RGB image with a solid colour that is not gray, it can certainly be done much faster for the case where pixelStride = 3 or pixelStride = 4.

For pixelStride = 3 we need to prepare these 32-bit unsigned values:

BGRB GRBG RBGR …

The innermost loop will copy 3 longwords at a time. If the total number of pixels is below some small threshold (say, 4, 8, or 12 pixels) then we can just do it the long way. This will prevent us from running slower on thin rectangles from the setup overhead.

For RGB or ARGB where pixelStride==4 it is simply

BGRA …

Where A is the alpha for ARGB or simply ignored for RGB. This can be done with 32-bit unsigned copies, 64-bit unsigned copies where available, and also using whatever assembly facilities are available (MMX or the equivalent?)

For all other values of pixelStride we would of course need to use the “slow” routine which you posted.


#16

But Vinn, you remember that WindowsBitmapImage is now almost always using pixelstride=4, no matter if ARGB or RGB (because the BitBlt is much faster with pixelstride=4 on 32Bit RGB gfx cards).
So even if one would have a fully optimized code for filling 24Bit RGB images, you’d still end up slowly copying the 24 Bit RGB Image back again to the 32Bit RGB final image. Why not simply use 32Bit all the time (if possible), for both RGB and ARGB, because this makes things easier and faster.
And if one really needs and RGB Image with pixelstride=3, there should be an option to create one, but it shouldn’t be the default IMHO.


#17

[quote=“zamrate”]But Vinn, you remember that WindowsBitmapImage is now almost always using pixelstride=4, no matter if ARGB or RGB (because the BitBlt is much faster with pixelstride=4 on 32Bit RGB gfx cards).
So even if one would have a fully optimized code for filling 24Bit RGB images, you’d still end up slowly copying the 24 Bit RGB Image back again to the 32Bit RGB final image. Why not simply use 32Bit all the time (if possible), for both RGB and ARGB, because this makes things easier and faster.[/quote]

Yes I agree but the code needs to be updated to respect pixelStride and also have optimized specializations.

Its not the default - the problem is that some cruft has accumulated in that code and it needs a good rooting-out.


#18

So did you want to work with me to try to fix this?


#19

Really? When I look at this constructor, I don’t see any option to make it other than pixelstride=3 for RGB software images, though.

SoftwarePixelData (const Image::PixelFormat format_, const int w, const int h, const bool clearImage) : ImagePixelData (format_, w, h), pixelStride (format_ == Image::RGB ? 3 : ((format_ == Image::ARGB) ? 4 : 1)), lineStride ((pixelStride * jmax (1, w) + 3) & ~3) { imageData.allocate ((size_t) (lineStride * jmax (1, h)), clearImage); }

Was this question targeted to my humble self or to Jules? It it was me, I’d do both ARGB and RGB pixelStride=4, modifiy the code for only using pixelstride=4 and if one needs a RGB24 Image, there’s code to convert from RGB32 to it.


#20

True that when JUCE creates images that pixelStride is locked to 3 for RGB. But remember that you can make your own pixel data and use that (I do this in LayerEffects). It is unfortunate that the JUCE Image constructor doesn’t allow for specification of pixelStride, because otherwise I would use it to allocate RGB with pixelStride == 4. Note that even Jules has to resort to using a separate allocator in the Windows implementation so that he can product a flat RGB image with pixelStride == 4 (juce_win32_Windowiung.cpp).

[/quote]If it was me, I’d do both ARGB and RGB pixelStride=4, modifiy the code for only using pixelstride=4 and if one needs a RGB24 Image, there’s code to convert from RGB32 to it.[/quote]

That’s not really an acceptable solution without rewriting a lot of the imaging portions of JUCE, because the JUCE API allows for an arbitrary pixelStride and what you are suggesting would break that.

You of course, since Jules is busy. What we need to do is add a conditional at a higher level and hand-tune the cases for known values of pixelStride (3 and 4), while leaving the generic slow path for everything else.