Optimizing Juce LowLevelSoftwareRenderer

Jules, I’ve reached the point where the drawing speed is slow. I have most of my controls implemented. Resizing the window is painfully slow and I get buffer underruns in the audio thread when my window is maximized.

My question: if I hire a company that specializes in optimization and send Juce over to them to speed up LowLevelGraphicsSoftwareRenderer (this would include SSE and any other tricks I can squeeze out of each platform), are you open to picking up the changes? Of course I will pay for it.

If I remember correctly, Jules is working on a Direct2D version of Juce.

My intuition tells me that is not the way to go. Don’t get me wrong, it’s good to have as another choice. But I think that the LowLevelGraphicsSoftwareRenderer can be improved dramatically.

First of all, in my old home-brew framework (which I modeled closely after Juce) I was rendering the same controls several times faster than what I’m getting from Juce. So I know for sure that the Juce renderer can be improved.

Second, I’m interested in repainting components using multiple threads in parallel. Specifically, for rendering a given rectangle, divide the rectangle into N horizontal bands and draw them in parallel using a thread pool.
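To make the band idea concrete, here is a minimal sketch in plain C++ with no Juce types. `Band`, `splitIntoBands` and `paintInParallel` are names made up for illustration, and it spawns one `std::thread` per band purely for brevity; a real version would hand the bands to a reusable thread pool.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// A half-open range of rows [top, bottom) within the area to repaint.
struct Band
{
    int top;
    int bottom;
};

// Splits `height` rows starting at row `y` into at most `numBands` bands of
// near-equal size. The bands are contiguous and never overlap.
std::vector<Band> splitIntoBands (int y, int height, int numBands)
{
    assert (height >= 0 && numBands > 0);

    std::vector<Band> bands;
    const int count = std::min (numBands, height);

    for (int i = 0; i < count; ++i)
    {
        const int top    = y + (height * i) / count;
        const int bottom = y + (height * (i + 1)) / count;
        bands.push_back ({ top, bottom });
    }

    return bands;
}

// Runs `paintBand` once per band, each call on its own thread, and waits for
// all of them to finish. The callback must only touch rows inside its band.
void paintInParallel (const std::vector<Band>& bands,
                      const std::function<void (const Band&)>& paintBand)
{
    std::vector<std::thread> workers;

    for (const Band& band : bands)
        workers.emplace_back ([&paintBand, band] { paintBand (band); });

    for (std::thread& worker : workers)
        worker.join();
}
```

Because each band owns a disjoint set of rows, the workers need no locking for the pixel data itself; anything shared between them (caches, for example) still does.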

Third, I want to fully exploit processor extensions such as SIMD / SSE.
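As a small illustration of what that looks like in practice, the sketch below fills a span of 32-bit pixels four at a time with SSE2 stores and falls back to a scalar loop elsewhere. `fillSpan` and `SKETCH_USE_SSE2` are hypothetical names, not part of Juce, and a solid fill is only the simplest possible case; blends and gradients are where the real gains would have to come from.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined (__SSE2__) || defined (_M_X64) || (defined (_M_IX86_FP) && _M_IX86_FP >= 2)
 #include <emmintrin.h>
 #define SKETCH_USE_SSE2 1
#else
 #define SKETCH_USE_SSE2 0
#endif

// Fills `count` 32-bit pixels starting at `dest` with a single colour.
void fillSpan (std::uint32_t* dest, std::size_t count, std::uint32_t colour)
{
    assert (dest != nullptr || count == 0);

    std::size_t i = 0;

#if SKETCH_USE_SSE2
    // Broadcast the colour into all four lanes of a 128-bit register.
    const __m128i packed = _mm_set1_epi32 (static_cast<int> (colour));

    // Unaligned stores, so the span may start at any pixel.
    for (; i + 4 <= count; i += 4)
        _mm_storeu_si128 (reinterpret_cast<__m128i*> (dest + i), packed);
#endif

    // Scalar tail (or the whole span when SSE2 is unavailable).
    for (; i < count; ++i)
        dest[i] = colour;
}
```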

It will always be possible to get better optimizations using problem-specific information (i.e. type of things being rendered) rather than a general approach - Direct2D is a general approach.

The Visual Studio 2008 profiler is completely broken for me under both Windows 7 32-bit and Windows 7 64-bit, so I am in the process of trying to get Intel C++ Composer / VTune / Parallel Studio up and running so I can have some concrete results.

I will publish any optimized subclasses of Graphics / etc… under the MIT license so everyone can benefit.

I just would like a commitment from Jules that if I spend the money, he will adjust the API for the Graphics related classes for me so I can drop in optimized replacements without patching Juce.

I couldn’t disagree more. The future is not going to involve much software rendering, it’s going to all be done with GPUs, and the only way to take advantage of that is with OS-specific rendering engines like CoreGraphics, Direct2D, OpenGL, etc.

A software renderer is great as a fallback, but the one I’ve got at the moment has exactly the attributes that I want: it’s portable, elegant, maintainable, and fast enough. I’ve no interest whatsoever in bloating it out with reams of unintelligible assembly or intrinsics just to gain a few percentage points in speed.

I’m not giving any commitments! But the rendering platform is already completely virtualised - new engines can be plugged in without affecting any existing code, so there should be nothing to stop you writing a new engine in parallel to what’s already there, and implementing it in whatever way you want.

Well ComponentPeer doesn’t have a way to override which low level renderer it uses… and there isn’t enough of the implementation exposed in order to subclass it without duplicating everything.

Preliminary results from the VTune profiler are showing that vertical gradients are consuming most of the runtime.

Example:

    void Win32ComponentPeer::handlePaintMessage()
    {
        // ...
        LowLevelGraphicsSoftwareRenderer context (offscreenImage, -x, -y, contextClip);
        handlePaint (context);
        // ...
    }

I would like context to come from a virtual function call (e.g. createContextForPaint() or something) that I could replace, although it’s not obvious how to do that since subclassing the Win32ComponentPeer is not an option.

Perhaps something like

    LowLevelGraphicsSoftwareRenderer* LookAndFeel::createRendererForComponentPeer (ComponentPeer* peer);

In order for this to be useful, LowLevelGraphicsSoftwareRenderer implementation would need to be exposed (thinking of the stuff in namespace SoftwareRendererClasses and LowLevelGraphicsSoftwareRenderer::SavedState where most of the work is done), so a subclass can customize just a little bit of it instead of having to replace the entire implementation.
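A rough sketch of the shape I have in mind, using stand-in classes rather than the real Juce ones. Everything here (`SoftwareRenderer`, `RendererFactory`, `createRendererForPeer` and so on) is hypothetical and only illustrates the pattern: the peer asks a virtual factory for its context, and a subclass overrides one operation while inheriting the rest.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-in for the stock software renderer, with its drawing
// operations exposed as virtual functions. Strings stand in for real work.
class SoftwareRenderer
{
public:
    virtual ~SoftwareRenderer() = default;

    virtual std::string fillVerticalGradient() { return "stock vertical gradient"; }
    virtual std::string fillOtherGradient()    { return "stock gradient"; }
};

// Client-side subclass: replaces one operation, inherits everything else.
class FastVerticalGradientRenderer : public SoftwareRenderer
{
public:
    std::string fillVerticalGradient() override { return "optimized vertical gradient"; }
};

// Hypothetical hook that the peer would call instead of constructing the
// renderer directly on the stack.
class RendererFactory
{
public:
    virtual ~RendererFactory() = default;

    virtual std::unique_ptr<SoftwareRenderer> createRendererForPeer()
    {
        return std::make_unique<SoftwareRenderer>();
    }
};

class FastRendererFactory : public RendererFactory
{
public:
    std::unique_ptr<SoftwareRenderer> createRendererForPeer() override
    {
        return std::make_unique<FastVerticalGradientRenderer>();
    }
};

// Stand-in for the peer's paint handler: it only ever sees the base class.
std::string handlePaint (RendererFactory& factory)
{
    std::unique_ptr<SoftwareRenderer> context = factory.createRendererForPeer();
    assert (context != nullptr);

    return context->fillVerticalGradient() + ", " + context->fillOtherGradient();
}
```

With the stock factory both operations come from the base class; with `FastRendererFactory` only the vertical gradient is replaced and the other call still falls through to the stock code.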

In my case I specifically want to address vertical gradients, and just those (I think). It would be nice if I could do this without changing Juce and yet handle all the clipping cases (no clip, RectangleList clip, EdgeTable clip, Image Alpha clip), while being able to fall back on Juce implementation for the cases I don’t care about.
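The reason vertical gradients look like an easy win: the colour depends only on y, so it can be computed once per row and the row then filled as a solid span. Below is a minimal sketch of that fast path for the unclipped, opaque case only. `lerpArgb` and `fillVerticalGradient` are illustrative names, not Juce functions, and the clipped cases are exactly what I would want to leave to the stock renderer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Linearly interpolates each 8-bit channel of two packed 32-bit colours.
// t runs from 0 (all `top`) to tMax (all `bottom`); tMax must be positive.
std::uint32_t lerpArgb (std::uint32_t top, std::uint32_t bottom, int t, int tMax)
{
    assert (tMax > 0 && t >= 0 && t <= tMax);

    const std::uint32_t wTop    = static_cast<std::uint32_t> (tMax - t);
    const std::uint32_t wBottom = static_cast<std::uint32_t> (t);
    std::uint32_t result = 0;

    for (int shift = 0; shift < 32; shift += 8)
    {
        const std::uint32_t a = (top >> shift) & 0xffu;
        const std::uint32_t b = (bottom >> shift) & 0xffu;
        result |= ((a * wTop + b * wBottom) / static_cast<std::uint32_t> (tMax)) << shift;
    }

    return result;
}

// Fills a width x height image (row-major, no padding) with a gradient that
// runs from `top` on the first row to `bottom` on the last row.
void fillVerticalGradient (std::vector<std::uint32_t>& pixels, int width, int height,
                           std::uint32_t top, std::uint32_t bottom)
{
    assert (width >= 0 && height >= 0);
    assert (pixels.size() == static_cast<std::size_t> (width) * static_cast<std::size_t> (height));

    const int tMax = std::max (1, height - 1);

    for (int y = 0; y < height; ++y)
    {
        // One interpolation per row instead of one per pixel.
        const std::uint32_t rowColour = lerpArgb (top, bottom, y, tMax);
        std::fill_n (pixels.begin() + static_cast<std::ptrdiff_t> (y) * width, width, rowColour);
    }
}
```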

The other thing is to divide the area requiring update into N horizontal rectangles and paint them in parallel using an individual LowLevelGraphicsContext for each one. Obviously there are some locking issues with that (cached glyphs come to mind). I wish there was enough virtual function / access qualifiers / customization in Juce to let me do this, entirely in my client code of course since I know you don’t want that in the library.
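On the locking point, the simplest thing that could work for a glyph cache shared between band threads is a single mutex around lookup and insertion, sketched below. `GlyphBitmap`, `SharedGlyphCache` and `requestFromThreads` are hypothetical stand-ins; I am not claiming this is how Juce's glyph cache is structured.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a rasterised glyph.
struct GlyphBitmap
{
    int glyphCode;
    std::vector<unsigned char> coverage;
};

class SharedGlyphCache
{
public:
    // Returns the cached glyph, rasterising it on first request. Entries are
    // never modified or erased after insertion, and std::map keeps element
    // references stable, so the result may be read after the lock is released.
    const GlyphBitmap& get (int glyphCode)
    {
        assert (glyphCode >= 0);
        std::lock_guard<std::mutex> lock (mutex);

        auto found = glyphs.find (glyphCode);

        if (found == glyphs.end())
        {
            ++rasteriseCount;   // placeholder for the expensive rasterisation
            found = glyphs.emplace (glyphCode,
                                    GlyphBitmap { glyphCode, std::vector<unsigned char> (64, 255) }).first;
        }

        return found->second;
    }

    int getRasteriseCount()
    {
        std::lock_guard<std::mutex> lock (mutex);
        return rasteriseCount;
    }

private:
    std::mutex mutex;
    std::map<int, GlyphBitmap> glyphs;
    int rasteriseCount = 0;
};

// Has `numThreads` threads each request glyph codes 0 .. numGlyphs - 1.
void requestFromThreads (SharedGlyphCache& cache, int numThreads, int numGlyphs)
{
    std::vector<std::thread> workers;

    for (int t = 0; t < numThreads; ++t)
        workers.emplace_back ([&cache, numGlyphs]
        {
            for (int code = 0; code < numGlyphs; ++code)
                cache.get (code);
        });

    for (std::thread& worker : workers)
        worker.join();
}
```

A single lock like this serialises every glyph lookup, so it could easily eat into the gain from painting bands in parallel; per-thread caches would avoid that at the cost of rasterising some glyphs more than once.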

Yeah, I agree fully. That’s why the ideal solution is one where I can subclass / override Juce behavior with my own external files but still leverage most of the existing LowLevelGraphicsSoftwareRenderer for the parts that I don’t need to optimize. Right now you have to replace the WHOLE thing, and there’s no hook for doing that in the ComponentPeer.

I got a very noticeable performance increase in LowLevelGraphicsSoftwareRenderer just from recompiling with the Intel C++ Compiler XE, measured while redrawing my entire window during a resize operation.

Still having gradient fills take up 35% of the runtime:

[Attachment: vtune.png]

I optimized everything I could: cut down on drawing, plus use of setOpaque and setPaintingIsUnclipped. This is the best I could do. For comparison, note that mp3 decoding (of 4 simultaneous streams) only took up 10%, less than a third of drawing. And the juce Resampler didn’t even make it into the profile, that’s how fast it is.

You could always use antigrain to see if it is fast enough for your needs. I once did this by adding a function in Graphics (to get the destination image), then attaching AGG to that bitmap and drawing with AGG.

Edit: This would be appropriate http://code.google.com/p/graphin/

I doubt agg is going to be that much faster, the structure of the rendering code is similar to Juce but I will check it out.

I want to optimize specifically for my cases of drawing, which are vertical blends and alpha blending with the color black. I have routines that are hard-coded to blend only black into the destination, as I make heavy use of transparent black frames, drop shadows, and whatnot.
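To show why the black-only case is cheaper: with a source colour of zero, the usual source-over blend `src * a + dest * (1 - a)` collapses to `dest * (1 - a)`, so each channel needs only one multiply. The sketch below assumes an opaque 32-bit ARGB destination whose alpha byte is left untouched; `blendBlackOver` and `blendBlackOverSpan` are illustrative names, not my actual routines.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Blends black at the given alpha over one opaque ARGB pixel.
// Each colour channel becomes round (channel * (255 - alpha) / 255).
std::uint32_t blendBlackOver (std::uint32_t dest, std::uint8_t alpha)
{
    const std::uint32_t keep = 255u - alpha;      // share of dest that survives
    std::uint32_t result = dest & 0xff000000u;    // keep dest's alpha byte as-is

    for (int shift = 0; shift < 24; shift += 8)
    {
        const std::uint32_t channel = (dest >> shift) & 0xffu;
        result |= ((channel * keep + 127u) / 255u) << shift;
    }

    return result;
}

// Applies the same blend to a span of `count` pixels.
void blendBlackOverSpan (std::uint32_t* dest, std::size_t count, std::uint8_t alpha)
{
    assert (dest != nullptr || count == 0);

    if (alpha == 0)
        return;                                   // fully transparent: nothing to do

    for (std::size_t i = 0; i < count; ++i)
        dest[i] = blendBlackOver (dest[i], alpha);
}
```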

I’d be highly interested in hearing your results with agg, let us know them!

Vinn, are you aware of this?
http://code.google.com/p/fog/