LowLevelGraphicsSoftwareRenderer


#1

This was inspired by Jules’ response to the thread regarding hinting:

I hope that in the rush to make everything transformable, and to let the operating system do most of the rendering work, we don’t lose the valuable ability to replace the LowLevelGraphicsSoftwareRenderer with one that is optimized for the application. I make extensive use of rectangle fills, both solid and with opacity, with some proprietary effects thrown in. A lot of the rectangles are horizontal and vertical lines. The performance using the built-in LowLevelGraphicsSoftwareRenderer is pretty bad.

I haven’t complained about it because optimization is a last step but I already know from my efforts in the previous iteration of my application (for which I am throwing away my home-brew GUI code and replacing it with Juce), that significant performance improvements result specifically from hand tuning the cases of drawing fully opaque, and partially transparent, rectangles, vertical lines, and horizontal lines. Note this includes vertical blends, which are merely a general case of many horizontal lines.

I highly doubt that Direct2D, or whatever Microsoft’s new-fangled proprietary GDI API flavor of the month, will outperform a hand tuned solution, since it will always be written for the general case and can usually not take advantage of properties unique to the individual application.

So I guess what I am saying is please let’s not ignore this use-case. It’s true that the LowLevelGraphicsSoftwareRenderer can’t be swapped out for a heavyweight ComponentPeer in the current code base, but making that change is fairly simple.


#2

I’ve certainly no plans to change the architecture of those graphics context classes - I think they work really well. And part of the process of supporting new rendering platforms like CoreGraphics, Direct2D, etc is that everything needs to be easily swappable and work in a standard way, so I don’t think you should worry.


#3

What’s the best way to change the ComponentPeer so that the choice of renderer can be changed?

Maybe a LookAndFeel function?

virtual LowLevelGraphicsContext* LookAndFeel::GetLowLevelGraphicsContextForPeer( Image& imageToRenderOn, ComponentPeer* peer )
{
  // default: the stock software renderer; a subclass could return its own
  return new LowLevelGraphicsSoftwareRenderer( imageToRenderOn );
}


#4

I just tweaked my build environment so I can build my application using either the Juce 1.51 version, or the tip from GIT. I compiled the Release version using the tip, and having made no other changes to code, the user interface redraws a little bit faster than from Juce 1.51.

Cool!


#5

Yes, I did a lot of optimisation since then.

Not sure about your question about changing the renderer… It’s not really a look+feel thing, I’d have to have a think about that one.


#6

Looking more closely at the implementation of LowLevelGraphicsSoftwareRenderer, I don’t think that I need to replace the whole thing. I’m only interested in the case where the clip is a RectangleList, and then only for the following cases:

single pixel plot: solid color, transparent color, solid gray, transparent gray
vertical line: solid color, transparent color, solid gray, transparent gray
horizontal line: solid color, transparent color, solid gray, transparent gray
rectangle fill: solid color, transparent color, solid gray, transparent gray
vertical blend: solid color, transparent color, solid gray, transparent gray

I made most of my interface controls using only shades of gray since they can be drawn faster.


#7

I really doubt whether you’ll be able to make much difference to those, I’ve already done a lot of optimisation on the straight line drawing. If you can find any ways to speed it up without impacting other operations, I’d be keen to hear!

So you profiled your app using grey vs coloured graphics, and actually found a measurable difference in cpu? I’m somewhat sceptical…


#8

Well here is an example of my home-brew rendering that is optimized for solid color horizontal lines

  void hline_solid(
    int count, pixval_t* dest,
    const pixval_t* color )
  {
    typedef unsigned long T;

    int rem=((unsigned long)dest)%4;
    pixval_t c0=color[0];
    pixval_t c1=color[1];
    pixval_t c2=color[2];
    int n=3*sizeof(T);
    int end=3*count;
    T l0, l1, l2;
    switch( rem )
    {
    case 1:
      l0=(((T)c1)<<0)+(((T)c2)<<8)+(((T)c0)<<16)+(((T)c1)<<24);
      l1=(((T)c2)<<0)+(((T)c0)<<8)+(((T)c1)<<16)+(((T)c2)<<24);
      l2=(((T)c0)<<0)+(((T)c1)<<8)+(((T)c2)<<16)+(((T)c0)<<24);
      end-=1;
      *dest++=c0;
      break;
    case 2:
      l0=(((T)c2)<<0)+(((T)c0)<<8)+(((T)c1)<<16)+(((T)c2)<<24);
      l1=(((T)c0)<<0)+(((T)c1)<<8)+(((T)c2)<<16)+(((T)c0)<<24);
      l2=(((T)c1)<<0)+(((T)c2)<<8)+(((T)c0)<<16)+(((T)c1)<<24);
      end-=2;
      *dest++=c0;
      *dest++=c1;
      break;
    case 3:
      end-=3;
      *dest++=c0;
      *dest++=c1;
      *dest++=c2;
      // fall through: after three lead-in bytes the pattern starts at c0 again
    default:
    case 0:
      l0=(((T)c0)<<0)+(((T)c1)<<8)+(((T)c2)<<16)+(((T)c0)<<24);
      l1=(((T)c1)<<0)+(((T)c2)<<8)+(((T)c0)<<16)+(((T)c1)<<24);
      l2=(((T)c2)<<0)+(((T)c0)<<8)+(((T)c1)<<16)+(((T)c2)<<24);
      break;
    };
    int big3=end/n;
    end-=big3*n;
    int big1=end/sizeof(T);
    end-=big1*sizeof(T);

    while( big3-- )
    {
      *((T*)dest)=l0;
      dest+=sizeof(T);
      *((T*)dest)=l1;
      dest+=sizeof(T);
      *((T*)dest)=l2;
      dest+=sizeof(T);
    }

    switch( big1 )
    {
    case 2:
      (*(T*)dest)=l0;
      dest+=sizeof(T);
      (*(T*)dest)=l1;
      dest+=sizeof(T);
      break;
    case 1:
      (*(T*)dest)=l0;
      dest+=sizeof(T);
      break;
    }

    switch( end )
    {
    case 3:
      *dest++=c0;
      *dest++=c1;
      *dest++=c2;
      break;
    case 2:
      *dest++=c1;
      *dest++=c2;
      break;
    case 1:
      *dest++=c2;
      break;
    }
  }
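
The idea behind hline_solid is that four packed 24-bit pixels occupy exactly 12 bytes, so the colour repeats as a fixed pattern of three 32-bit words. A minimal sketch of the same idea, assuming packed 3-byte pixels; `fillRow24` is a hypothetical name, and it swaps the hand-aligned word stores for `memcpy` of a prebuilt 12-byte pattern, so it is a simplification rather than the routine above:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: fill `pixels` packed 3-byte pixels with `colour`.
// Four pixels are exactly 12 bytes, so a prebuilt 12-byte pattern can be
// stamped repeatedly instead of storing one component at a time.
inline void fillRow24 (std::uint8_t* dest, std::size_t pixels, const std::uint8_t colour[3])
{
    assert (dest != nullptr);

    std::uint8_t pattern[12];
    for (int i = 0; i < 12; ++i)
        pattern[i] = colour[i % 3];

    std::size_t bytes = pixels * 3;

    while (bytes >= 12)
    {
        std::memcpy (dest, pattern, 12);
        dest  += 12;
        bytes -= 12;
    }

    // The tail is 0, 3, 6 or 9 bytes and starts on a pixel boundary,
    // so the front of the same pattern is still the right data.
    std::memcpy (dest, pattern, bytes);
}
```

Compilers typically turn a fixed-size `memcpy` into plain word stores, which is why this sketch leaves alignment to them.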

Well Juce already has some code for using memset() for horizontal line fills when the color components are all the same:

    forcedinline void SolidColorEdgeTableRenderer::replaceLine (PixelRGB* dest, const PixelARGB& colour, int width) const throw()
    {
        if (areRGBComponentsEqual)  // if all the component values are the same, we can cheat..
        {
            memset (dest, colour.getRed(), width * 3);
        }
        // ...
    }

I believe the MSVC runtime has some optimizations that take advantage of multi word stores for x86 architectures. So clearly, the case of filling a rectangle with a gray can be faster than filling with a color.
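
To make the grey-versus-colour point concrete, here is a small sketch assuming packed 3-byte pixels. The function names are made up for illustration and are not Juce API; the second function only mirrors the `memset` shortcut quoted above:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Arbitrary colour: three separate component stores per pixel.
inline void fillRowColour (std::uint8_t* dest, std::size_t pixels,
                           std::uint8_t c0, std::uint8_t c1, std::uint8_t c2)
{
    assert (dest != nullptr);
    for (std::size_t i = 0; i < pixels; ++i)
    {
        dest[3 * i + 0] = c0;
        dest[3 * i + 1] = c1;
        dest[3 * i + 2] = c2;
    }
}

// Grey: every byte of the row is identical, so one memset covers it.
inline void fillRowGrey (std::uint8_t* dest, std::size_t pixels, std::uint8_t level)
{
    assert (dest != nullptr);
    std::memset (dest, level, pixels * 3);
}

// Checks that both paths produce the same bytes for a grey level.
inline bool greyPathsAgree (std::size_t pixels, std::uint8_t level)
{
    assert (pixels > 0);
    std::vector<std::uint8_t> a (pixels * 3, 0), b (pixels * 3, 0);
    fillRowColour (a.data(), pixels, level, level, level);
    fillRowGrey (b.data(), pixels, level);
    return a == b;
}
```

Whether the `memset` path is actually faster depends on the runtime and the row length, so it is worth measuring rather than assuming.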


#9

Here is another example, a hand rolled routine for filling a rectangle with a vertical blend between two colors with associated transparency values. I haven’t done a full analysis of Juce but I think in some places there is a shortcut of right shifting 8 bits when it should be dividing by 255. In this routine, the calculations are numerically exact:

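  // Exact a*b/255 for 0 <= a*b <= 255*255 (the product times 0x8081 still fits in 32 bits).
  // For negative products this relies on an arithmetic right shift,
  // which is implementation-defined before C++20.
  inline pixval_t fixmul( int a, int b )
  {
    return pixval_t( ( ( a * b ) * 0x8081 ) >> 23 );
  }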

  void vblend_alpha(
    int rows, int cols, int destRowBytes, pixval_t* dest,
    const pixval_t* color0, const pixval_t* color1, pixval_t alpha0, pixval_t alpha1 )
  {
    int mul= rows;
    int l0 = int(color0[0]) * mul;
    int l1 = int(color0[1]) * mul;
    int l2 = int(color0[2]) * mul;
    int a0 = int(alpha0) * mul;
    int d0 = ( ( int(color1[0]) - color0[0] ) * mul ) / rows;
    int d1 = ( ( int(color1[1]) - color0[1] ) * mul ) / rows;
    int d2 = ( ( int(color1[2]) - color0[2] ) * mul ) / rows;
    int da = ( ( int(alpha1) - alpha0 ) * mul ) / rows;

    destRowBytes -= cols * 3;
    while( rows-- )
    {
      int c0 = l0 / mul;
      int c1 = l1 / mul;
      int c2 = l2 / mul;
      int a  = a0 / mul;

      int col=cols;
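      while( col-- )
      {
        // index first, then advance once: avoids reading *dest and
        // incrementing dest within the same expression
        dest[0] = pixval_t( dest[0] + fixmul( a, c0 - dest[0] ) );
        dest[1] = pixval_t( dest[1] + fixmul( a, c1 - dest[1] ) );
        dest[2] = pixval_t( dest[2] + fixmul( a, c2 - dest[2] ) );
        dest += 3;
      }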
      dest += destRowBytes;

      l0 += d0;
      l1 += d1;
      l2 += d2;
      a0 += da;
    }
  }
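
The difference between dividing by 255 and shifting right by 8 can be checked exhaustively for products of two 8-bit values. A minimal sketch, with helper names invented for illustration and the multiply-and-shift constant taken from fixmul above:

```cpp
#include <cassert>
#include <cstdint>

// Exact x / 255 for 0 <= x <= 255 * 255, using a multiply and a shift.
inline std::uint32_t div255Exact (std::uint32_t x)
{
    assert (x <= 255u * 255u);
    return (x * 0x8081u) >> 23;
}

// The common shortcut: shifting right by 8 divides by 256, not 255.
inline std::uint32_t div255Shortcut (std::uint32_t x)
{
    return x >> 8;
}

// Counts the inputs in [0, 255 * 255] for which `f` differs from x / 255.
template <typename Fn>
int countMismatches (Fn f)
{
    int mismatches = 0;

    for (std::uint32_t x = 0; x <= 255u * 255u; ++x)
        if (f (x) != x / 255u)
            ++mismatches;

    return mismatches;
}
```

The visible consequence of the shortcut is at the top of the range: 255 * 255 shifted right by 8 gives 254, so a fully opaque blend of full white lands one level short.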

#10

I like the fact that the software renderer has different code paths for when the clip is an edge table, versus a rectangle list.

For the case where a straight GUI is updating, the ComponentPeer takes the update region and converts it into a nice tidy RectangleList. The low level renderer sets that as the clip.

Now correct me if I’m wrong but in that case it iterates over each rectangle in the list, and then iterates over each row in that rectangle?

ClipRegion_RectangleList::iterate(Renderer& r) const throw()
{
  // ...
            for (int y = rect.getY(); y < bottom; ++y)
            {
                r.setEdgeTableYPos (y);
                r.handleEdgeTableLineFull (x, w);
            }
  // ...
}

I would like to handle each Rectangle in the RectangleList as a whole, instead of having iterate() break it up into lines for me. That way I can calculate values useful for the entire rectangle just once and re-use them for each line, instead of recalculating them every time. See my vblend_alpha, hline_solid and fill_solid examples.
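
A rough sketch of the two iteration styles, using stand-in types rather than real Juce classes. `Rect`, `iteratePerRect`, `handleRectangleFull` and `CountingRenderer` are all hypothetical; only the `setEdgeTableYPos` / `handleEdgeTableLineFull` names come from the snippet above:

```cpp
#include <cassert>
#include <vector>

struct Rect { int x, y, w, h; };   // hypothetical stand-in for a Juce Rectangle

// Per-line style, as in the snippet above: two callbacks for every scanline.
template <class Renderer>
void iteratePerLine (const std::vector<Rect>& clip, Renderer& r)
{
    for (const Rect& rect : clip)
        for (int y = rect.y; y < rect.y + rect.h; ++y)
        {
            r.setEdgeTableYPos (y);
            r.handleEdgeTableLineFull (rect.x, rect.w);
        }
}

// Per-rectangle style: one callback per rectangle, so the renderer can
// precompute alignment, word patterns, blend steps etc. once per rectangle.
template <class Renderer>
void iteratePerRect (const std::vector<Rect>& clip, Renderer& r)
{
    for (const Rect& rect : clip)
        r.handleRectangleFull (rect.x, rect.y, rect.w, rect.h);
}

// Toy renderer that only counts callbacks and covered pixels.
struct CountingRenderer
{
    int  calls  = 0;
    long pixels = 0;

    void setEdgeTableYPos (int)                       { ++calls; }
    void handleEdgeTableLineFull (int, int w)         { ++calls; pixels += w; }
    void handleRectangleFull (int, int, int w, int h) { ++calls; pixels += long (w) * h; }
};
```

Both styles cover the same pixels; the difference is only how often per-call setup has to be repeated.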

Check out this implementation of solid rectangle color fill, it’s pretty damn fast. It takes advantage of precalculating some values for the whole rect instead of doing it per-line:

  void fill_solid(
    int rows, int cols, int destRowBytes, pixval_t* dest,
    const pixval_t* color )
  {
    typedef unsigned long T;

    int rem=((unsigned long)dest)%4;
    pixval_t c0=color[0];
    pixval_t c1=color[1];
    pixval_t c2=color[2];
    int n=3*sizeof(T);
    int end=3*cols;
    T l0, l1, l2;
    switch( rem )
    {
    case 1:
      l0=(((T)c1)<<0)+(((T)c2)<<8)+(((T)c0)<<16)+(((T)c1)<<24);
      l1=(((T)c2)<<0)+(((T)c0)<<8)+(((T)c1)<<16)+(((T)c2)<<24);
      l2=(((T)c0)<<0)+(((T)c1)<<8)+(((T)c2)<<16)+(((T)c0)<<24);
      end-=1;
      break;
    case 2:
      l0=(((T)c2)<<0)+(((T)c0)<<8)+(((T)c1)<<16)+(((T)c2)<<24);
      l1=(((T)c0)<<0)+(((T)c1)<<8)+(((T)c2)<<16)+(((T)c0)<<24);
      l2=(((T)c1)<<0)+(((T)c2)<<8)+(((T)c0)<<16)+(((T)c1)<<24);
      end-=2;
      break;
    case 3:
      end-=3; // fall through
    case 0:
    default:
      l0=(((T)c0)<<0)+(((T)c1)<<8)+(((T)c2)<<16)+(((T)c0)<<24);
      l1=(((T)c1)<<0)+(((T)c2)<<8)+(((T)c0)<<16)+(((T)c1)<<24);
      l2=(((T)c2)<<0)+(((T)c0)<<8)+(((T)c1)<<16)+(((T)c2)<<24);
      break;
    };
    int big3=end/n;
    end-=big3*n;
    int big1=end/sizeof(T);
    end-=big1*sizeof(T);

    destRowBytes-=3*cols;
    while( rows-- )
    {
      switch( rem )
      {
      case 1:
        *dest++=c0;
        break;
      case 2:
        *dest++=c0;
        *dest++=c1;
        break;
      case 3:
        *dest++=c0;
        *dest++=c1;
        *dest++=c2;
        break;
      case 0:
      default:
        break;
      };

      int col=big3;
      while( col-- )
      {
        *((T*)dest)=l0;
        dest+=sizeof(T);
        *((T*)dest)=l1;
        dest+=sizeof(T);
        *((T*)dest)=l2;
        dest+=sizeof(T);
      }

      switch( big1 )
      {
      case 2:
        (*(T*)dest)=l0;
        dest+=sizeof(T);
        (*(T*)dest)=l1;
        dest+=sizeof(T);
        break;
      case 1:
        (*(T*)dest)=l0;
        dest+=sizeof(T);
        break;
      }

      switch( end )
      {
      case 3:
        *dest++=c0;
        *dest++=c1;
        *dest++=c2;
        break;
      case 2:
        *dest++=c1;
        *dest++=c2;
        break;
      case 1:
        *dest++=c2;
        break;
      }

      dest+=destRowBytes;
    }
  }

#11

Oh yeah, I hereby release all my code examples in this thread to the public domain.


#12

Thanks, this is all interesting, but the fact that you’ve not mentioned using a profiler is causing the “premature optimisation” alarm bells in my head to ring very loudly!

I’ve not profiled it either, but my suspicion would be that if you’re rendering a very large number of small pixels/lines/etc, then most of your cpu will be burned by the overhead of making repeated calls through the Graphics class, down through the clipping logic, etc, and that the time spent actually drawing the pixels will be small by comparison, in which case the kind of optimisations you’re suggesting would make little real difference.

I’ve thought about adding a drawMultipleRectangles() method which would let you pre-build an array of rectangles and cut down the clipping overheads, but have never had a chance to experiment and see if it’d be worth the effort.


#13

I have used the Visual Studio 2008 profiler extensively, it was how I came to the conclusion that these routines were the answer.

Let me be clear, I am porting my application from a home-brew UI to Juce. My home-brew code works similarly to Juce, in that on Windows I create DibSection, draw everything myself according to the rectangles that need updating, and then transfer that to the screen. When I profiled my application, it was the drawing and blending of straight lines and rectangles that took up all the time.

On the idea that call overhead through the Graphics class and the clipping logic burns most of the CPU: I think not. It never showed up in my profile, although I have not profiled Juce specifically yet, since I am not done porting. What came to the top were specifically the operations that the functions I provided address.

Another problem is that the Visual Studio 2008 profiler does not function on Windows 7 64-bit. It does, however, work on 32-bit and when I have some time I will go through the effort of creating a new partition just so I can run the profiler.

On drawMultipleRectangles(): I tried that in my own application and it didn’t help. Honestly, the Juce approach to RectangleList clip region, Graphics class API, underlying SoftwareRenderer, and hierarchical redraw nature of Component (with the setOpaque() optimization) is perfectly fine. You aren’t really losing much in terms of function call overhead, and Juce pays almost no penalty for the abstract interface. It’s the actual per-rectangle operations that consume the bulk of time according to profiles (once again with the disclaimer that I have not profiled it specifically in Juce but rather in my own framework that is extremely similar).

My conclusions:

Graphics class and API: Great!
RectangleList clip region and drawing: Great!
setOpaque() function in Component: Great!
Clipping overhead: Not very much at all, Great!

Solid color fill and blend of rectangles, horizontal lines, vertical lines: Definitely can be optimized!
Rectangular vertical blends: Definitely can be optimized!


#14

Phew…5 hours later and I have Windows 7 32-bit installed along with Visual Studio 2008 and the profiler but unfortunately, the latest episode of “Merlin” is distracting me.


#15

Finally got the profiler working somewhat, and just like I suspected, fillRect() and its descendants are at the top of the profile. However, there are some bugs in the profiler output that I still have to resolve.


#16

Again, interesting stuff, thanks for all the details! I look forward to seeing what results you get from actually profiling juce. It’s all obviously going to be heavily dependent on the exact type of drawing you do, so big areas -> lots of time in the pixel blending routines, drawing lots of small items -> more time in the overhead code.


#17

Actually I’m pretty happy with the performance of my custom drawn controls, based on the latest tip. I mean, of course it could be better - stretching the entire window is clearly slower than in my previous version of the application, that used my own implementation of rectangle rendering, and I’m not even done rewriting all the controls for Juce. But it is usable.

On a related note, did you know that you can take advantage of separability for convolutions? The Gaussian blur kernel is radially symmetric, but the property that matters here is that it is separable, so it can be done using a two pass approach (first do all the rows, then do all the columns).
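
A minimal sketch of the two pass approach, assuming a single-channel float image stored row by row and clamped edges. All names here are hypothetical and nothing in it is Juce API:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Normalised 1-D Gaussian kernel with 2 * radius + 1 taps.
inline std::vector<float> makeGaussianKernel (int radius, float sigma)
{
    assert (radius >= 0 && sigma > 0.0f);
    std::vector<float> k (2 * radius + 1);
    float sum = 0.0f;

    for (int i = -radius; i <= radius; ++i)
    {
        k[i + radius] = std::exp (-(i * i) / (2.0f * sigma * sigma));
        sum += k[i + radius];
    }

    for (float& v : k)
        v /= sum;

    return k;
}

// One 1-D pass, along rows (horizontal) or along columns (vertical).
// Samples outside the image are clamped to the nearest edge pixel.
inline std::vector<float> blurPass (const std::vector<float>& src, int width, int height,
                                    const std::vector<float>& kernel, bool horizontal)
{
    const int radius = int (kernel.size() / 2);
    std::vector<float> dst (src.size());

    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
        {
            float acc = 0.0f;

            for (int i = -radius; i <= radius; ++i)
            {
                const int sx = horizontal ? std::min (std::max (x + i, 0), width - 1) : x;
                const int sy = horizontal ? y : std::min (std::max (y + i, 0), height - 1);
                acc += kernel[i + radius] * src[sy * width + sx];
            }

            dst[y * width + x] = acc;
        }

    return dst;
}

// Rows first, then columns.
inline std::vector<float> gaussianBlur (const std::vector<float>& src, int width, int height,
                                        int radius, float sigma)
{
    assert (src.size() == std::size_t (width) * std::size_t (height));
    const std::vector<float> k = makeGaussianKernel (radius, sigma);
    return blurPass (blurPass (src, width, height, k, true), width, height, k, false);
}
```

For a kernel of radius r this costs 2 * (2r + 1) multiplies per pixel instead of (2r + 1) squared for the direct 2-D convolution.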