Juce's not quite ready OpenGL Renderer

The JUCE_OSX_OPENGL_RENDERER stuff is definitely not finished, please ignore it for now!

What I’d be interested in hearing about at the moment is how you get on at mixing 2D and 3D rendering, like I’ve done on the new GL demo page.

And BTW the render times in the demo are only measured for drawing the actual shape that’s being animated, not the entire repaint time. I’ve not written the stuff to handle dirty repaint regions yet, so in OpenGL mode, the demo is continually redrawing the entire window, which is the main reason it’s sluggish (apart from the fact that it’s just slow!)

ah, sorry for my confusion - wasn’t aware of the new OpenGL demo

ran test again this time with OpenGL demo. Result:
-on the ipad1 it’s quite slow, you can easily see every frame update - I estimate about 3 fps
-on the mac mini, with the default juce demo window size, it’s quite smooth and nice looking
however, when I maximize the window to 1920x1200 it becomes slower and a bit jerky, estimate 10 fps or so

would be interesting to have an fps counter or something in the demo so I can give you more accurate stats

some more tests

modified an existing custom component having a quite dynamic ui that does all of its drawing in the Component::paint() override,
as follows :
- removed paint(Graphics& g) override and renamed to paintOld(Graphics& g)
- derived component from OpenGLComponent instead of from Component
- added override of OpenGLComponent::renderOpenGL() which calls the paintOld() method with an OpenGLRenderer-provided Graphics object:

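void MyComponent::renderOpenGL()
{
  // Clear to an opaque background before drawing
  OpenGLHelpers::clear (Colours::darkgrey.withAlpha (1.0f));
  // Wrap this GL component in a 2D renderer and hand it to the old paint code
  OpenGLRenderer glRenderer (*this);
  Graphics g (&glRenderer);
  paintOld (g);
}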

(I hope this is the way it’s intended to be used)

issues found:
- Drawing is a lot (an order of magnitude at least) slower than using regular Component + paint()
  (note the clipping boundaries are equal in both cases (max area of the component), so that probably does not explain the difference)
- There is a problem with clipping when the OpenGLComponent is a child of a Viewport and the viewport needs to show scrollbars: in that case the OpenGLComponent draws outside the boundaries of the allotted viewport space

tested using the tip of the modules branch, and so far tested only on Windows 7, will test on OSX/iOS soon
hope this helps

Yeah, the performance is annoying - some things are very very fast (e.g. large coloured areas, drawing images, vertical/horizontal lines), but edge tables and paths aren’t so good, due to the way I’ve had to do the anti-aliasing. Any suggestions by experienced openGLers as to ways I could optimise it would be appreciated!

…well yeah, of course. The GL component is a window that’s slapped on top of everything, not a normal juce component. In the future it should be possible to avoid that though, if the whole window is using GL.

Using the GLU tessellator?

http://glprogramming.com/red/chapter11.html

Some info on AA:

http://homepage.mac.com/arekkusu/bugs/invariance/index.html

Sadly, no… I already spent a long time writing a very cunning path-to-triangle algorithm, but the only way to reliably render it with AA ended up being slower than just creating a texture from the edge table, which is how it works now.

It’d probably be fast to use triangles if you were happy to just rely on the device’s multisampling to perform the AA, but as far as I can tell the results with that are (at best) piss-poor, or at worst, not anti-aliased at all.

I also found that the quantity of triangle data that needed to be sent to the GPU for some paths was actually not much smaller than the texture data generated from an edge table. And generating an edge table is faster than triangulating. So basically, my conclusion was that unless you’re drawing a big, simple, non-antialiased polygon, triangles just make things harder.

I’d be interested to hear from anyone who’s had experience of profiling GL, and could work out where the bottlenecks really are… I suspect that it could all be improved greatly with just a few tweaks in the right places.

A few remarks from reading the code in Git (not tested here, so take them for what they’re worth):

  1. What do you need a stencil buffer for? Or a depth buffer?
    This slows things down quite a bit, and unless you actually expect a Component to draw in 3 dimensions, you’d better leave that disabled.
  2. You should use a VBO / PBO only when the buffer to send to the GPU is actually big enough to justify the overhead.
  3. I know you’ll hate me, but you need to convert your code to run on the GPU, using shaders. Ideally you should pass the list of the edge table’s sides as a vertex object, and put your rasterization routine in a geometry shader. That way, you don’t send thousands of triangles to the GPU, but let the GPU do the tessellation itself. You can accumulate in a framebuffer using the GL_BLEND operation, so you should be able to do anti-aliasing in the shader.
    If you want to limit yourself to OpenGL ES 2.0, then you’ll need to multiply the number of vertices to reach the anti-aliasing goal you’re targeting, and skip the geometry shader stage.
So instead of:
for each path:
   scan-line raster the path to create edgetable
   render edgetable

You should probably have:
If you have geometry shaders,
    Send the path curves' characteristics in a buffer object, and tessellate in the GS.
    -- then continue with the common part below

for each path:
    figure out the number of vertices along the path with a tessellating algorithm (you don't need a large resolution here). You don't need to tessellate here
    in the vertex shader, tessellate and rasterize the vertices you've sent using the same algorithm (Bezier?) as what you're doing on the CPU. Export all the modified vertices.
    The output of this step is like the EdgeTable array, but in a format the GPU can understand
    --  this part is common to GS/VS
    In the fragment shader, you'll do the final rasterization using the vertices coming from the previous shader.

I’m not sure it’s worth transferring back the transformed VBO, but it’s an idea to test, so you only do it once per path rendering.

Using GL_BLEND allows you to actually do antialiasing with no overhead, and it’s “pixel perfect”, but you’ll divide the possible frame rate (not sure it’s an issue).

I couldn’t find where the triangulation is in the EdgeTable code, so I can’t tell much more.

Thanks cyril, but I think you’re under-estimating how well I understand it, and how deeply I’ve already investigated this stuff. Most of what you say are things that I’ve already tried, and which failed to work as you’d expect.

  1. I actually don’t use stencil or depth buffers in the 2D renderer. The options are for other people to use if they need them.
  2. I don’t use VBOs or PBOs (?) I use FBOs a bit, but since their performance can be very poor, I now avoid them.
  3. Paths are re-entrant, with non-zero winding rules - that means that tesselating them is incredibly complicated, and not something that can be done by a vertex shader. And to convert a path into a format that a vertex shader can handle would be just as difficult as building an edge table. BUT… even if it could be triangulated on the GPU, there’s no point, because there’s no efficient way to anti-alias the result. The only way I found that could anti-alias a triangulated path was to accumulate it into an intermediate framebuffer, and then use that as a texture. But that failed miserably because switching the GL target to a framebuffer blocked the rendering pipeline and was a total showstopper in terms of performance, so I scrapped all that code, after wasting a lot of time on it… It also looked like shit.
    And yes, fragment shaders might make a small improvement to the GPU time needed for drawing gradients, but in fact that’s not where the bottleneck is anyway, so it wouldn’t help with the current performance problems at all!
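To make the non-zero winding point concrete, here is a minimal C++ sketch of how a single scanline gets filled under that rule. This is not JUCE's EdgeTable code: `Crossing` and `nonZeroSpans` are hypothetical names, the crossings are assumed to be computed already, and there is no anti-aliasing.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// One place where a path edge crosses the current scanline (hypothetical type).
struct Crossing
{
    float x;        // x position of the crossing
    int direction;  // +1 if the edge runs downwards, -1 if it runs upwards
};

// Returns the filled [start, end) spans of one scanline under the non-zero
// winding rule: a point is inside while the running sum of directions != 0.
std::vector<std::pair<float, float>> nonZeroSpans (std::vector<Crossing> crossings)
{
    std::sort (crossings.begin(), crossings.end(),
               [] (const Crossing& a, const Crossing& b) { return a.x < b.x; });

    std::vector<std::pair<float, float>> spans;
    int winding = 0;
    float spanStart = 0.0f;

    for (const auto& c : crossings)
    {
        const int previous = winding;
        winding += c.direction;

        if (previous == 0 && winding != 0)
            spanStart = c.x;                      // entering the shape
        else if (previous != 0 && winding == 0)
            spans.emplace_back (spanStart, c.x);  // leaving the shape
    }

    return spans;
}
```

Two overlapping loops drawn in the same direction merge into one span, while an inner loop drawn in the opposite direction leaves a hole. A triangulator has to recover that same inside/outside information for the whole path at once, not one scanline at a time.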

You certainly have more knowledge about the current rendering than I do.
For antialiasing, I was thinking of using a Blend mode. You’ll then ignore the anti-aliasing issue, letting the GL primitive blend with the current rendering buffer.
That way you’ll draw in your fragment, and the color will be blended magically by the hardware.

You probably already know it, but this is very useful:

I wonder if the polygon stuff in OGL is enough for the current path code, but maybe it isn’t worth it?

You might want to have a look here too:
http://code.google.com/p/skia/source/browse/#svn%2Ftrunk
Check the file SkConcaveToTriangles.cpp and the gpu folder

[quote]For antialiasing, I was thinking of using a Blend mode. You’ll then ignore the anti-aliasing issue, letting the GL primitive blend with the current rendering buffer.
That way you’ll draw in your fragment, and the color will be blended magically by the hardware.
[/quote]

Eh?? Blending != anti-aliasing. Are you confusing the terms “semi-transparency” and “anti-aliasing”?

And yes, I looked at the skia triangulation stuff, but it didn’t handle re-entrant paths with implicit holes. So I spent a couple of days writing a cunning triangulation algorithm of my own that did work correctly, but it was a waste of time because even when you’ve got the triangles, there’s no way to actually render them with AA. Honestly, it’s all a total pain!

I’m no OGL expert, and you’ve probably already tried this, but anyway…

It’s fairly well established that you need to avoid texture swapping, and pre-bake textures into atlases etc., so as to have just a very few draw batches, especially on iOS/Android.

So in the just world of Juce, and general path rendering, maybe it would be possible to do all complex path rendering on one appropriately big static render-texture (FBO), multisampled/supersampled by some really big factor, maybe 8?

No, like I said, I did it with framebuffers originally, and it was like hitting a brick wall in terms of performance. The fastest way I found to get new data into the pipeline was by creating a new texture, because that can be done without swapping the rendering target.

Sorry, it’s late here… but that sounds just wrong. Creating new textures will cause texture swapping, and that’s going to slow it down.
Having one static texture for rendering paths (clearing it before rendering a path) should avoid swapping.

The only knowledge I have about keeping things efficient with rendering (apart from doing less!) is to minimise switching of shader settings. i.e. if you’re drawing 3000 quads, with each textured (randomly) with one of 3 (different) textures, the quickest way to draw them would be to have them sorted into 3 sets and render them in 3 batches. Of course, this would also require a z buffer to ensure that they ultimately overlap properly, as they would not be getting drawn in the back-to-front order.
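A rough C++ sketch of that batching idea, with no GL calls at all: it just counts how often the bound texture would change before and after sorting. `Quad`, `countTextureBinds` and `sortByTexture` are made-up names for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// A quad to draw, reduced to the only field that matters here (hypothetical type).
struct Quad
{
    int textureId;
};

// Counts how many times the bound texture changes when drawing in the given order.
std::size_t countTextureBinds (const std::vector<Quad>& quads)
{
    std::size_t binds = 0;
    int bound = -1;  // assumes real texture IDs are never negative

    for (const auto& q : quads)
    {
        if (q.textureId != bound)
        {
            bound = q.textureId;
            ++binds;
        }
    }

    return binds;
}

// Reorders quads so that all quads sharing a texture are adjacent.
void sortByTexture (std::vector<Quad>& quads)
{
    std::stable_sort (quads.begin(), quads.end(),
                      [] (const Quad& a, const Quad& b) { return a.textureId < b.textureId; });
}
```

With three textures the sorted list never needs more than three binds, however many quads there are. The catch is the one already mentioned: sorting throws away the back-to-front order, so something like a z buffer has to restore correct overlap.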

of course, it’s easier to think about stuff like that with games than a general purpose ui framework!

and i doubt that is of any use [and has already been considered] :slight_smile:

I thought so too, which is why my first attempt was to use a framebuffer. But like I said, in practice it’s many times faster to create a new texture and upload it than to mess about swapping to a framebuffer and issuing draw commands.

And for a lot of the drawing that gets done, the sizes are fairly small, so to send a small mono texture to the GPU is often actually less data than passing it a massive list of triangles + commands to draw them.
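As a back-of-envelope illustration of that size comparison (the vertex layout here is an assumption, not what the renderer actually sends): an 8-bit mono texture costs one byte per pixel, while a plain non-indexed triangle list with two 32-bit floats per vertex costs 24 bytes per triangle.

```cpp
#include <cstddef>

// Bytes for an 8-bit single-channel (mono) coverage texture.
constexpr std::size_t monoTextureBytes (std::size_t width, std::size_t height)
{
    return width * height;  // one byte per pixel
}

// Bytes for a non-indexed triangle list with two floats (x, y) per vertex.
// This vertex layout is assumed purely for illustration.
constexpr std::size_t triangleListBytes (std::size_t numTriangles)
{
    return numTriangles * 3 * 2 * sizeof (float);
}
```

So a 32x32 shape is 1024 bytes as a texture, and a hypothetical path that triangulates into 50 triangles is already 1200 bytes (with 4-byte floats) before counting any draw commands. These figures are illustrative arithmetic, not measurements.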

Yeah, I wish that was the kind of optimisation I could do! I’ve got a few ideas left that might reduce the number of GL function calls and let the GPU work on larger batches of triangles, but I’ve been finding that whatever intuitions I have about what should be more efficient never seem to match reality!

No. What I was thinking was something like this:

  1. enable blending so the fragments accumulate on both alpha & colors
  2. rasterize your edgetable to your shader with no offset (0).
  3. In your fragment shader/GS compute the actual rasterization with a pixel grid that’s more dense than the actual output.
  4. Then compute the actual alpha sum for each final destination pixel, and emit that color, multiplied by the alpha value resulting from the sum.
    It’s like doing multisampling in the fragment shader/GS.

Since the alpha + color will blend with the previous output, you’ll get an anti-aliasing effect (in fact, it’s a pseudo anti-aliasing, but I wonder whether it’ll look as good as real AA).
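Steps 3 and 4 above can be sketched on the CPU in plain C++ (a shader version would do the same loop per fragment). This only shows the idea of turning a denser sample grid into an alpha value and then blending with it; `coverage` and `blendOver` are hypothetical helpers, and `inside` stands in for whatever test decides if a point is in the path.

```cpp
// Fraction of pixel (px, py) covered by a shape, estimated from an n x n grid
// of sub-samples. `inside (x, y)` returns true when the point is in the shape.
template <typename InsideFn>
float coverage (int px, int py, int n, InsideFn inside)
{
    int hits = 0;

    for (int sy = 0; sy < n; ++sy)
    {
        for (int sx = 0; sx < n; ++sx)
        {
            // Sample at the centre of each sub-pixel cell
            const float x = px + (sx + 0.5f) / n;
            const float y = py + (sy + 0.5f) / n;

            if (inside (x, y))
                ++hits;
        }
    }

    return static_cast<float> (hits) / (n * n);
}

// Standard "source over" blend of one colour channel.
float blendOver (float src, float dst, float alpha)
{
    return src * alpha + dst * (1.0f - alpha);
}
```

The blend on its own only mixes two colours. The anti-aliasing comes from the alpha being a coverage estimate, so the quality depends entirely on how dense the sub-sample grid is.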
Or you have this option too: http://visual-computing.intel-research.net/publications/papers/2009/mlaa/mlaa.pdf
Implementation here: http://cgit.freedesktop.org/mesa/mesa/commit/?id=6571c0774af1f5ebd0fab40bf4769702d3c9ded5 and here: http://visual-computing.intel-research.net/publications/papers/2009/mlaa/testMLAA.zip
Also the FXAA technique might be useful: http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&Number=302326 (see the shader code to test, and also the 2 links in the final posts)

You can also accept that AA is not absolutely required when enabling OGL rendering.
Or, stupid question, did you check WGL_SAMPLE_BUFFERS / GLX_SAMPLE_BUFFERS?

OBO’s remark sounds correct to me too. In the VideoComponent I’ve written, I’m uploading YUV data to 3 different textures, and using a double-buffering technique (6 actual textures used). The buffer I’ve mapped is not the one currently used in the pipeline, and on glEnd, I’m swapping the textures for the next rendering. This comes with absolutely no performance hit. But you’re not using OGL for path rendering with this technique.
If you try to map a buffer that’s being used, you’re going to hit a pipeline stall, but I’m sure you’ve already thought about that.
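The double-buffering bookkeeping looks roughly like this in C++. It is a sketch with plain integers standing in for texture handles and no GL calls; `DoubleBuffer` is a hypothetical class, and the real thing would use one pair per plane (hence 6 textures for Y, U and V).

```cpp
#include <array>
#include <cstddef>

// Two slots standing in for GL texture handles (hypothetical; no GL calls here).
class DoubleBuffer
{
public:
    DoubleBuffer (unsigned int a, unsigned int b) : slots { { a, b } } {}

    // The texture the GPU samples from this frame.
    unsigned int front() const { return slots[frontIndex]; }

    // The texture the CPU can fill without touching what is being drawn.
    unsigned int back() const { return slots[1 - frontIndex]; }

    // Call once per frame, after the upload to back() has finished.
    void swap() { frontIndex = 1 - frontIndex; }

private:
    std::array<unsigned int, 2> slots;
    std::size_t frontIndex = 0;
};
```

Because uploads only ever go to `back()`, the texture currently in use is never written to, which is what avoids the stall.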

[quote]1) enable blending so the fragments accumulate on both alpha & colors
2) rasterize your edgetable to your shader with no offset (0).
3) In your fragment shader/GS compute the actual rasterization with a pixel grid that’s more dense than the actual output.
4) Then compute the actual alpha sum for each final destination pixel, and emit that color, multiplied by the alpha value resulting from the sum.
It’s like doing multisampling in the fragment shader/GS.[/quote]

Ok, that sounds interesting. My knowledge of shaders is very very sketchy, but I thought that the way it worked was that a fragment shader only has access to a single pixel, so that it’s not possible to calculate the sum of more than one pixel?

Yes, what you’re doing sounds like the same as my current implementation. Uploading data to a new texture does seem pretty efficient.

You’re right, but it’s not an issue.
You need to set up a FB that’s larger than the required area (for example twice as large in both width and height).
In the FS, you’ll do

    // Fragment shader sketch: 2x2 box filter over a texture rendered at twice the output size.
    // Assumes width and height are the supersampled texture's size in texels.
    uniform sampler2D tex;
    uniform float width, height;

    void main()
    {
        if (gl_FragCoord.x > width / 2.0 || gl_FragCoord.y > height / 2.0)
            discard; // We don't care about the result here

        vec2 texel = vec2(1.0 / width, 1.0 / height);   // size of one source texel
        vec2 centre = 2.0 * gl_FragCoord.xy * texel;     // centre of this pixel's 2x2 block
        vec2 topLeft = centre + vec2(-0.5, -0.5) * texel;
        vec2 topRight = centre + vec2(0.5, -0.5) * texel;
        vec2 bottomLeft = centre + vec2(-0.5, 0.5) * texel;
        vec2 bottomRight = centre + vec2(0.5, 0.5) * texel;
        gl_FragColor = 0.25 * (texture2D(tex, topLeft) + texture2D(tex, topRight)
                             + texture2D(tex, bottomLeft) + texture2D(tex, bottomRight));
    }
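For anyone who finds shader code hard to follow, the 2x2 averaging that the snippet above is aiming for can be written on the CPU in a few lines of C++. This is only a reference sketch on a greyscale image stored row-major in a `std::vector<float>`; `downsample2x2` is a hypothetical name and the source dimensions are assumed to be even.

```cpp
#include <cstddef>
#include <vector>

// Averages each 2x2 block of a greyscale image (row-major, srcWidth x srcHeight,
// both assumed even) into one output pixel.
std::vector<float> downsample2x2 (const std::vector<float>& src,
                                  std::size_t srcWidth, std::size_t srcHeight)
{
    const std::size_t dstWidth = srcWidth / 2;
    const std::size_t dstHeight = srcHeight / 2;
    std::vector<float> dst (dstWidth * dstHeight);

    for (std::size_t y = 0; y < dstHeight; ++y)
    {
        for (std::size_t x = 0; x < dstWidth; ++x)
        {
            const std::size_t top = (2 * y) * srcWidth + 2 * x;  // top-left texel of the block
            const std::size_t bottom = top + srcWidth;           // texel directly below it

            dst[y * dstWidth + x] = 0.25f * (src[top] + src[top + 1]
                                             + src[bottom] + src[bottom + 1]);
        }
    }

    return dst;
}
```

A block with three of its four texels covered comes out as 0.75, which is the partial-coverage value you want at an edge.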

Gaaahhh… For the millionth time on this thread: writing to intermediate framebuffers is NOT AN OPTION! I already had some great code that drew nicely AA polygons into a normal-sized framebuffer without shaders or anything fancy. That wasn’t the problem, it was the swapping between framebuffers that made it unusable.

What I need to speed it all up would be some trick that I could use to send a bunch of triangles directly to the screen, and have them drawn with anti-aliasing, straight into the target - NOT involving a f**king framebuffer!!

You can write to an FBO that’s attached to the render buffer, so you don’t have to swap them.
I don’t know what you’ve tried. Did you see this: http://www.songho.ca/opengl/gl_fbo.html
Their FBO demo code is clearly faster than the version without an FBO.

Anyway, in that case, what about MLAA or FXAA (see my previous post)?
The former applies AA on the final image without requiring you to do any AA rendering at all; the latter does AA per primitive.

Unfortunately I don’t have much spare time for the next few weeks, but I’d love to see what you came up with, and to try your attempts myself to see what the issue could be.