Juce's not-quite-ready OpenGL Renderer

Using the GLU tessellator?

http://glprogramming.com/red/chapter11.html

Some info on AA:

http://homepage.mac.com/arekkusu/bugs/invariance/index.html

Sadly, no. I already spent a long time writing a very cunning path-to-triangle algorithm, but the only way to reliably render it with AA ended up being slower than just creating a texture from the edge table, which is how it works now.

It’d probably be fast to use triangles if you were happy to just rely on the device’s multisampling to perform the AA, but as far as I can tell the results with that are (at best) piss-poor, or at worst, not anti-aliased at all.

I also found that the quantity of triangle data that needed to be sent to the GPU for some paths was actually not much smaller than the texture data generated from an edge table. And generating an edge table is faster than triangulating. So basically, my conclusion was that unless you’re drawing a big, simple, non-antialiased polygon, triangles just make things harder.

I’d be interested to hear from anyone who’s had experience of profiling GL and could work out where the bottlenecks really are. I suspect that it could all be improved greatly with just a few tweaks in the right places.

A few remarks from reading the Git repo (not tested here, so take them for what they’re worth):

  1. What do you need a stencil buffer for? Or a depth buffer?
    These slow things down quite a bit, and unless you actually expect Components to draw in 3 dimensions, you’d better leave them disabled.
  2. You should use VBO / PBO only when the buffer to send to the GPU is actually big enough to justify the overhead.
  3. I know you’ll hate me, but you need to convert your code to run on the GPU, using shaders. Ideally you should pass the list of the edge table’s sides as a vertex object, and put your rasterization routine in a geometry shader. That way, you don’t send thousands of triangles to the GPU, but let the GPU do the tessellation itself. You can accumulate into a frame buffer using a GL_BLEND operation, so you should be able to do anti-aliasing in the shader.
    If you want to limit yourself to OpenGL ES 2.0, then you’ll need to multiply the number of vertices to reach the anti-aliasing goal you’re targeting, and skip the geometry shader stage.
So instead of:

    for each path:
        scan-line raster the path to create an edge table
        render the edge table

You should probably have:

    if you've got geometry shaders:
        send the path curve characteristics in a buffer object, and tessellate in the GS
        -- then follow below

    for each path:
        figure out the number of vertices along the path with a tessellating algorithm
        (you don't need a large resolution here, and you don't need to actually tessellate yet)
        in the vertex shader, tessellate and rasterize the vertices you've sent, using the
        same algorithm (Bezier?) as what you're doing on the CPU; export all the modified vertices
        the output of this step is like the EdgeTable array, but in a format the GPU can understand
        -- this part is common to GS/VS
        in the fragment shader, do the final rasterization using the vertices coming from
        the previous shader

I’m not sure it’s worth transferring back the transformed VBO, but it’s an idea to test, so you only do it once per path rendering.

Using GL_BLEND allows you to actually do anti-aliasing with no overhead, and it’s “pixel perfect”, but it will cut the achievable frame rate (not sure that’s an issue).

I couldn’t find where the triangulation is in the EdgeTable code, so I can’t tell much more.


Thanks cyril, but I think you’re under-estimating how well I understand it, and how deeply I’ve already investigated this stuff. Most of what you say are things that I’ve already tried, and which failed to work as you’d expect.

  1. I actually don’t use stencil or depth buffers in the 2D renderer. The options are for other people to use if they need them.
  2. I don’t use VBOs or PBOs (?) I use FBOs a bit, but since their performance can be very poor, I now avoid them.
  3. Paths are re-entrant, with non-zero winding rules - that means that tesselating them is incredibly complicated, and not something that can be done by a vertex shader. And to convert a path into a format that a vertex shader can handle would be just as difficult as building an edge table. BUT even if it could be triangulated on the GPU, there’s no point, because there’s no efficient way to anti-alias the result. The only way I found that could anti-alias a triangulated path was to accumulate it into an intermediate framebuffer, and then use that as a texture. But that failed miserably because switching the GL target to a framebuffer blocked the rendering pipeline and was a total showstopper in terms of performance, so I scrapped all that code after wasting a lot of time on it. It also looked like shit.
    And yes, fragment shaders might make a small improvement to the GPU time needed for drawing gradients, but in fact that’s not where the bottleneck is anyway, so it wouldn’t help with the current performance problems at all!

You certainly have more knowledge about the current rendering than I do.
For antialiasing, I was thinking of using a Blend mode. You’ll then ignore the anti-aliasing issue, letting the GL primitive blend with the current rendering buffer.
That way you’ll draw in your fragment, and the color will be blended magically by the hardware.

You probably already know it, but this is very useful:

I wonder if the polygon stuff in OGL is enough for the current path code, but maybe it isn’t worth it?

You might want to have a look here too:
http://code.google.com/p/skia/source/browse/#svn%2Ftrunk
Check the file SkConcaveToTriangles.cpp and the gpu folder

[quote]For antialiasing, I was thinking of using a Blend mode. You’ll then ignore the anti-aliasing issue, letting the GL primitive blend with the current rendering buffer.
That way you’ll draw in your fragment, and the color will be blended magically by the hardware.
[/quote]

Eh?? Blending != anti-aliasing. Are you confusing the terms “semi-transparency” and “anti-aliasing”?

And yes, I looked at the skia triangulation stuff, but it didn’t handle re-entrant paths with implicit holes. So I spent a couple of days writing a cunning triangulation algorithm of my own that did work correctly, but it was a waste of time because even when you’ve got the triangles, there’s no way to actually render them with AA. Honestly, it’s all a total pain!

I’m no OGL expert, and you probably already tried this, but anyway…


It’s kind of well established that you need to avoid texture swapping, pre-bake textures into atlases, etc., so that you end up with just a very few draw batches, especially on iOS/Android.

So in the world of Juce, and general path rendering, maybe it would be possible to do all complex path rendering on one appropriately big static render-texture (FBO), multisampled/supersampled by some really big factor - maybe 8?

No, like I said, I did it with framebuffers originally, and it was like hitting a brick wall in terms of performance. The fastest way I found to get new data into the pipeline was by creating a new texture, because that can be done without swapping the rendering target.

Sorry, it’s late here, but that sounds just wrong. Creating new textures will cause texture swapping, and that’s going to slow it down.
Having one static texture for rendering paths (clearing it before rendering a path) should avoid the swapping.

The only knowledge I have about keeping things efficient with rendering (apart from doing less!) is to minimise switching of shader settings. i.e. if you’re drawing 3000 quads, with each textured (randomly) with one of 3 (different) textures, the quickest way to draw them would be to have them sorted into 3 sets and render them in 3 batches. Of course, this would also require a z buffer to ensure that they ultimately overlap properly, as they would not be getting drawn in the back-to-front order.

of course, it’s easier to think about stuff like that with games than a general purpose ui framework!

and i doubt that is of any use [and has already been considered] :slight_smile:

I thought so too, which is why my first attempt was to use a framebuffer. But like I said, in practice it’s many times faster to create a new texture and upload it than to mess about swapping to a framebuffer and issuing draw commands.

And for a lot of the drawing that gets done, the sizes are fairly small, so to send a small mono texture to the GPU is often actually less data than passing it a massive list of triangles + commands to draw them.

Yeah, I wish that was the kind of optimisation I could do! I’ve got a few ideas left that might reduce the number of GL function calls and let the GPU work on larger batches of triangles, but I’ve been finding that whatever intuitions I have about what should be more efficient never seem to match reality!

No. What I was thinking was something like this:

  1. enable blending so the fragments accumulate on both alpha & colors
  2. rasterize your edge table in your shader with no offset (0)
  3. in your fragment shader/GS, compute the actual rasterization with a pixel grid that’s denser than the actual output
  4. then compute the actual alpha sum for each final destination pixel, and emit that color, multiplied by the alpha value resulting from the sum.
    It’s like doing multisampling in the fragment shader/GS.

Since the alpha + color will blend with the previous output, you’ll get an anti-aliasing effect (in fact it’s a pseudo anti-aliasing, but I wonder if it’ll look as good as real AA).
Or you have this option too: http://visual-computing.intel-research.net/publications/papers/2009/mlaa/mlaa.pdf
Implementation here: https://gitlab.freedesktop.org/mesa/mesa and here: http://visual-computing.intel-research.net/publications/papers/2009/mlaa/testMLAA.zip
Also the FXAA technique might be useful: Khronos Forums - Khronos Standards community discussions (see the shader code to test, and also the 2 links in the final posts)

You could also simply accept that AA is not absolutely required when enabling OGL rendering.
Or, stupid question, did you check WGL_SAMPLE_BUFFERS / GLX_SAMPLE_BUFFERS?

OBO’s remark sounds correct to me too. In the VideoComponent I’ve written, I’m uploading YUV data to 3 different textures, using a double-buffering technique (6 actual textures used). The buffer I’ve mapped is not the one currently used in the pipeline, and on glEnd I’m swapping the textures for the next rendering. This comes with absolutely no performance hit - but then, you’re not doing the path rendering in OGL with this technique.
If you try to map a buffer that’s being used, you’re going to hit a pipeline stall, but I’m sure you’ve already thought about that.

[quote]1) enable blending so the fragments accumulate on both alpha & colors
2) rasterize your edge table in your shader with no offset (0)
3) in your fragment shader/GS, compute the actual rasterization with a pixel grid that’s denser than the actual output
4) then compute the actual alpha sum for each final destination pixel, and emit that color, multiplied by the alpha value resulting from the sum.
It’s like doing multisampling in the fragment shader/GS.[/quote]

Ok, that sounds interesting. My knowledge of shaders is very very sketchy, but I thought that the way it worked was that a fragment shader only has access to a single pixel, so that it’s not possible to calculate the sum of more than one pixel?

Yes, what you’re doing sounds like the same as my current implementation. Uploading data to a new texture does seem pretty efficient.

You’re right, but it’s not an issue.
You need to set up an FB that’s larger than the required area (for example, twice as large in both width and height).
In the FS, you’d do something like:

// uniform sampler2D tex, uniform float width and uniform float height required on top
// 'tex' holds the mask rendered at double resolution (width * 2 by height * 2)

    // normalised size of one texel in the double-resolution texture
    float texelW = 1.0 / (width * 2.0);
    float texelH = 1.0 / (height * 2.0);

    // centre of this output pixel, mapped into the double-resolution texture
    vec2 uv = gl_FragCoord.xy / vec2 (width, height);

    vec2 topLeft     = uv + vec2 (-0.5 * texelW, -0.5 * texelH);
    vec2 topRight    = uv + vec2 ( 0.5 * texelW, -0.5 * texelH);
    vec2 bottomLeft  = uv + vec2 (-0.5 * texelW,  0.5 * texelH);
    vec2 bottomRight = uv + vec2 ( 0.5 * texelW,  0.5 * texelH);

    // average the 2x2 block: a box-filtered downsample of the supersampled FB
    gl_FragColor = (texture2D (tex, topLeft) + texture2D (tex, topRight)
                  + texture2D (tex, bottomLeft) + texture2D (tex, bottomRight)) * 0.25;

Gaaahhh. For the millionth time on this thread: writing to intermediate framebuffers is NOT AN OPTION! I already had some great code that drew nicely anti-aliased polygons into a normal-sized framebuffer without shaders or anything fancy. That wasn’t the problem - it was the swapping between framebuffers that made it unusable.

What I need to speed it all up would be some trick that I could use to send a bunch of triangles directly to the screen, and have them drawn with anti-aliasing, straight into the target - NOT involving a f**king framebuffer!!

You can write to an FBO that’s attached to the render buffer, so you don’t have to swap them.
I don’t know what you’ve tried. Did you see this: http://www.songho.ca/opengl/gl_fbo.html
Their FBO demo code is clearly faster than the version without an FBO.

Anyway, in that case, what about MLAA or FXAA (see my previous post)?
The former applies AA to the final image without requiring you to do any AA rendering at all; the latter does AA per primitive.

Unfortunately I don’t have much spare time for the next few weeks, but I’d have loved to see what you came up with, and to try your experiments myself to see what the issue could be.

Yes, of course I understand how framebuffers work!!

The framebuffer stuff doesn’t appear slow, but when you’re drawing hundreds of paths per update, that means hundreds of framebuffer switches, and my app was spending almost all the CPU time sitting inside either glClear or the function that binds a new framebuffer, presumably flushing the pipeline before continuing. Other drivers may handle that situation better, but I’m using a top-end MacBook with a good Nvidia GPU, so if something doesn’t work here, it’s clearly not a workable design.

And yes, of course MLAA and FXAA were the first things I looked into, but they’re crappy quality, and unavailable on a lot of drivers. It might be possible on high-end smartphones to just use MLAA, just because the resolutions are so high that it doesn’t matter, but I want something that works in general too.

That’s the part I don’t get. Why do you render to a different framebuffer, and not just one, using a blend function?

[quote]That’s the part I don’t get. Why do you render to a different framebuffer, and not just one, using a blend function?[/quote]

For what feels like the thousandth time, this is what I was doing:

  1. Bind the framebuffer
  2. Clear it
  3. Accumulate/jitter the polygon into this buffer to build up a (mono) anti-aliased mask
  4. Switch back to the original target context/screen
  5. Select the framebuffer as a texture, and use it as a mask to blend the appropriate colour/gradient/image/whatever onto the screen

Here is a very unorthodox idea that almost avoids swapping render targets:

  1. Have a screen buffer in the upper half of a texture, and a drawing area in the lower half.
  2. Do all render work in the lower half of the texture (here you can even use multi/super-sampling).
  3. Copy each render operation’s result to the upper area.
  4. At the end of each frame, copy the upper part to the screen (either by one render-target swap per frame, or by reading pixels from video memory and copying to the Juce software renderer).