Juce's not-quite-ready OpenGL Renderer

You’re right, but it’s not an issue.
You need to set up a FB that’s larger than the required area (for example twice as large in both width and height).
In the FS, you’ll do something like this:

// uniform float width, height; and uniform sampler2D tex; required on top
// (width/height here being the size of the oversized framebuffer)

    if (gl_FragCoord.x > width / 2.0 || gl_FragCoord.y > height / 2.0)
        discard; // we don't care about the result outside the target quadrant

    // one texel of the oversized texture, in normalised coordinates:
    float texelW = 1.0 / width, texelH = 1.0 / height;
    // gl_FragCoord is in pixels, so normalise it (and scale by 2, since the
    // quadrant we're writing to covers half the texture in each direction):
    vec2 uv = vec2 (gl_FragCoord.x * 2.0 * texelW, gl_FragCoord.y * 2.0 * texelH);
    vec2 topLeft     = uv + vec2 (-0.5 * texelW, -0.5 * texelH);
    vec2 topRight    = uv + vec2 ( 0.5 * texelW, -0.5 * texelH);
    vec2 bottomLeft  = uv + vec2 (-0.5 * texelW,  0.5 * texelH);
    vec2 bottomRight = uv + vec2 ( 0.5 * texelW,  0.5 * texelH);
    gl_FragColor = 0.25 * (texture2D (tex, topLeft) + texture2D (tex, topRight)
                         + texture2D (tex, bottomLeft) + texture2D (tex, bottomRight));

Gaaahhh… For the millionth time on this thread: writing to intermediate framebuffers is NOT AN OPTION! I already had some great code that drew nicely anti-aliased polygons into a normal-sized framebuffer without shaders or anything fancy. That wasn’t the problem, it was the swapping between framebuffers that made it unusable.

What I need to speed it all up would be some trick that I could use to send a bunch of triangles directly to the screen, and have them drawn with anti-aliasing, straight into the target - NOT involving a f**king framebuffer!!

You can write to an FBO that has a renderbuffer attached, so you don’t have to swap them.
I don’t know what you’ve tried. Did you see this: http://www.songho.ca/opengl/gl_fbo.html
Their FBO demo code is actually measurably faster than the version without an FBO.

Anyway, in that case, what about MLAA or FXAA (see my previous post)?
The former applies AA to the final image without requiring you to do any AA rendering at all; the latter does AA per primitive.

Unfortunately I don’t have much spare time over the next few weeks, but I’d love to see what you came up with, and to try your experiments myself to see what the issue could be.

Yes, of course I understand how framebuffers work!!

The framebuffer stuff doesn’t appear slow, but when you’re drawing hundreds of paths per update, that means hundreds of framebuffer switches, and my app was spending almost all its CPU time sitting inside either glClear or the function that binds a new framebuffer, presumably flushing the pipeline before continuing. Other drivers may handle that situation better, but I’m using a top-end MacBook with a good Nvidia GPU, so if something doesn’t work here, it’s clearly not a workable design.

And yes, of course MLAA and FXAA were the first things I looked into, but they’re crappy quality, and unavailable on a lot of drivers. MLAA alone might be passable on high-end smartphones, simply because the resolutions are so high that it doesn’t matter, but I want something that works in general too.

That’s the part I don’t get. Why do you render to a different framebuffer, and not just to one, using a blend function?

[quote]That’s the part I don’t get. Why do you render to a different framebuffer, and not just to one, using a blend function?[/quote]

For what feels like the thousandth time, this is what I was doing:

  1. Bind the framebuffer
  2. Clear it
  3. Accumulate/jitter the polygon into this buffer to build up a (mono) anti-aliased mask
  4. Switch back to the original target context/screen
  5. Select the framebuffer as a texture, and use it as a mask to blend the appropriate colour/gradient/image/whatever onto the screen
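In rough GL terms it was something like this (a simplified sketch - the helper functions here are just stand-ins, not my actual code):

    // Simplified per-path sketch; drawPolygonWithCoverage and
    // drawFillThroughMask are hypothetical placeholders:
    glBindFramebuffer (GL_FRAMEBUFFER, maskFBO);            // 1. bind the mask FB
    glClear (GL_COLOR_BUFFER_BIT);                          // 2. clear it
    glEnable (GL_BLEND);
    glBlendFunc (GL_ONE, GL_ONE);                           // additive accumulation
    for (int i = 0; i < numJitterPasses; ++i)               // 3. jitter the polygon,
        drawPolygonWithCoverage (path, jitters[i],          //    each pass adding a
                                 1.0f / numJitterPasses);   //    slice of coverage
    glBindFramebuffer (GL_FRAMEBUFFER, 0);                  // 4. back to the screen
    glBindTexture (GL_TEXTURE_2D, maskTexture);             // 5. mask as a texture,
    drawFillThroughMask (fillColourOrGradient);             //    blending in the fill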

Here’s a very unorthodox idea that almost avoids swapping render targets:

  1. Have a screen buffer in the upper half of a texture, and a drawing area in the lower half.
  2. Do all the rendering work in the lower half of the texture (here you can even use multi/super-sampling)
  3. Copy the result of each render operation to the upper area.
  4. At the end of each frame, copy the upper part to the screen (either with one render-target swap per frame, or by reading the pixels back from video memory and handing them to the Juce software renderer).
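As a rough sketch, the per-operation copy in step 3 could use glCopyTexSubImage2D, which reads from the current framebuffer into the bound texture (whether that behaves well when the texture is the framebuffer’s own attachment is exactly the question raised below):

    // Hypothetical sketch: texture is (w, 2*h); the lower half is the scratch
    // area, attached to the FBO currently being rendered to.
    glBindTexture (GL_TEXTURE_2D, sharedTexture);
    glCopyTexSubImage2D (GL_TEXTURE_2D, 0,
                         x, y + h,   // destination offset: same spot, upper half
                         x, y,       // source: the region just drawn, lower half
                         regionW, regionH);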

I like your lateral thinking, but is it even possible to use a framebuffer as a texture while drawing onto itself!?

I also suspect that it’d still block the pipeline at some point, probably when clearing the temporary half, ready to start rendering the next polygon.

[quote=“jules”]

  1. Bind the framebuffer
  2. Clear it
  3. Accumulate/jitter the polygon into this buffer to build up a (mono) anti-aliased mask
  4. Switch back to the original target context/screen
  5. Select the framebuffer as a texture, and use it as a mask to blend the appropriate colour/gradient/image/whatever onto the screen[/quote]

Sorry to disappoint you.
But I still don’t get why you’re doing two passes where only one is needed. You should merge steps 3 and 5 into the same operation, and then there’s no need for step 4.
I’m probably repeating myself, but you should:

  1. Bind the framebuffer as an RGBA render texture/target that’s twice the final size
  2. Clear it
  3. Set a blend function, so that the multiple calls below actually accumulate via alpha modulation (or whatever is appropriate)
    – Graphics loop, for each polygon:
  4. Accumulate/jitter the polygon and perform the appropriate colour/gradient/image/whatever blending.
  5. Repeat step 4 until there are no more polygons. (Ideally use a VBO to send all the polygons at once and do the whole operation in a single pass.)
    – End of graphics loop
  6. Select the real render target and project the FB as a texture onto a quad the size of the viewport. Anti-aliasing-like behaviour comes from GL_LINEAR texture filtering.
    Obviously you want to disable mipmap generation, or you’ll lose the benefit of that last step.
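In rough GL terms (just a sketch - drawPolygonWithFill stands for whatever draws one polygon with its blending):

    glBindFramebuffer (GL_FRAMEBUFFER, oversizedFBO);    // twice the final size
    glClear (GL_COLOR_BUFFER_BIT);
    glEnable (GL_BLEND);
    glBlendFunc (GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);  // or whatever suits

    for (size_t i = 0; i < polygons.size(); ++i)         // no FB switch in here
        drawPolygonWithFill (polygons[i]);               // placeholder

    glBindFramebuffer (GL_FRAMEBUFFER, 0);               // one switch per frame
    glBindTexture (GL_TEXTURE_2D, oversizedTexture);
    glTexParameteri (GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR); // no mipmaps
    drawViewportSizedQuad();                             // the downscale gives the AA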

Also, if the above doesn’t work, you might be interested in multitexturing too. If you bind two FBs, one for the mask and the other for the image/colour/whatever, you can mix them in the final render FS in a single pass. That way you don’t have to switch FBs often, only twice per rendering loop.
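The final combining FS could be as simple as something like this (sampler names invented for the example):

    uniform sampler2D maskTex;   // FB 1: the coverage mask
    uniform sampler2D fillTex;   // FB 2: the colour/gradient/image
    varying vec2 uv;

    void main()
    {
        vec4 fill  = texture2D (fillTex, uv);
        float mask = texture2D (maskTex, uv).a;
        gl_FragColor = vec4 (fill.rgb, fill.a * mask);
    }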

Did you try using the Z-buffer or stencil buffer for storing the mask? That way you could probably avoid the FB selection too.

FFS!! For what must be the MILLIONTH time, let me say YET AGAIN:

THE PERFORMANCE IS CRAP WHEN YOU DRAW TO A FRAMEBUFFER AND THEN USE IT AS A TEXTURE!!!

Why do you keep on giving me suggestions that still involve doing EXACTLY THAT!!?? It doesn’t make any difference how big the buffer is or what blend function you use - it’s the fact that a framebuffer is involved AT ALL that’s the problem!!

I already use multitexturing! And switching framebuffers ONCE per polygon is too much!

My current implementation is really cunning in that it uses multitexturing with no framebuffers to do all the gradients and image fills, but it does still require the polygon mask to be supplied as a texture.

Now that’s more like the kind of useful suggestion I was hoping for!

Either of those might potentially work, but I think the only way they could be used would be if it’s possible to write a fragment shader that uses their content in a non-standard way, treating it as an alpha channel. But I don’t know enough about what fragment shaders can do to say whether that’s even possible.

Maybe it could be done in two passes -

  1. Accumulate/jitter the polygon using a custom shader that only writes to the stencil or Z buffer, treating it like an alpha channel and building up a decent mask.
  2. Use a second custom shader that treats the stencil/Z value as the alpha when blending a quad that contains the fill colour or gradient.

But I’ve never written a shader, so don’t know what’s possible with that kind of thing…

(And of course, the target context may not have an 8-bit stencil or depth buffer…)

Ok, I get it. That’s the opposite of what I’m measuring here on my PC, and of the example code I linked from http://www.songho.ca
The difference I see between your description and my experience is that I wouldn’t use one FB per polygon, but one for all polygons - and not one switch per polygon, but one per rendering loop.
I’d use textures as much as possible for passing the image/colour/gradient/whatever, to minimise the number of rendering stages.

Anyway, you’ve probably tested this on Android or iOS, which I haven’t, so I’ll take your word for it on those platforms.

Back to the Z-buffer or stencil idea: you can have a look at gl_FragDepth, and at the gl_Color / gl_SecondaryColor fragment shader inputs, and at how to write them.
Concerning the stencil buffer, I don’t know how to write to it without using any FB (sorry). You’ll probably have to do multipass rendering (rendering to the stencil buffer once for all your polygons), but I’ve never done it myself.
The songho link has an example of that which you might find interesting.
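Untested, but the two fragment shaders might look roughly like this (note the depth buffer doesn’t blend, so accumulating jittered coverage this way would need more thought than the sketch shows):

    // Pass 1 FS: write coverage into the depth buffer (colour writes disabled
    // via glColorMask). Depth writes replace rather than blend, which is the
    // unsolved part of accumulating a jittered mask here:
    uniform float coverage;
    void main()
    {
        gl_FragDepth = coverage;
    }

    // Pass 2 FS: bind the depth attachment as a texture, treat it as alpha:
    uniform sampler2D depthTex;
    uniform vec4 fillColour;
    varying vec2 uv;
    void main()
    {
        float alpha = texture2D (depthTex, uv).r;  // depth comes back in .r
        gl_FragColor = vec4 (fillColour.rgb, fillColour.a * alpha);
    }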

I was only using a single FB, but it needs to be re-used for each path or edge-table, so that means hundreds of switches for a typical window of content.

And like I said, this is all on my MacBook with a good Nvidia graphics card. I didn’t get as far as Android/iOS, but they’re unlikely to be better. I’ll have a look at the fragment shader stuff - I should probably learn how to do that anyway.

OK, now I understand why you’re switching so often. Correct me if I’m wrong: for each path you draw, you select the mask FB to draw the mask, then select the main target to draw the final path. You have a ‘rendering loop’ per component.
In that case, I think the issue isn’t in OpenGL at all.
Starting a render in OpenGL is a long process that’s usually deferred, so you don’t normally see its cost. But in your case the driver has so much rendering to do that you can now see the pipeline stalls.

The good news is that you probably don’t need to throw away what you’ve done. The bad news is that you’ll need to change the way the code works, so that you accumulate all the graphics operations in a stack (ideally hierarchical), and do a single render for all the components at once. No more clipping, no areas to invalidate.
One major point about OpenGL is that it’s scene-based, and it’s optimised for that use case. Using it for canvas-like calls is inefficient, as you’re seeing.

It might seem like a ‘no true Scotsman’ fallacy, but rewriting the graphics code so that all operations are accumulated in a ‘stack’ might actually improve performance on all platforms, including 2D ones, because you should be able to detect a bunch of similar operations (90% of a typical GUI is similar operations) and use a cached version instead.

Yes! But it’s not a rendering loop per component, it’s a rendering loop for every path/edge-table drawing.

No shit!

Yeah, that’d be either impossible or massively difficult, and wouldn’t make any difference to the problem I was asking about, which is how best to draw these damned polygons!

Canvas drawing is by its nature dependent on the order in which operations are performed, so you can’t just batch together similar operations, as they might overlap intermediate results.

I don’t actually think there’s any reason why GL couldn’t do a really efficient job of this, it’s just a matter of finding the best way to feed it data so that it can process it without getting its pipeline blocked.

I think he’s not talking about batching together similar operations, but about accumulating all the operations, using their descriptions, in a stack, and then processing the whole stack in one big batch.

If I remember correctly, this is what’s done in nui: http://www.libnui.net

What do you all mean by “stack”?? There’s no magic secret “stack” I could use in OpenGL. And when you’ve got code that’s making hundreds of calls to draw a bunch of random, overlapping, semi-transparent paths, edge-tables, etc, it’s simply not practical to remember all of that, sort it into some kind of “stack” (whatever that means…) and then process it all at once. And even if you did, how much of it could be optimised? My guess is not very much at all, because the order in which operations happen can’t be changed.

And AFAICT by looking at their source code, the nui renderer doesn’t do anything special, it just avoids the whole AA problem by relying on hardware multisampling if it’s available. If I had taken that route, I wouldn’t be seeing any performance problems either, but it’d all look a bit crap. I just think it’s possible by being a bit more cunning to have the best of both worlds - make it look good and draw fast too.
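(The multisampling route really is that simple, which is presumably why they took it - you request a multisampled pixel format when creating the context, and then all it takes is:)

    glEnable (GL_MULTISAMPLE);  // everything drawn from here on is multisampled,
                                // if the context was created with sample buffers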

The stack isn’t an OpenGL trick per se.
It’s just a C++ stack where you store the operations, in order, so you can do one big drawing pass at the end.

[quote=“otristan”]The stack isn’t an OpenGL trick per se.
It’s just a C++ stack where you store the operations, in order, so you can do one big drawing pass at the end.[/quote]

Yes, I assumed that’s what you meant, but you’re all really not grasping the problem!

a) it’d be very difficult and CPU-intensive to keep one of these lists. E.g. what happens if someone draws an image, then changes the image, and draws it again, 100 times…? Does the stack keep 100 copies of the image in all its states? What if they draw 100 paths, does it keep copies of all those 100 path objects? Maintaining this stack would probably involve almost as much work as drawing it!

b) It wouldn’t help! The order of the drawing operations cannot be changed because they may be layered, so it would only be possible to batch together consecutive operations which use exactly the same paths/gradient/image etc. That kind of optimisation can already be achieved very simply by caching the last texture, etc, and I already do a lot of that!

Even if before starting to render, there was a perfect list of all the operations that needed to happen, it’d still be impossible to convert that list into a sequence that would avoid the issues I’ve been talking about… It really wouldn’t help at all!

For example, let’s say you have a component hierarchy like this:
TopLevelWindow (with close button) - ignored by the OpenGL renderer
ContentComponent
|–> A Label
|–> Some image
|–> A drawable

You’ll probably have to do something like this:
ContentComponent::draw() => Append a new ClipRegionOperation(ContentComponent) to the internal OGL stack
The image draw call => Append an ImageDrawCommand to the ClipRegionOperation’s stack, remembering the parameters (for example in a hash)
Drawing the drawable => Append one or more PathDrawCommands to the ClipRegionOperation’s stack, remembering the parameters (for example in a hash)
Then Label::draw() is called => Append a new ClipRegionOperation(Label) to the component’s ClipRegionOperation stack
Drawing the text => Append the multiple PathDrawCommands to that ClipRegionOperation’s stack, remembering the parameters (for example in a hash)

Then, and this is the interesting idea:

  • Create 2 textures, or 1 texture + a stencil buffer
  • Walk down the stack.
  • Each time you find a ClipRegionOperation, set up the GL viewport to match the clip area, then issue the path drawing operations to the application-level textures (that part with no acceleration whatsoever).
  • When you come back from a child’s stack operations, restore the previous clip area.
  • When you’re done with all the stack operations, use a final shader that does the masking from your two textures.

To accelerate this, you can decide to store 2 textures per ClipRegionOperation, in order to:

  • Avoid re-rendering that part (use the cached textures instead, copying them with something like glCopyTexSubImage2D rather than rendering again)
  • Optimise the number of GL calls and pipeline stalls

Whenever the graphics code changes, the hash will change, and you’ll have to re-render that part.
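In C++ terms, a sketch of the stack might look like this (all the class names here are just invented to match the description above):

    #include <cstdint>
    #include <memory>
    #include <vector>

    struct Rect { int x, y, w, h; };

    // Hypothetical command types matching the description above:
    struct DrawCommand
    {
        virtual ~DrawCommand() {}
        virtual void replay() = 0;
        uint64_t parameterHash;    // change-detection for the cached textures
    };

    struct PathDrawCommand : public DrawCommand
    {
        // ...path + fill parameters stored here...
        void replay() { /* issue the GL calls for this path */ }
    };

    struct ClipRegionOperation : public DrawCommand
    {
        Rect clipArea;
        std::vector<std::unique_ptr<DrawCommand>> children;  // in draw order

        void replay()
        {
            // set the GL viewport/scissor to clipArea...
            for (size_t i = 0; i < children.size(); ++i)
                children[i]->replay();
            // ...then restore the parent's clip area
        }
    };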

Is that clearer now?

Aghh… Complexity overload! Thanks, but I really don’t think that’s a realistic solution!

The thing is, I’ve already got almost all of this completely nailed - it doesn’t need a complete redesign. The only problem I’ve got is finding a non-stalling way to get an AA polygon over to the GPU for use as a texture. No matter how cleverly you restructure things, or how much you try to cache, that same problem will always come up and need to be solved. But if I can figure it out, then what I’ve got will already work, extremely well!