I have just managed to build my project on my new Mac mini M4 Pro, and in Release I find the drawing to be incredibly slow: my RTA and meter display run at something like 3 FPS. I understand that on ARM-based Macs, Metal should be used for acceleration instead of OpenGL, but I can’t find any documentation on how to achieve this. Any help would be appreciated.
Have you tried running your project under a time profiler to find out where your program is actually spending its processing time? There’s a time profiler included in Instruments, which ships as part of Xcode.
It should definitely be possible to draw some smooth audio meters without needing to use Metal or OpenGL, assuming you’re not applying fancy effects like blurs or shadows to the meters themselves. My initial thought is that your program may be doing more work than is necessary, e.g. redrawing the entire UI rather than just the meters. There’s also a chance that the slowdown is coming from somewhere else entirely.
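For instance, something along these lines keeps the repaints confined to the meter itself rather than invalidating the whole editor. This is only a minimal sketch; MeterComponent, getCurrentLevel and the 30 fps rate are illustrative, not taken from your project:

```cpp
#include <juce_gui_basics/juce_gui_basics.h>

// A meter that repaints only itself at ~30 fps. Nothing else in the UI
// is invalidated when the level changes.
class MeterComponent : public juce::Component,
                       private juce::Timer
{
public:
    MeterComponent() { startTimerHz (30); }

    void paint (juce::Graphics& g) override
    {
        g.setColour (juce::Colours::green);
        g.fillRect (getLocalBounds().toFloat().withWidth ((float) getWidth() * level));
    }

private:
    void timerCallback() override
    {
        level = getCurrentLevel(); // hypothetical: fetch the latest peak value
        repaint();                 // invalidates only this component's bounds
    }

    float getCurrentLevel() const { return 0.5f; } // placeholder
    float level = 0.0f;
};
```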
The app is not doing anything too fancy, just what you would see in any EQ plugin like the FabFilter ones: it draws EQ curves, a real-time RTA curve (some gradients are used here), and a multi-channel peak meter. It should be redrawn 30 times a second when there is something to display. And it is very fast on an Intel Mac with an old Vega 64 graphics card; there it is completely fluid. I have already spent quite a lot of time profiling it, and removed a lot of unnecessary drawing routines, if not all of them. Currently it only builds if I link with juce::juce_opengl, because I have some drawing modules that display 3D content, so there are references to things like GLuint in the code. Do I have to get rid of the juce_opengl library?
I shouldn’t think so; OpenGL can still run pretty quickly on ARM Macs. I’m surprised that the performance is better on the Intel Mac. Are you keeping all other variables the same, such as building in Release mode on both platforms and using the same audio block size and sample rate? Are you using the same display (or at least the same display density) in both cases? If the M4 machine is driving a high-resolution display and the Intel machine is not, the M4 might end up processing 4x the number of pixels, or more.
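If you want to rule that out quickly, you could log the scale of the display the editor is actually on. A minimal sketch, assuming it runs inside one of your juce::Component member functions:

```cpp
// Log the scale factor of the display this component is currently on.
if (auto* display = juce::Desktop::getInstance().getDisplays()
                                                .getDisplayForRect (getScreenBounds()))
    DBG ("Display scale: " << display->scale); // typically 2.0 on a Retina screen
```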
With some help from Copilot I added some optimization arguments to CMake. That improved things a little, but it is still far from acceptable: drawing is now smooth in a normal-sized window, but if I make the window full screen it is still laggy. I can hide my UI via a taskbar menu option, which skips all drawing-related calls, and when I do that my CPU usage is around 13.4%, which is perfectly in line with what I was expecting from the M4 CPU. My Intel machine was around 40-50% without the UI and around 80-90% with it. On the M4, however, if I turn the UI on, the CPU usage skyrockets to 140%, which suggests there is no GPU acceleration involved in drawing at all and everything is done by the CPU cores.
So can you confirm that there is no Metal acceleration in JUCE? That bit was not clear to me from your answers.
To answer your other questions: everything else about the display settings is the same, but if we have to start looking at things like screen resolution, that also makes me think we are only trying to improve CPU drawing. For a GPU this should not matter.
It’s complicated. JUCE doesn’t currently have a way to talk to Metal directly. There’s no Metal-backed Graphics context, but there is a CoreGraphics-backed context which tends to be faster than the software-only renderer. CoreGraphics may use Metal under the hood, but that’s an implementation detail. I believe that OpenGL is implemented as a thin layer over Metal on Arm macs, too, so if you use an OpenGL-backed context then this will end up executing Metal functions internally - but again, that’s an implementation detail. In short, I’d expect the OpenGL and CoreGraphics contexts to both receive some level of hardware acceleration that will normally result in improved performance over JUCE’s software-only renderer.
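For reference, opting in and out of those contexts looks roughly like this (a sketch; MyEditor is a placeholder name). With no context attached, a component’s paint() goes through the default peer, which is CoreGraphics-backed on macOS; attaching a juce::OpenGLContext reroutes the same paint() calls through the OpenGL renderer instead:

```cpp
#include <juce_opengl/juce_opengl.h>

class MyEditor : public juce::Component
{
public:
    MyEditor()
    {
        // Opt in to the OpenGL-backed context; omit this to stay on CoreGraphics.
        glContext.attachTo (*this);
        glContext.setContinuousRepainting (false); // only redraw when repaint() is called
    }

    ~MyEditor() override { glContext.detach(); }

private:
    juce::OpenGLContext glContext;
};
```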
Not necessarily. If you’re using Image objects to store the results of drawing operations, this may incur performance penalties from having to flush GPU textures into main memory. If the image sizes are based on the size and resolution of the interface, then using a large high-resolution display could cause you to exhaust the available bandwidth to transfer all of this image data, resulting in slowdowns. In other words, the program might be using predominantly hardware-accelerated drawing but still be bottlenecked by the CPU or by memory bandwidth for some reason.
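To illustrate the pattern I mean (a sketch with illustrative names, not a claim about your code): a component that caches its expensive drawing into a juce::Image and blits it in paint(). The blit is usually cheap, but the cost of keeping that image in sync scales with the pixel count:

```cpp
#include <juce_gui_basics/juce_gui_basics.h>

// Caches expensive drawing into an Image and blits it in paint().
class CachedView : public juce::Component
{
public:
    void resized() override
    {
        if (getWidth() <= 0 || getHeight() <= 0)
            return;

        const float scale = 2.0f; // assumed Retina factor; query the display in real code

        cache = juce::Image (juce::Image::ARGB,
                             juce::roundToInt ((float) getWidth()  * scale),
                             juce::roundToInt ((float) getHeight() * scale),
                             true);

        juce::Graphics g (cache);
        g.addTransform (juce::AffineTransform::scale (scale));
        drawExpensiveBackground (g);
    }

    void paint (juce::Graphics& g) override
    {
        // Usually a cheap blit, but the transfer cost scales with the pixel count.
        g.drawImage (cache, getLocalBounds().toFloat());
    }

private:
    void drawExpensiveBackground (juce::Graphics& g)
    {
        g.fillAll (juce::Colours::darkgrey); // placeholder for the real drawing
    }

    juce::Image cache;
};
```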
As I mentioned previously, the best way to track down performance issues is to use a profiler. Removing unnecessary function calls will help, but if the performance is still unacceptable then you can use the profiler to determine where the program is spending its time. Then you’ll know where to focus your efforts.
Are your Intel and Arm machines both running the same version of macOS? If not, what versions are they each running? Some newer versions have a known issue where the invalid/dirty region of the screen is misreported, causing the entire window to be invalidated on each frame, which may result in worse performance than macOS versions that correctly report the dirty screen region. This is only a guess, so again, I recommend profiling to work out what’s really going on.
A bit of an update on this issue: I abandoned the drawing work for a while, and now I am looking at it again. I have dug deeper, and I can now confirm that OpenGL is used when painting, but unfortunately this did not speed up my drawing. I am even using the renderOpenGL() function instead of paint(), but that isn’t helping either.
The problem is that the Path class still uses the CPU when rasterising paths, and the larger the screen area, the slower rasterising becomes. I have no clue why this is only a problem on Apple silicon, though, or why it is very fast and smooth on a seven-year-old Intel processor; my M4 Pro should be running circles around my Cascade Lake Core i9. I assume the Path class is using the CPU on Intel platforms too. My requirements aren’t even that crazy: I want to paint a frequency spectrum graph using Path, with 300 points horizontally, so that’s 300 lines that need to be calculated and rasterised 30 times a second. I also want it to be a filled, closed polygon, and I apply smoothing to it using the createPathWithRoundedCorners call. I can clearly see that if I make my render window smaller the drawing speeds up and becomes smooth, and if I maximise the window it slows down very badly, as the area for rasterisation increases.
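For context, the path I am building looks roughly like this (a simplified sketch; buildSpectrumPath and the corner radius are illustrative, not my exact code):

```cpp
#include <juce_graphics/juce_graphics.h>
#include <vector>

// Builds a filled, closed spectrum polygon from ~300 magnitude values (0..1).
juce::Path buildSpectrumPath (const std::vector<float>& magnitudes,
                              juce::Rectangle<float> area)
{
    jassert (magnitudes.size() >= 2);

    juce::Path p;
    p.startNewSubPath (area.getX(), area.getBottom());

    const auto n = (int) magnitudes.size();

    for (int i = 0; i < n; ++i)
    {
        const auto x = area.getX() + area.getWidth() * (float) i / (float) (n - 1);
        const auto y = area.getBottom() - area.getHeight() * magnitudes[(size_t) i];
        p.lineTo (x, y);
    }

    p.lineTo (area.getRight(), area.getBottom());
    p.closeSubPath();

    // Note: this produces a second path with extra curve segments, so the
    // smoothed version is even more geometry to rasterise each frame.
    return p.createPathWithRoundedCorners (4.0f);
}
```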
Have you tried building with the JUCE_ENABLE_REPAINT_DEBUGGING macro enabled? This will quickly show whether your plug-in is redrawing portions of the screen that should not be redrawn (i.e. those portions are not changing).
Thanks for the idea, I didn’t know about this macro and I will definitely try it. Currently it isn’t doing anything, probably because I am already using renderOpenGL() instead of paint(). Right now I am adding some instrumentation to spot the worst offenders. This still doesn’t explain why it was running smoothly on Intel, though. The main logic of the application and the drawing routines did not change, and I have already gone through an optimisation and profiling session where I made sure that nothing unnecessary is being called. It was running at around 40% CPU on Intel, including audio processing, filtering, and drawing. Now with my M4 Pro I am at around 140%.
IIRC JUCE passes the path information to Core Graphics, and Core Graphics decides how to rasterize paths. You could look at a simple project that I am developing; it can draw two FFT analyzer results (each about 250 points) at 60 Hz. See GitHub - ZL-Audio/ZLSplitter: splitter plugin
Thanks for your reply. I am not sure what you are talking about - your code does not mention OpenGL at all. You are just using juce::Path, with juce::Path::startNewSubPath and juce::Path::lineTo.
If you are only drawing a frequency spectrum or several filter responses, juce::Path would be more than enough (might be a bit problematic on Linux).
If you want to use OpenGL (for complex gradient or 3D rendering), I would recommend writing GLSL directly instead of playing with juce::Path.
And if you do have some spare time, GitHub - VitalAudio/visage: C++ UI library meets creative coding might also be worth a try.
Thanks for the suggestions, I will definitely take a look. You see, my issue is that I am trying to paint a 303-sided closed polygon 30 times a second, which should not be a problem for any modern CPU even at high resolutions - that’s around 600 triangles if converted to GPU drawing. OK, I am using filled polygons with transparency, but if it were really using the GPU this should not be an issue. On the Intel platform my Path drawing uses 40% CPU (per core); on the M4 Pro it uses 140%. And my M4 Pro’s single-core Cinebench result is more than twice my Intel CPU’s, while my Radeon Vega 64 has about 70% of my M4 Pro’s GPU power. On Intel, JUCE could effortlessly draw the whole shebang with OpenGL. Now if I make my view full screen, the FPS goes down to like 2. I do use a high-DPI screen, so my actual resolution is twice what is displayed - my screen is in fact 5K - but I can run the Heaven benchmark in parallel with my application, and it doesn’t break a sweat while my app struggles.
@zsliu98 is right. Instead of forcing the use of hardware acceleration by using OpenGL components, DON’T use OpenGL and let Core Graphics take care of the hardware acceleration.
You will get better performance. We draw more complex FFTs with more points, and have no problems even at 60 fps.
I’ve been using compute shaders more and more for graphics stuff recently. The final rendering is still done by the native APIs but you offload a lot of the calculations to a compute shader. Obviously depends on what it is you’re drawing, but for things like paths where you might have thousands of points, it’s often a lot more performant to have the GPU do the calculations rather than the CPU.
E.g. I have a spectrum analyser that still uses juce::Path and is drawn with juce::Graphics, but the points of the path are computed on the GPU.
What API do you use for the compute shaders, just OpenGL?
How are you calculating the path points?
Matt
I calculate the FFT and the corresponding juce::Path on a background thread and swap it with the actually-painted path under a lock/try-lock. How do you sync the paths back to the message thread? I’d love to know which is more suitable for such a task: a background thread or compute shaders?
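In case it’s useful to others, the swap I described looks roughly like this (a sketch; SpectrumView and the member names are illustrative):

```cpp
#include <juce_gui_basics/juce_gui_basics.h>

// Builds the path on a background thread, swaps it under a try-lock so
// paint() never blocks on the producer.
class SpectrumView : public juce::Component
{
public:
    // Called from the background thread once a new path is ready:
    void setPath (juce::Path newPath)
    {
        const juce::SpinLock::ScopedLockType lock (pathLock);
        pendingPath = std::move (newPath);
        hasPending = true;
    }

    void paint (juce::Graphics& g) override
    {
        {
            // try-lock: if the background thread is mid-swap, just reuse
            // the previous frame's path instead of blocking the UI.
            const juce::SpinLock::ScopedTryLockType tryLock (pathLock);

            if (tryLock.isLocked() && hasPending)
            {
                paintedPath.swapWithPath (pendingPath);
                hasPending = false;
            }
        }

        g.setColour (juce::Colours::white);
        g.fillPath (paintedPath);
    }

private:
    juce::SpinLock pathLock;
    juce::Path pendingPath, paintedPath;
    bool hasPending = false;
};
```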
This. You guys were right all along. I finally got it working the way you have been suggesting.
I read your comment and put on my most skeptical face - because, you know, Intel was fast, etc.
Then I commented out every bit of code that referenced OpenGL or OpenGLContext in any way, and moved my painting back to paint(). And now it just works: it is fast and smooth, the way it should be.
I guess I’ve had that bit of code in my component constructor that attaches to an OpenGL context all along, since I moved to the M4 from Intel, and while I thought it was giving me an advantage, it was in fact preventing JUCE from doing its thing.
Thanks for the help guys.
No, I’m using Metal on macOS, and starting to look at Direct3D/HLSL for Windows. I haven’t really looked deeply into OpenGL compute shaders, but some quick Googling suggested that using the native APIs would be best.
I pass a buffer of input samples into the compute shader, and generate a buffer of frequency responses using a DFT. Then just iterate over that buffer of responses to build the path.
An FFT doesn’t make sense in a compute shader, since it has to be calculated in series: it’s a recursive algorithm, with each stage depending on the previous one. A DFT is not only much simpler, but each output bin can also be calculated in parallel, because every bin depends only on the input buffer.
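For anyone curious, here’s the same maths expressed on the CPU (a sketch; on the GPU, each iteration of the outer loop maps to one compute-shader thread):

```cpp
#include <cmath>
#include <complex>
#include <vector>

// Naive DFT: each output bin k is an independent sum over the input buffer,
// which is exactly what makes it easy to parallelise.
std::vector<float> dftMagnitudes (const std::vector<float>& x, int numBins)
{
    const auto n = (int) x.size();
    std::vector<float> mags ((size_t) numBins);

    for (int k = 0; k < numBins; ++k) // each iteration is independent of the others
    {
        std::complex<float> sum { 0.0f, 0.0f };

        for (int i = 0; i < n; ++i)
        {
            const auto phase = -2.0f * 3.14159265f * (float) k * (float) i / (float) n;
            sum += x[(size_t) i] * std::complex<float> (std::cos (phase), std::sin (phase));
        }

        mags[(size_t) k] = std::abs (sum);
    }

    return mags;
}
```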
The way I have it set up, the message thread is blocked and waits for the compute shader to finish. From the perspective of the call site you wouldn’t know a compute shader’s involved as the message thread hardly has to wait any time at all for the shader to complete.
Using a background thread is good if you just want to unblock the message thread, but still puts all the load on the CPU. I went for the compute shader approach because I wanted to minimise CPU usage.