Faster Blur? || Glassmorphism UI

OK, so here is the latest version of our stack shadows:

The main differences from my previously linked version:

  • Improved efficiency by rendering the shadow into a temporary juce::Image
  • Improved inner shadow compatibility with various design tools
  • Fixed some issues with the spread parameter

I think this, in combination with the vectorisation performance improvements from @sudara, is getting us pretty close to something usable!

6 Likes

Great! Thanks for sharing!

I’m dying to get my module out, but Intel IPP is currently letting me down. I’m seeing up to a 2x speedup when using the equivalent of Apple’s SepConvolve (IPP’s FilterSeparable), but unfortunately performance degrades to worse than baseline once radii go above 10.

I have a few other implementations I made while learning how to implement StackBlur. On Windows, the most promising is a row-by-row / column-by-column vector implementation that provides a consistent 4-5x speedup (at least on my AMD Ryzen 9). However, that relies on my IPP vector library wrapper, and I’d prefer not to add that as a dependency for the module, so I’d need to rewrite it to be standalone for Windows only.

I’ll do a bit more spelunking on IPP as well. There might be a better option there…

I will reiterate that caching shadows (holding a copy of the shadow’s juce::Image for the next paint call) is probably the #1 most performance-friendly approach when working with a lot of shadows. So even if I can’t get my high-bar mythical 10x speedup for blurs/shadows on Windows, it will still make my module worth publishing/using on Windows.
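For readers who want to try the caching approach, here is a minimal, JUCE-free sketch of the idea. Everything here is hypothetical (the `ShadowCache` struct, `renderCount`, and the placeholder "blur"); in a real component the cached value would be a `juce::Image` invalidated on resize or when the shadow parameters change:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a rendered shadow; in JUCE this would hold a juce::Image.
struct ShadowCache
{
    std::vector<float> cached;    // last rendered shadow
    int cachedRadius = -1;        // cache key: blur radius
    std::size_t cachedSize = 0;   // cache key: pixel count
    int renderCount = 0;          // how many times we actually re-rendered

    // Returns the cached shadow, re-rendering only when a key changes.
    const std::vector<float>& get (std::size_t numPixels, int radius)
    {
        if (radius != cachedRadius || numPixels != cachedSize)
        {
            // Placeholder for the expensive blur/render step.
            cached.assign (numPixels, (float) radius);
            cachedRadius = radius;
            cachedSize = numPixels;
            ++renderCount;
        }
        return cached;
    }
};
```

With this shape, repeated paint calls with unchanged bounds and radius never touch the blur at all; only the first paint (or a parameter change) pays for it.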

4 Likes

If only we could guarantee Metal on macOS, we could bypass all the pain and utilise simple shaders to do all these effects.

I think there are also developers making plugins for Windows

1 Like

I’m primarily a Windows dev. But it would be nice to have hardware rendering enabled for all formats, because I don’t feel like supporting OpenGL any more given Apple’s hatred of all things third party. Which is a shame, as I love shader coding! But that’s another story… :slightly_smiling_face:

1 Like

Lol, I misinterpreted your comment, I read it as coming from a Mac centered world view :face_with_hand_over_mouth:

1 Like

Haha. Yeah man, that’s easily done on the JUCE forum.

2 Likes

I just gave this a shot, works well!

I’m using it on an animated stroke path on my analyzer… there’s a bit of a slowdown but not too bad at all.

(screen recording attachment: ScreenRecorderProject98_2)

1 Like

Finally managed to get the blur module done! (cross-posting here for future readers)

I also wrote quite a bit about how Stack Blur works and thinking through how to make it faster:

8 Likes

So IPP is only used for a few vector operations here? Like adding two vectors, multiplying them, etc.? If that is the case, I’m sure we can come up with a native solution for x64 and NEON without much trouble, to reduce dependencies on these extremely heavy libraries.

Yes, the current IPP implementation just uses zero, copy, convert uint8 to float, add, subtract, divide, and an equivalent of addWithMultiply.

IPP isn’t required. The single-channel fallback is a FloatVectorOperations implementation (literally just replacing the calls to IPP). On my PC it performs slightly better than Gin, but with caching it’s of course just as fast as any other implementation on repaint.

I’d be interested in a SIMD implementation, it would also be useful for pre-macOS 11.0 / iOS 14.0.

We’ll get on that soon. One of our team (@reFXmkamradt) is quite the optimizer. He’ll probably enjoy it :smiley:

1 Like

:slight_smile: I have just glanced over it but I fear it is not going to be so easy. If the “ipp_vector” version is really fast and the “float_vector_stack_blur” version is a lot slower, then this makes me wonder. JUCE’s FloatVectorOperations do use SIMD instructions and for such trivial tasks as “add” or “divide” it should not make too much of a difference.

One thing comes to mind that IPP does and that is to dynamically branch into different implementations, depending on your actual CPU. Maybe on your machine it uses AVX512 for its operations which would reduce loads, stores and register pressure. If that is what makes it a lot faster, we would also need to offer multiple implementations that are dynamically selected on the end-users machine.
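For readers unfamiliar with that pattern, a minimal sketch of runtime dispatch looks like this. The kernels and `pickAdd` are hypothetical placeholders; a real build would compile the wide kernel in its own translation unit with AVX/AVX-512 flags, and IPP does something similar internally:

```cpp
#include <cstddef>

// Hypothetical scalar kernel.
static void addScalar (float* dst, const float* src, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        dst[i] += src[i];
}

// Hypothetical "wide" kernel. Same math here as a placeholder; in a real
// build this would live in a translation unit compiled with AVX enabled
// and use intrinsics.
static void addWide (float* dst, const float* src, std::size_t n)
{
    addScalar (dst, src, n);
}

using AddFn = void (*) (float*, const float*, std::size_t);

// Pick an implementation once, based on what the end-user's CPU supports.
static AddFn pickAdd()
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    if (__builtin_cpu_supports ("avx2"))
        return addWide;
#endif
    return addScalar;
}

static const AddFn addVectors = pickAdd();
```

The one-time selection means the per-call cost is just an indirect call, and the same binary behaves sensibly on both old and new machines.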

I can try to set up a test case and look into it a bit further.

1 Like

Thanks for taking a look!

Yes, that’s pretty much the reasoning behind why I didn’t jump into trying to make a SIMD version of that particular algorithm. That “flavor” of the vector algorithm is a bit strange, both in that it converts to floats and in that it subjects a whole row (or column) to a series of vector operations via the vendor libs.

A better fit for SIMD is probably going back to the “naive” version, doing the calculation for 8/16 pixels at once (u16) and using something like libdivide for division. I think there’s a SIMD implementation like this floating around the ether too…

But! Thanks to the power of the internet, someone on twitter with an anime profile pic contacted me with exactly what I was looking for this whole time — an alternate take on the actual math behind the algorithm. I had the intuition that the math could be simplified (but not the skills to do it).

I’ve spent some time with it and it looks realllly promising: we’d be able to go row by row, col by col, generating 2 sets of “prefix sums”, and then just produce the final pixel component with 2 additions, 1 subtraction and 1 division. It reduces branching, eliminates having to manage the “queue”, and everything stays in very nice tight loops. It should also have the same stable performance profile across radii that stack blur has, but hopefully blow it out of the water?

I’m cooking up a naive version of that approach to see if it’ll pass correctness tests. I wasn’t totally sure if the guy from twitter was going to work on a SIMD version, but he sent this link over, which talks about how to parallelize the “prefix sums” (accumulating sums): Prefix Sum with SIMD - Algorithmica. I’ll ping you when I get a naive version passing tests, if you are interested!
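For anyone following along, here is a sketch of the prefix-sum idea for the simpler box-kernel case (the function name and edge handling are my own, not from the actual module). Stack blur’s triangular kernel needs a second prefix sum on top of this, but the access pattern is the same: build the sums once, then each output pixel is a couple of adds, a subtract and a divide. The edges use a variable divisor (the actual window size) instead of padding:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// 1-D horizontal box blur of one row of single-channel pixels via a prefix sum.
std::vector<uint8_t> boxBlurRow (const std::vector<uint8_t>& row, int radius)
{
    const int n = (int) row.size();

    // prefix[k] = sum of row[0..k-1]; built once per row, in a tight loop.
    std::vector<uint32_t> prefix (n + 1, 0);
    for (int i = 0; i < n; ++i)
        prefix[i + 1] = prefix[i] + row[i];

    std::vector<uint8_t> out (n);
    for (int i = 0; i < n; ++i)
    {
        // Clamp the window at the image edges and divide by the actual
        // number of pixels in the window (the "variable divisor").
        const int lo = std::max (i - radius, 0);
        const int hi = std::min (i + radius + 1, n);
        out[i] = (uint8_t) ((prefix[hi] - prefix[lo]) / (uint32_t) (hi - lo));
    }
    return out;
}
```

Note how there is no queue to maintain and no per-pixel branching beyond the clamps, which is what makes this shape attractive for SIMD.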

3 Likes

The other thing re: SIMD that’s nagging at me is the ARGB version of the algorithm. With StackBlur + caching, everything is pretty much happy “enough” outside of big animating dropshadows. But all ARGB algorithms I’ve tried sort of end up performing worse than 4x a single channel — and blurring something like a big ole 1024x1024 ARGB picture is just never reasonable.

I keep wondering if a good approach would be to somehow handle the 4-byte ARGB chunk at once, or how to treat the ARGB version differently than the single channel. Been flipping through this to try to find some inspiration: Software optimization resources. C++ and assembly. Windows, Linux, BSD, Mac OS X
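One common way to treat ARGB differently from single channel is to go planar: split the packed pixels into four single-channel planes, run each through the fast single-channel path, then re-interleave. A rough sketch (the names are hypothetical, and the two extra copies obviously cost memory bandwidth, so it is not free):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Planar representation of a packed ARGB image: one plane per channel.
struct Planes { std::vector<uint8_t> a, r, g, b; };

// Split packed 0xAARRGGBB pixels into four planes so each channel can run
// through the fast single-channel blur path individually.
Planes deinterleave (const std::vector<uint32_t>& argb)
{
    Planes p;
    for (uint32_t px : argb)
    {
        p.a.push_back ((uint8_t) (px >> 24));
        p.r.push_back ((uint8_t) (px >> 16));
        p.g.push_back ((uint8_t) (px >> 8));
        p.b.push_back ((uint8_t) px);
    }
    return p;
}

// Pack the four planes back into ARGB pixels after blurring.
std::vector<uint32_t> interleave (const Planes& p)
{
    std::vector<uint32_t> out;
    for (std::size_t i = 0; i < p.a.size(); ++i)
        out.push_back (((uint32_t) p.a[i] << 24) | ((uint32_t) p.r[i] << 16)
                     | ((uint32_t) p.g[i] << 8)  |  (uint32_t) p.b[i]);
    return out;
}
```

The alternative is to keep the pixels interleaved and process the 4-byte chunk with SIMD lanes directly, which avoids the copies but complicates the kernel.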

OK, let me know as soon as you have tried the new approach. I set up a test case (single channel, 512x512) and checked “ipp” vs “juceFloatVector” vs “gin”, and I really did not see improvements over gin’s implementation as big as the ones mentioned in your benchmarks: “ipp” is the fastest, “gin” is about 1.5x slower, and the “juce” version is about 1.7x slower and produced buggy results.

1 Like

Interesting, was it similar for you across radii? Even though the FloatVector implementation probably won’t be in use long, I’d still be interested in how it broke (trying to make sure the tests catch all edge cases)

I tested for radius 5 and radius 25. The results were very similar.

I think the output is broken due to an edge case when the outermost pixels are set. This is what it looks like:

This is the code to create the image:

    m_image = juce::Image (juce::Image::PixelFormat::SingleChannel, 512, 512, true);
    juce::Graphics g (m_image);
    g.setColour (juce::Colours::black);
    g.drawRect (0, 0, 512, 512);
    melatonin::blur::juceFloatVectorSingleChannel (m_image, 5);

2 Likes

Whew, back from ADC!

@reFXmkamradt Thanks for the details on what turned out to be the most literal of edge cases, it should now be fixed with tests here

Re: benchmarks, I opened an issue here if you’d be willing to give some detail on your CPU as well as how you benchmarked (what framework or method of timing, whether you included setting up a source image and juce::Graphics context like in the benchmark execution, etc.)

Re: new algo, I’ve got a rough draft of the horizontal pass passing tests with gin-similar performance. I think I’m going to try out a variable divisor (as a replacement for padding the edges with the radius), but after that I think it would be best to address SIMD needs vs. optimize it more linearly. I’ll DM you next week, the algo is very straightforward (create 2 accumulating sums, do two adds and a subtraction, 1 divide (or mul/shift)).

1 Like

Spent another day or two and my conclusion is that the prefix sum algorithm’s main advantage is simplicity — it’s easier to SIMD-ify than traditional stack blur as there’s just less logic. It would be a better “fallback” for Melatonin Blur for that reason. It still can’t outperform vImage on larger images (not sure about IPP yet), I have no idea what magic they are up to there!

Initial tests on NEON/SSE2 show the prefix-sum horizontal pass running roughly 2x faster than stack blur. The bottleneck (especially on larger images) becomes reading/writing pixels across image columns for the vertical pass. I’ve started working on a strategy of rotating the image before the vertical pass (so it can be another horizontal pass reading/writing from contiguous memory).
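The rotate-before-the-vertical-pass strategy boils down to a transpose, so that both blur passes read contiguous rows. A naive sketch of that building block (a production version would transpose in cache-sized tiles rather than element by element):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Transpose a w-by-h single-channel image so the "vertical" blur pass can
// run as a second horizontal pass over contiguous memory.
std::vector<uint8_t> transpose (const std::vector<uint8_t>& img, int w, int h)
{
    std::vector<uint8_t> out ((std::size_t) w * h);
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            out[(std::size_t) x * h + y] = img[(std::size_t) y * w + x];
    return out;
}
```

The full pipeline is then blur rows, transpose, blur rows again, transpose back; whether that wins depends on the transpose cost staying below what the strided column access was wasting.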

But for now I will take a break and get back to my actual UI. @reFXmkamradt, I DM’d details, feel free to dig in if you still want to. I’m an SSE noob, and an AVX implementation would be great (SSE is doing 4 uint8_t prefix sums at once, and it’s the main source of CPU time).

1 Like