Faster Blur? || Glassmorphism UI

Thanks for this! This should be in JUCE. I alone have wasted enough hours trying to make the JUCE drop shadow match the design. This took me far less time, even counting the time to fix the translation. There are many solutions in this thread, but this was the simplest. At least for me.

Any idea why JUCE isn’t improving the drop shadows, even though all the work has already been done by the good devs on this thread?

I opened a Feature Request asking the JUCE team to prioritize vector UI tooling and improvements. There was a recent fix for a bug with the shadower but no communication from the team so far on planned improvements.

I wonder if anyone on the current JUCE team is building UI/products with JUCE (as they were in the past). If not, they probably aren’t experiencing the same day-to-day friction that we do. IMO the lion’s share of work building anything real is the UI implementation. I’d love to see the JUCE team hire a UX/UI/design team member and dogfood some first-class UI implementations so these things can receive the attention and priority they deserve.


This is very fast indeed. A quick profile showed it to be almost 4x faster than Gin Stack Blur.

This one cheats. It only does a single channel, so it works fine for simple shadows, but if you want to blur an ARGB image, it will not be able to do it. Ideally, it should detect the image type and then use the single-channel or four-channel algorithm.

It definitely works on ARGB images, but I haven’t done any tests with images that actually make use of an alpha channel.

cc:@gyohng1 @reFX

I can confirm that this works for ARGB images and I am using alpha in those images. If you need the blur to spread even more, you can resample the image down, blur, then resample up again at high quality. You can balance the look and quality of the down/upsample with the performance gained by blurring fewer pixels, so combining these techniques has been really interesting for CPU-based blur and saved me from the complexity of implementing OpenGL (for now).
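To make the downsample, blur, upsample idea concrete, here is a rough sketch on a single row of float samples rather than a real image. Everything in it (`Row`, `downsample2`, `boxBlur3`, `upsample2`, `cheapWideBlur`) is a made-up name for illustration, and the 3-tap box blur just stands in for whichever blur you actually use:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Row = std::vector<float>;   // hypothetical stand-in for one row of one channel

// Halve the resolution by averaging neighbouring pairs (size must be even).
inline Row downsample2 (const Row& in)
{
    assert (in.size() % 2 == 0);
    Row out (in.size() / 2);
    for (std::size_t i = 0; i < out.size(); ++i)
        out[i] = 0.5f * (in[2 * i] + in[2 * i + 1]);
    return out;
}

// 3-tap box blur with clamped edges; a placeholder for the real blur.
inline Row boxBlur3 (const Row& in)
{
    const std::size_t n = in.size();
    Row out (n);
    for (std::size_t i = 0; i < n; ++i)
    {
        const float left  = in[i > 0 ? i - 1 : 0];
        const float right = in[i + 1 < n ? i + 1 : i];
        out[i] = (left + in[i] + right) / 3.0f;
    }
    return out;
}

// Double the resolution with linear interpolation between neighbours.
inline Row upsample2 (const Row& in)
{
    const std::size_t n = in.size();
    Row out (n * 2);
    for (std::size_t i = 0; i < n; ++i)
    {
        const float next = in[i + 1 < n ? i + 1 : i];
        out[2 * i]     = in[i];
        out[2 * i + 1] = 0.5f * (in[i] + next);
    }
    return out;
}

// The blur only touches half as many samples, but its reach in the
// original resolution is roughly doubled.
inline Row cheapWideBlur (const Row& in)
{
    return upsample2 (boxBlur3 (downsample2 (in)));
}
```

With a JUCE image, the two resampling steps would presumably map onto `juce::Image::rescaled` with a suitable resampling quality, but that part is not shown here.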

Yeah, sorry, I was looking at the wrong project. This one is worse than cheating. You would need to compile one version of each algorithm depending on the blur size, and then select the right one at runtime. But whatever works for you.

the Battle for the Fastest Blur continues…

Don’t the templated parameters mean that any implementation you use (blur size, contrast) will be baked-in at compile time?

Depends on your use case! I don’t know the structure @reFX works within, but if portability and a sane, maintainable API for a bunch of developers to work in is your game, I have no argument with his criticism.

It’s inappropriate for an API, but in my case where I have strict control and nobody to answer to (except my future self, who, when this invariably no longer works with whatever pixel format I naively throw at it in 2024, will revisit this comment in shame), it works for now.

As for templating pixel type, you’d have to template the color channel count though and rewrite some of the code to use it. It assumes RGBA in the specific layout that works with the pointer math in the method.

It would mean that

blurImage<10> (image);
blurImage<15> (image);
blurImage<20> (image);

would all get stamped out as separate functions. Not the worst thing in the world, but I doubt it’d be a significant performance gain - I’d be interested in seeing it run in a profiler against other approaches.
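For anyone wondering what "stamped out as separate functions, then selected at runtime" might look like in practice, here is a minimal sketch. It is not the blurImage code from this thread: `boxBlurRow` and `boxBlurRowDispatch` are hypothetical names, and the kernel is a deliberately naive box blur whose only job is to have a compile-time radius:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical kernel: naive box blur of one row. The radius is a template
// parameter, so each radius used produces its own compiled function.
template <int radius>
void boxBlurRow (const std::uint8_t* in, std::uint8_t* out, std::size_t n)
{
    static_assert (radius > 0, "radius must be positive");
    assert (in != out);                        // this sketch is not in-place
    const long last = long (n) - 1;
    for (long i = 0; i <= last; ++i)
    {
        int sum = 0;
        for (int k = -radius; k <= radius; ++k)
        {
            long j = i + k;
            j = j < 0 ? 0 : (j > last ? last : j);   // clamp (replicate) at the edges
            sum += in[j];
        }
        out[i] = std::uint8_t (sum / (2 * radius + 1));
    }
}

// Runtime selection among the variants that were compiled in.
// Returns false for any radius that has no instantiation.
inline bool boxBlurRowDispatch (int radius, const std::uint8_t* in,
                                std::uint8_t* out, std::size_t n)
{
    switch (radius)
    {
        case 10: boxBlurRow<10> (in, out, n); return true;
        case 15: boxBlurRow<15> (in, out, n); return true;
        case 20: boxBlurRow<20> (in, out, n); return true;
        default: return false;
    }
}
```

The cost is visible in the switch: every supported size is one more instantiation in the binary, and anything outside the list simply is not available.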

I looked at templating my approach, since dynamically allocating different amounts of memory based on the blur size was very inefficient.

In the end I settled on capping the blur size to 255 and just allocating the max memory needed. So it’s less efficient on memory usage, but much better for CPU which is the thing that needs addressing.

Doesn’t look like any of the other approaches make use of juce::ThreadPool which I found to have a huge impact, so I’d be interested in seeing if the other approaches have similar gains.

@reFX the algorithm is run sequentially over all 4 channels. It works well with the default JUCE premultiplied ARGB image format.

Only values 1 and 2 make sense - anything above that would typically blur too much. This value is passed directly to the bit shift operation. The approach in the above algorithm is not based on a box blur but is essentially a 1st order IIR run forward and backwards. It also does not need a temporary buffer.
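As a rough illustration of the "1st order IIR run forward and backwards" idea, here is a sketch for one row of one channel. It is not the actual blurImage code: the function name `iirBlurRow` and the way the accumulator is seeded at each end are my assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// One-pole low-pass run left-to-right, then right-to-left, in place.
// No temporary buffer is needed. blurShift is the bit shift from the post
// (1 or 2 in practice); a larger shift means a heavier blur.
inline void iirBlurRow (std::uint8_t* p, std::size_t n, int blurShift)
{
    assert (blurShift >= 0 && blurShift < 8);
    if (n == 0)
        return;

    // 16.16 fixed-point accumulator, seeded with the first pixel (assumption).
    std::int32_t s = std::int32_t (p[0]) << 16;
    for (std::size_t i = 0; i < n; ++i)
    {
        const std::int32_t px = std::int32_t (p[i]) << 16;
        s += (px - s) >> blurShift;     // relies on arithmetic right shift for negatives
        p[i] = std::uint8_t (s >> 16);
    }

    // Backward pass over the already-filtered row, reseeded from the last pixel.
    s = std::int32_t (p[n - 1]) << 16;
    for (std::size_t i = n; i-- > 0;)
    {
        const std::int32_t px = std::int32_t (p[i]) << 16;
        s += (px - s) >> blurShift;
        p[i] = std::uint8_t (s >> 16);
    }
}
```

Running the same filter in both directions is what makes the result roughly symmetric; a single forward pass would smear everything to the right. For a full image you would run this over rows and then columns, per channel.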

I never needed fine control over the blur amount, and the question always is, how many different blurs do you need in your code? But it might still be possible to replace the shift constant with a fixed point multiplication, which may not be bad for performance as long as the multiplication value is cached in a register or used as a constant, for example, by replacing the following code:

int px = int(p[0]) << 16;
s += (px - s) >> blurShift;
p[0] = s >> 16;

with:

int px = int(p[0]);
s = px*blurConstant + (s * otherBlurConstant >> 12);
p[0] = s >> 12;

blurConstant is 0 to 4095
otherBlurConstant = 4096 - blurConstant

(not verified, but should be similar to that)
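Here is one way the fixed-point multiply variant might be fleshed out for a whole row, with forward and backward passes. Treat it as a sketch: `iirBlurRowQ12` is a hypothetical name, the seeding is assumed, and it also accepts 4096 (pass-through) in addition to the 0 to 4095 range above. One thing worth flagging: `s * otherBlurConstant` can reach about 255 * 4096 * 4095, which does not fit in a signed 32-bit int, so the sketch uses 64-bit intermediates.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// One-pole blur with a Q12 coefficient instead of a bit shift.
// blurConstant = 4096 leaves the row untouched; smaller values blur more.
inline void iirBlurRowQ12 (std::uint8_t* p, std::size_t n, std::uint32_t blurConstant)
{
    assert (blurConstant <= 4096);
    if (n == 0)
        return;

    const std::uint64_t a = blurConstant;          // weight of the incoming pixel
    const std::uint64_t b = 4096 - blurConstant;   // weight of the running state

    // s holds the filter state in Q12 (pixel value << 12).
    std::uint64_t s = std::uint64_t (p[0]) << 12;
    for (std::size_t i = 0; i < n; ++i)            // forward pass
    {
        s = std::uint64_t (p[i]) * a + ((s * b) >> 12);
        p[i] = std::uint8_t (s >> 12);
    }

    s = std::uint64_t (p[n - 1]) << 12;
    for (std::size_t i = n; i-- > 0;)              // backward pass
    {
        s = std::uint64_t (p[i]) * a + ((s * b) >> 12);
        p[i] = std::uint8_t (s >> 12);
    }
}
```

Whether the wider multiply costs anything measurable compared with the shift version is something only a profiler can tell you.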

One way I can think of to improve this code is to drop the fixed-point arithmetic and process two channels at once with bit masking. Some vector rasteriser libraries do it this way, i.e., working with BBGGRRAA as BB00RR00 and 00GG00AA, masked via 0xFF00FF00 etc. These can handle additions and shifts with some restrictions and post-masking, which might be sufficient for a simple blur, but the blurImage(img) code worked well for me, so I didn’t bother with the mental acrobatics to implement the latter approach.
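For reference, the masking trick looks roughly like this when used to blend two packed 32-bit pixels. This is the generic technique rather than anything taken from the blur code in this thread; `lerpPacked` is a hypothetical name, and with only 8 bits per channel and no fractional state it will be less precise than the fixed-point version.

```cpp
#include <cassert>
#include <cstdint>

// Blend two packed 4x8-bit pixels, two channels per multiply.
// t is in [0, 256]: 0 returns a, 256 returns b.
inline std::uint32_t lerpPacked (std::uint32_t a, std::uint32_t b, std::uint32_t t)
{
    assert (t <= 256);
    constexpr std::uint32_t mask = 0x00FF00FFu;
    const std::uint32_t u = 256 - t;

    // Channels in bytes 0 and 2: each 16-bit lane holds at most 255 * 256,
    // so the two lanes never carry into each other.
    const std::uint32_t lo = (((a & mask) * u + (b & mask) * t) >> 8) & mask;

    // Channels in bytes 1 and 3: shift them down into the same lanes first;
    // the product lands back in the upper byte of each lane, so only masking is needed.
    const std::uint32_t hi = (((a >> 8) & mask) * u + ((b >> 8) & mask) * t) & ~mask;

    return lo | hi;
}
```

A one-pole step `s += (px - s) >> shift` is a blend between `s` and `px`, so in principle it could be written as `lerpPacked (s, px, 256 >> shift)` on whole pixels.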


This is fascinating George, thanks for sharing your knowledge!

Ahh I see, interesting! I hadn’t looked into how that worked in detail.

I’m inclined to agree - but our designers would not. We currently have this on our backlog because of how JUCE’s current shadow API is very slightly different to that in Figma:

So we’d need that super-fine control to get the shadows to look as close to tools like Figma, Illustrator, etc. as possible.

Shadows - maybe it’s best not to use any blur for it if possible, but draw them using linear gradients (and pixel-aligned radial gradients for the corners) or pre-rendered bitmaps (also based on gradients). What I needed the blur for is to get this effect:

JUCE has a convenient screenshot function for components, which can then be processed by blur and used as a background for the overlay widget. Obviously, this approach doesn’t support dynamic updates of the underlying widget, but in most cases, we could live with it.


Two micro changes. Some compilers will complain about the if enhanceContrast statement or not eliminate the extra branch unless the code is more explicit:

     if constexpr (enhanceContrast)

Second, clamps/jlimit are pessimistic, so the following can also help instead of jlimit in cases where the value will be in range the vast majority of the time such as mine. You have to profile to know with your compiler and options, but instead of jlimit or std::clamp, you can try:

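template <typename T>
inline T limitOptimistic(T val, T min, T max) {
    // __builtin_expect (GCC/Clang) marks the out-of-range branches as unlikely
    if (__builtin_expect(val < min, 0))
        return min;
    if (__builtin_expect(val > max, 0))
        return max;
    return val;
}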

I’ve been working with vdsp a bit and thought I may as well see what their Tent Blur looks like.

It looks very similar to stack (mainly looking at drop shadows at the moment).

For a 200x200px single-channel blur on macOS M1 (Release build), vdsp’s Tent Blur is a bit faster than FigBug's stack blur.

500x500px single channel blur on macOS M1 Release, vdsp seems to be > 2.0x as fast as Stack:

Edit: And seeing similar results on a 2015 Intel MBP

One nice thing about the vdsp route is it’s only a few lines of code:

juce::Image::BitmapData data(img, juce::Image::BitmapData::readWrite);
juce::Image::BitmapData blurData(blur, juce::Image::BitmapData::readWrite);
vImage_Buffer src = { data.getLinePointer(0), height, width, (size_t) data.lineStride};
vImage_Buffer dst = { blurData.getLinePointer(0), height, width, (size_t) blurData.lineStride};
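// Call line reconstructed (assumption): single-channel tent convolution, per the description above
vImageTentConvolve_Planar8(&src, &dst,
        nullptr, 0, 0,
        radius * 2 + 1, radius * 2 + 1,
        0, kvImageEdgeExtend);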

I’ll check out the similar functions for IPP on Windows… It makes sense to me to move all vector/matrix stuff where possible (both image/dsp) to these highly optimized libraries.


@sudara The ‘Tent’ blur sounds very promising, we would call that ‘low hanging fruit’ for the JUCE team!

Ok, I tried out Intel’s FilterGaussianBorder on 500x500px - they don’t seem to have a more efficient blur option and I was too lazy to write a custom kernel for tent convolution.

Overall, it’s slightly slower than Stack on my current machine (AMD Ryzen 9 5900HX), which is maybe to be expected since Gaussian does a lot of work…

One note: drawImageAt seemed strangely expensive on Windows (I showed a milder version here), regularly taking up to 5-10ms to draw the 500x500px image, which seemed a bit suspicious. I saw this on the Mac machines too, but only on the first paint or two, so I assumed some allocation or memory cache effect was happening. Notable, because in those cases drawing the image is 5-10x more expensive than creating it, taking the whole operation out of the “safely animatable” range of timings (I consider this to be <5-10ms on nicer machines).

The ippi API is pretty gross, requiring in/out vars, many calls to prep things, manual custom allocation/freeing. Had no idea what to choose for the Gaussian sigma (or what’s normal), so I tuned it by eye:

// intel calls the area being operated on roi (region of interest)
IppiSize roiSize = {(int) width, (int) height};
int specSize = 0;
int tempBufferSize = 0;
Ipp8u borderValue = 0;
ippiFilterGaussianGetBufferSize(roiSize, radius * 2 + 1, ipp8u, 1, &specSize, &tempBufferSize);
auto pSpec = (IppFilterGaussianSpec *) ippsMalloc_8u(specSize);
auto pBuffer = ippsMalloc_8u(tempBufferSize);
ippiFilterGaussianInit(roiSize, (radius * 2 + 1), 10, ippBorderRepl, ipp8u, 1, pSpec, pBuffer);
auto status = ippGetStatusString(ippiFilterGaussianBorder_8u_C1R(
        (Ipp8u *) data.getLinePointer(0), data.lineStride,
        (Ipp8u *) blurData.getLinePointer(0), blurData.lineStride,
         roiSize, borderValue, pSpec, pBuffer));
// release the spec and work buffer allocated above
ippsFree(pSpec);
ippsFree(pBuffer);

Anyway, I went down this path originally because I thought it would be interesting to try an IPP/vdsp-optimized version of stack blur, but got distracted by these built-in functions. Pretty neat that those are included. I might eventually try writing a stack blur algo, but it’s back to work for now!
