Drawing speed problem (Timer or other bottleneck?)

I’ve just done a test with an Image that gets zoomed and rotated via a Timer that should repeat every 10ms.

I did this test first in my own framework (which uses AGG for drawing), then in JUCE. Although JUCE’s drawing speed for the transformation is good (as shown by the milliseconds display in the top-left corner), there seems to be a problem somewhere else, because the JUCE version of the small demo app is significantly slower than the one using my own framework.

Files at: http://www.2shared.com/file/unKdA0xg/timerProblem.html

Please check out the files - both versions were compiled using VC2008 (Release). The rotation/zoom code for both is identical, so that cannot be the problem.
It’s obvious that the Timer no longer fires every 10ms once the painting puts a heavy load on the CPU (but it should still be firing much faster than it does, considering that drawing the image only takes about 20ms!).
Maybe the Timer gets slowed down by some additional overhead in JUCE for preparing the Graphics context and finally blitting it back to the window surface?

In any case, it would be worth knowing what’s going on!

Here’s the main code:

[code]class MainComponent : public Component,
                      public Timer
{
public:
    MainComponent()
    {
        setSize (100, 100);
        mouseX = mouseY = 0;
        setOpaque (true);
        angle = 0.0;
        startTimer (10);    // aim for a callback every 10ms

        PNGImageFormat png;
        image  = png.loadFrom (File ("image.png"));
        image2 = png.loadFrom (File ("image2.png"));
    }

    void timerCallback()
    {
        angle += 0.01;
        if (angle > (2.0 * 3.1415))
            angle = 0.0;

        repaint();
    }

    void paint (Graphics& g)
    {
        g.fillAll (Colours::black);

        g.setImageResamplingQuality (Graphics::lowResamplingQuality);

        // time only the transformed image draw
        double t1 = Time::getMillisecondCounterHiRes();

        double zoom = 1.0 + sin (angle) * 0.25;
        double a    = -3.1415 * 0.5 + sin (angle) * 0.25;

        AffineTransform t;
        t = t.rotated ((float) a);
        t = t.scaled ((float) zoom, (float) zoom);
        t = t.translated ((float) mouseX, (float) mouseY);

        g.drawImageTransformed (image2, t);

        double t2 = Time::getMillisecondCounterHiRes();

        g.setColour (Colours::white);
        g.drawText (String (t2 - t1), 10, 10, 100, 12, Justification::centredLeft, false);
    }

    void mouseMove (const MouseEvent& e)
    {
        mouseX = e.x;
        mouseY = e.y;
        repaint();
    }

    double angle;
    int mouseX, mouseY;
    Image image, image2;
};[/code]

Have you tried profiling it in VerySleepy or something?

The 2shared service doesn’t work for me. You know you can upload directly to this forum?

I couldn’t upload it as it’s bigger than 0.5 MB.

I never managed to get any sensible information out of VerySleepy; I never understood how to use it properly. But I can gladly send you my test project.

I found out some more:

JUCE takes 25ms for everything done in WM_PAINT (the drawing consists of filling a 1920x1080 screen with black & showing the milliseconds used).
My framework takes about 14ms.

Filling the screen (black) takes about 4ms.
The DrawDibDraw itself takes about 10ms.

It seems JUCE is losing the extra 10ms somewhere. I haven’t checked JUCE’s code in depth, but I seem to remember that some allocation of the drawing surface was done each time? In my framework I only allocate the drawing surface once (until the window gets resized).
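
Roughly like this, as a sketch (the names are made up for illustration; this is neither my framework’s actual API nor JUCE’s internals):

[code]// Sketch: allocate the backing surface once and reuse it, reallocating only on resize.
class CachedSurfaceComponent : public Component
{
public:
    void resized()
    {
        // reallocate the drawing surface only when the size actually changes
        backBuffer = Image (Image::ARGB, jmax (1, getWidth()), jmax (1, getHeight()), false);
    }

    void paint (Graphics& g)
    {
        Graphics surface (backBuffer);     // draw into the cached surface
        surface.fillAll (Colours::black);
        // ... expensive drawing goes here ...

        g.drawImageAt (backBuffer, 0, 0);  // blit the cached surface to the window
    }

private:
    Image backBuffer;
};[/code]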

As for the animation problem, it really seems that the Timers stop firing at the prescribed rate as soon as there’s too much work in the GUI thread. Sometimes this is actually good, but other times one would still like to keep the Timers going at the original rate, even when heavy work is being done in the GUI thread.
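
One possible workaround, just as a sketch (not tested): derive the angle from the elapsed wall-clock time instead of counting ticks, so late timer callbacks don’t change the animation speed, only its smoothness:

[code]// Sketch: advance 'angle' by real elapsed time, so late timer callbacks don't slow the animation.
void timerCallback()
{
    const double now = Time::getMillisecondCounterHiRes();
    angle += (now - lastTickMs) * 0.001;     // 1 radian per second, for example
    lastTickMs = now;

    if (angle > (2.0 * 3.1415))
        angle -= 2.0 * 3.1415;

    repaint();
}

double lastTickMs;   // initialise to Time::getMillisecondCounterHiRes() in the constructor[/code]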

Interesting… I can’t really think where it could be losing that 10ms… Perhaps a difference in the choice of function that we’re using for the blit? My code can use either StretchDIBits or DrawDibDraw, and years ago when I originally wrote it, DrawDibDraw was always faster, but maybe in newer versions of Windows it’s faster to use StretchDIBits…? (If you want to try this out, it’s very easy to hack the juce code to make it use StretchDIBits)
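
For reference, the call would look roughly like this (just a sketch - hdc, imageWidth, imageHeight and pixelData are placeholders, not the actual names used inside the juce code):

[code]// Sketch: blit a 32-bit top-down DIB to a window DC with StretchDIBits.
BITMAPINFO bmi = {};
bmi.bmiHeader.biSize        = sizeof (BITMAPINFOHEADER);
bmi.bmiHeader.biWidth       = imageWidth;
bmi.bmiHeader.biHeight      = -imageHeight;      // negative height = top-down rows
bmi.bmiHeader.biPlanes      = 1;
bmi.bmiHeader.biBitCount    = 32;
bmi.bmiHeader.biCompression = BI_RGB;

StretchDIBits (hdc,
               0, 0, imageWidth, imageHeight,    // destination rectangle
               0, 0, imageWidth, imageHeight,    // source rectangle
               pixelData,                        // pointer to the pixel rows
               &bmi, DIB_RGB_COLORS, SRCCOPY);[/code]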

I have tried BitBlt & DrawDibDraw (you are using DrawDibDraw), and they both take exactly the same amount of time. The extra 10ms are lost somewhere else. Couldn’t it be the allocation of the drawing surface? In any case, 10 ms is a lot, considering the fillAll() for 1920x1080 takes 4ms!

Ok, I’m stuck then! I can’t see anything else that happens in the paint callback that could take any significant amount of time!

I’ll try to find out.

OK, I found out what it is. In JUCE, DrawDibDraw takes twice as long. I assumed it would always take 10ms because in my code it was taking 10ms, but in JUCE it really takes 20ms.

When I force the pixelStride of the WindowsBitmapImage to 4, no matter whether the image is RGB or ARGB, DrawDibDraw suddenly takes less than 10ms (5-10ms) in JUCE! So that’s it.
It’s probably because, if the gfx card has a colour depth of 32, DrawDibDraw has to convert from 3 bytes per pixel to 4 bytes per pixel internally, and that takes time.
I guess most gfx cards use 32 bits today.
One could add some code that checks the bit depth of the gfx card and adapts the pixel stride accordingly, even if the last of the 4 bytes goes unused?
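
Something along these lines, perhaps (only a sketch; the real change would have to go into WindowsBitmapImage, and these names are just for illustration):

[code]// Sketch: query the desktop colour depth and pick a 4-byte stride on 32-bit displays,
// so DrawDibDraw doesn't have to repack 3-byte pixels into 4-byte ones.
HDC screenDC = GetDC (0);
const int bitsPerPixel = GetDeviceCaps (screenDC, BITSPIXEL);
ReleaseDC (0, screenDC);

const int pixelStride = (bitsPerPixel > 24) ? 4 : 3;[/code]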

Wow! Didn’t expect that, but I guess it kind of makes sense…

Maybe it should just always use an ARGB image then? That’d be a simple way to avoid the issue, and probably wouldn’t have any impact on the speed of anything else.

I guess that would work well if the graphics card’s colour resolution is 32-bit. But if it were 24-bit, it would be slower again (since DrawDibDraw would then need a conversion the other way, from 32 to 24 bits)?

Maybe. Though I don’t think there’s any reliable way to find out what’s going on inside the graphics card with that kind of detail. Not sure what to suggest!

[code]HDC dc = GetDC (0);
int bitsPerPixel = GetDeviceCaps (dc, BITSPIXEL);
ReleaseDC (0, dc);   // release the screen DC when done[/code]

Does this mean it could speed up juce drawing on Windows by a factor of 2?
(My card reports 32 bits as well.)

Well, not really: it’ll only speed up the time it takes to get your finished drawing onto the screen, and unless you’re animating the entire screen you’re unlikely to see a measurable difference.

But I’ve checked in some code if you want to try it out…

Jules, this applies to texture uploads to OpenGL too. Some rules of thumb:

Any pixel format without alpha will trigger a time-consuming CPU swizzle before the actual upload. Depending on the upload method, that can also cause a CPU <> GPU sync, so the extra time is doubly poorly spent.

An 8-bit unsigned integer RGBA format will transfer more quickly in BGRA order on Macs, and on NVidia cards on other platforms (not sure about AMD, Intel, VIA etc.). Mac has some special enums for this.

Most other pixel formats (16-bit float, 32-bit int etc.) are best transferred as RGBA.

It’s to do with the internal texture storage on the card.

There are tools to test it - transferBench is one. EDIT: you can also see it in the JuceDemo - when enabling the OpenGLRenderer, the RGB tiled image is notably slower than the ARGB one.

FYI - in the BGRA case the best approach is to ‘tell’ OpenGL that the pixel format is correct, then use a pixel shader to swizzle it (essentially for free). I also use this approach for a range of other pixel formats.
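
To illustrate the BGRA point above, a plain upload would look something like this (just a sketch, assuming an 8-bit-per-channel BGRA buffer called pixels and an existing textureID):

[code]// Sketch: upload 8-bit BGRA pixels so the driver can take its fast path (no CPU swizzle).
glBindTexture (GL_TEXTURE_2D, textureID);
glTexImage2D (GL_TEXTURE_2D, 0, GL_RGBA8,
              width, height, 0,
              GL_BGRA,                 // source data is in BGRA order
              GL_UNSIGNED_BYTE,        // (or GL_UNSIGNED_INT_8_8_8_8_REV on the Mac)
              pixels);[/code]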

Bruce

PS - Jules, your Git comments talk about ‘on 32-bit graphics cards’. I think that’s all there is these days.

Wow! Blitting time went down from 20ms to about 5ms here! Amazing! :slight_smile: