Multithreaded rendering


#1

I have an application where a rather “heavy” (rendering-wise) component is offloaded to a thread job, which renders it into a software graphics context with cached images. When the job completes, the new component is posted back and added as a child of the main GUI, so the other parts of the GUI remain responsive. However, I’ve run into a problem: sometimes, during the off-GUI-thread rendering, I get a crash in LowLevelGraphicsSoftwareRenderer::drawFittedText. Since GlyphCache is a singleton, I suspect the problem is that it is being accessed from two threads at the same time. Any ideas on how to resolve this?


#2

Yes, this is a problem. You could try forking GlyphCache and making it concurrency-friendly by protecting it with a ReadWriteLock. Jules would probably object to using juce::ReadWriteLock in the main branch because it is fairly heavy in terms of resource usage: GlyphCache would have to take the read lock for every character lookup.

But my “mostly” lock-free and wait-free read/write mutex is ideally suited for this pattern of access:

http://rawmaterialsoftware.com/viewtopic.php?f=2&t=8298

For caching glyphs, the “fast path” is the case where the glyph has already been cached; there, acquiring the read lock is both lock-free and wait-free, which is almost zero overhead compared to the current single-threaded implementation of GlyphCache.


#3

Lock-free is nice, but this is after all “just” the GUI, and I think a CriticalSection would do just fine. I tried putting one in the LowLevelGraphicsSoftwareRenderer::GlyphCache class, locking the drawGlyph method:

    void drawGlyph (SavedState& state, const Font& font, const int glyphNumber, float x, float y)
    {
        const ScopedLock lock (cs);

        ...
    }

and since then I’ve had no crashes. But I’m not 100% sure of this solution, or of the placement of the lock. Jules?


#4

Yes… That’d probably be all that’s needed. It’s a safe bet to assume that you won’t have multiple threads fighting over the initialisation of the singleton object, so that’s the only method that would need locking.


#5

Unfortunately, I just got a crash (access violation). The call stacks from the two threads are:

Thread 1 (main GUI thread):
	TheApplication.exe!juce::WindowsTypeface::getOutlineForGlyph(int glyphNumber=49, juce::Path & glyphPath={...})  Line 226 + 0x3 bytes	C++
 	TheApplication.exe!juce::Typeface::getEdgeTableForGlyph(int glyphNumber=49, const juce::AffineTransform & transform={...})  Line 55 + 0x17 bytes	C++
 	TheApplication.exe!juce::LowLevelGraphicsSoftwareRenderer::CachedGlyph::generate(const juce::Font & newFont={...}, const int glyphNumber=49)  Line 2418 + 0x5a bytes	C++
 	TheApplication.exe!juce::LowLevelGraphicsSoftwareRenderer::GlyphCache::drawGlyph(juce::LowLevelGraphicsSoftwareRenderer::SavedState & state={...}, const juce::Font & font={...}, const int glyphNumber=49, float x=13.100442, float y=16.908598)  Line 2488	C++
 	TheApplication.exe!juce::LowLevelGraphicsSoftwareRenderer::drawGlyph(int glyphNumber=49, const juce::AffineTransform & transform={...})  Line 2529	C++
 	TheApplication.exe!juce::PositionedGlyph::draw(const juce::Graphics & g={...})  Line 70 + 0x4a bytes	C++
 	TheApplication.exe!juce::GlyphArrangement::draw(const juce::Graphics & g={...})  Line 768 + 0xc bytes	C++
 	TheApplication.exe!juce::Graphics::drawFittedText(const juce::String & text="<None selected>", const int x=3, const int y=2, const int width=375, const int height=20, const juce::Justification & justification={...}, const int maximumNumberOfLines=1, const float minimumHorizontalScale=0.69999999)  Line 315	C++

Thread 2 (thread job):
 	ntdll.dll!77c00f34() 	
 	[Frames below may be incorrect and/or missing, no symbols loaded for ntdll.dll]	
 	ntdll.dll!77c006a0() 	
 	ntdll.dll!77bdb18c() 	
 	kernel32.dll!77547a7e() 	
 	TheApplication.exe!_free_base(void * pBlock=0x062ae7bc)  Line 109 + 0x12 bytes	C
 	TheApplication.exe!juce::CriticalSection::enter()  Line 77 + 0xc bytes	C++
 	TheApplication.exe!juce::GenericScopedLock<juce::CriticalSection>::GenericScopedLock<juce::CriticalSection>(const juce::CriticalSection & lock={...})  Line 69 + 0x1e bytes	C++
 	TheApplication.exe!juce::LowLevelGraphicsSoftwareRenderer::GlyphCache::drawGlyph(juce::LowLevelGraphicsSoftwareRenderer::SavedState & state={...}, const juce::Font & font={...}, const int glyphNumber=20, float x=52.153477, float y=532.00000)  Line 2451 + 0xf bytes	C++
>	TheApplication.exe!juce::LowLevelGraphicsSoftwareRenderer::drawGlyph(int glyphNumber=20, const juce::AffineTransform & transform={...})  Line 2529	C++
 	TheApplication.exe!juce::PositionedGlyph::draw(const juce::Graphics & g={...})  Line 70 + 0x4a bytes	C++
 	TheApplication.exe!juce::GlyphArrangement::draw(const juce::Graphics & g={...})  Line 768 + 0xc bytes	C++

Maybe the lock should be in LowLevelGraphicsSoftwareRenderer::drawGlyph instead? (i.e. take a scoped lock using the CriticalSection from the GlyphCache)


#6

GlyphCache::drawGlyph is the right place for the lock, but if you use a CriticalSection then all text drawing will be serialized, defeating the original goal of multi-threaded rendering!

You want the read lock held during the for loop, atomic ints for accessCounter and hits, and then the write lock taken if the for loop doesn’t produce a cache hit.
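That pattern can be sketched with the standard library: std::shared_mutex standing in for the read/write lock, std::atomic for the counters. This toy GlyphCache (a std::map keyed by glyph number, with strings as stand-ins for rendered glyphs) is purely illustrative and not the JUCE code:

```cpp
#include <atomic>
#include <cassert>
#include <map>
#include <mutex>
#include <shared_mutex>
#include <string>

// Toy stand-in for the glyph cache: glyphNumber -> "rendered" glyph.
struct GlyphCache
{
    std::shared_mutex rwLock;
    std::atomic<int> accessCounter { 0 };
    std::atomic<int> hits { 0 };
    std::map<int, std::string> glyphs;

    std::string draw (int glyphNumber)
    {
        ++accessCounter;                        // atomic: no lock needed

        {
            std::shared_lock<std::shared_mutex> read (rwLock);  // many readers at once
            auto it = glyphs.find (glyphNumber);

            if (it != glyphs.end())
            {
                ++hits;                         // fast path: cache hit under read lock
                return it->second;
            }
        }   // release the read lock before taking the write lock

        std::unique_lock<std::shared_mutex> write (rwLock);     // exclusive access

        // Another thread may have inserted the glyph between the two locks: re-check.
        auto it = glyphs.find (glyphNumber);

        if (it == glyphs.end())
            it = glyphs.emplace (glyphNumber, "glyph#" + std::to_string (glyphNumber)).first;

        return it->second;
    }
};
```

Note that the read lock is released before the write lock is acquired, so the miss path re-checks the cache under the write lock.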


#7

Ah… I checked in a version with a critical section this morning, but I’ve just realised that it’s not correct: the cached glyphs could get deleted while in use if the cache overflows. It’ll need some sort of ref-counting to avoid that, I guess.
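A sketch of what that ref-counting could look like, using std::shared_ptr in place of JUCE’s reference counting (the Cache and Glyph types here are illustrative stand-ins): eviction only drops the cache’s reference, so a glyph that is still being drawn stays alive until its last user releases it.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <mutex>

struct Glyph { int number; };

struct Cache
{
    std::mutex lock;
    std::map<int, std::shared_ptr<Glyph>> glyphs;

    // The caller gets its own reference, valid even after eviction.
    std::shared_ptr<Glyph> get (int number)
    {
        std::lock_guard<std::mutex> sl (lock);
        auto& slot = glyphs[number];

        if (! slot)
            slot = std::make_shared<Glyph> (Glyph { number });

        return slot;
    }

    // Eviction drops only the cache's reference; in-use glyphs survive.
    void evict (int number)
    {
        std::lock_guard<std::mutex> sl (lock);
        glyphs.erase (number);
    }
};
```

The key point is that the lock only has to be held while looking up or evicting a slot, not for the whole duration of drawing the glyph.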


#8

At this moment, I don’t really care about performance. What I care about is not getting an access violation terminating my appl…

Sounds sensible. Hmm… I’ll try that too :slight_smile:


#9

Where are you seeing this? If you wrap the entire GlyphCache::drawGlyph() function in a mutex then there is no need for reference counting, since the glyph is drawn within the lock. Ref-counting seems like overkill; a read/write lock is much more appropriate here. The GlyphCache access pattern is mostly reads (cache hits), with infrequent bursts of writes.

The first thing we need to do is get rid of the return statement in the middle of the function, since it gets in the way. Then change the counters to atomic variables. Finally, put in the read/write lock. How about this:

void drawGlyph (RenderTargetType& target, const Font& font, const int glyphNumber, float x, float y)
{
    ++accessCounter;
    int oldestCounter = std::numeric_limits<int>::max();
    CachedGlyphType* oldest = nullptr;

    bool found = false;

    rwLock.enterRead ();

    for (int i = glyphs.size(); --i >= 0;)
    {
        CachedGlyphType* const glyph = glyphs.getUnchecked (i);

        if (glyph->glyph == glyphNumber && glyph->font == font)
        {
            ++hits;
            glyph->lastAccessCount = accessCounter.get ();
            glyph->draw (target, x, y);
            found = true;
            break;
        }

        if (glyph->lastAccessCount <= oldestCounter)
        {
            oldestCounter = glyph->lastAccessCount;
            oldest = glyph;
        }
    }

    if (! found)
    {
        rwLock.enterWrite ();

        ++misses;

        if (hits.get() + misses.get() > (glyphs.size() << 4))
        {
            if (misses.get() * 2 > hits.get())
                addNewGlyphSlots (32);

            hits.set (0);
            misses.set (0);
            oldest = glyphs.getLast();
        }

        jassert (oldest != nullptr);
        oldest->lastAccessCount = accessCounter.get();
        oldest->generate (font, glyphNumber);
        oldest->draw (target, x, y);

        rwLock.exitWrite ();
    }

    rwLock.exitRead ();
}

private:
    friend class OwnedArray <CachedGlyphType>;
    ReadWriteLock rwLock;
    OwnedArray <CachedGlyphType> glyphs;
    Atomic <int> accessCounter, hits, misses;

Of course, now we’re calling rwLock.enterRead() every time a character is drawn, whether or not it is in the cache. The juce::ReadWriteLock implementation of enterRead() is VERY expensive, and furthermore juce::ReadWriteLock offers many more features than GlyphCache actually needs. That is why the lock-free read/write lock I posted is superior.

Compare:

juce::ReadWriteLock

void ReadWriteLock::enterRead() const throw()
{
	const Thread::ThreadID threadId = Thread::getCurrentThreadId();
	const ScopedLock sl (accessLock);

	for (;;)
	{
		jassert (readerThreads.size() % 2 == 0);

		int i;
		for (i = 0; i < readerThreads.size(); i += 2)
			if (readerThreads.getUnchecked(i) == threadId)
				break;

		if (i < readerThreads.size()
			  || numWriters + numWaitingWriters == 0
			  || (threadId == writerThreadId && numWriters > 0))
		{
			if (i < readerThreads.size())
			{
				readerThreads.set (i + 1, (Thread::ThreadID) (1 + (pointer_sized_int) readerThreads.getUnchecked (i + 1)));
			}
			else
			{
				readerThreads.add (threadId);
				readerThreads.add ((Thread::ThreadID) 1);
			}

			return;
		}

		const ScopedUnlock ul (accessLock);
		waitEvent.wait (100);
	}
}

and my lock-free implementation:

void ReadWriteMutex::enter_read () const
{
  for (;;)
  {
    // attempt the lock optimistically
    // THIS IS NOT CACHE-FRIENDLY!
    m_readers->addref ();

    // is there a writer?
    // THIS IS NOT CACHE-FRIENDLY!
    if (m_writes->is_signaled ())
    {
      // a writer exists, give up the read lock
      m_readers->release ();

      // block until the writer is done
      {
        Mutex::ScopedLockType lock (*m_mutex);
      }

      // now try the loop again
    }
    else
    {
      break;
    }
  }
}

In the fast path the lock-free version is merely an atomic increment…


#10

(sorry, ignore my last post, I was talking crap…)

Yes, a read/write lock is probably better, I’ll do something like that. There are ScopedReadLock and ScopedWriteLock classes, you know, so it can all be done nicely with RAII.


#11

It might be just a little more complicated than I made it sound. Now that I look at my code, you can see that glyph->lastAccessCount gets modified inside the for loop, which holds only the read lock, so that will have to be addressed.

On the other hand, it’s only an issue if two threads try to draw the same glyph at the same time, in which case only one of them will “win” the write to glyph->lastAccessCount, and since the goal is just to mark the glyph as the “youngest”, perhaps that’s OK. Some analysis will be needed.
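Even if any winner is acceptable, in C++ a racing write to a plain int is a data race, so the field should at least be made atomic; a relaxed store is enough when it doesn’t matter which writer wins. A sketch with illustrative types (not the JUCE code):

```cpp
#include <atomic>
#include <cassert>
#include <thread>

struct CachedGlyph
{
    // Atomic, so concurrent writes are well-defined: one of them simply wins.
    std::atomic<int> lastAccessCount { 0 };

    void touch (int counter)
    {
        // Relaxed ordering suffices: the goal is only to mark the glyph
        // as recently used, not to synchronize with other data.
        lastAccessCount.store (counter, std::memory_order_relaxed);
    }
};
```

Two threads can call touch() on the same glyph concurrently; whichever store lands last determines the value, and either outcome is acceptable for LRU bookkeeping.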

There are also some other points:

  • rwLock.enterRead() is a very heavy function to be calling once per glyph.

  • My lock-free ReadWriteMutex is much better for this access pattern, but it does not support upgrading the lock from read to write (you have to release the read lock first, then acquire the write lock). Upgrading the lock is mandatory here; otherwise, between releasing the read lock and acquiring the write lock, another thread could slip in and invalidate the cached glyph, screwing up a thread that thought it had a cache miss.

  • We might want to make a thread-safe GlyphCache optional, controlled either via preprocessor directives or through a constructor parameter when instantiating the graphics context.

  • One solution that I am 99% positive will work is to just wrap the whole GlyphCache::drawGlyph() function in a mutex. But I am 100% positive this will reduce performance when a lot of text gets drawn. I had a multi-threaded rendering system in my app at one point, which I abandoned because I had to change too much of Juce to get it to work. The performance loss is relative to the number of separate CPU cores used to render: the more cores, the higher the loss. Of course, rendering will still be faster than using a single thread (as long as there are non-text areas), but when drawing text the whole thing is serialized.

To the OP:

In my tests with concurrent rendering, the greatest gains come from putting the cores to work on different parts of the same small data set. In general, it is better to render horizontal strips of pixels than to give each thread a different object. Likewise, it is better to rasterize one shape using multiple cores than to rasterize a different shape per core. More generally, the biggest performance increase comes when the concurrent working set fits into the CPU cache and touches as few cache lines as possible.
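A minimal illustration of the strip idea, assuming a plain 32-bit pixel buffer (fillParallel is a made-up name, not a JUCE function): each thread fills a band of whole rows, so the threads stream through disjoint, contiguous memory and never share a cache line mid-row.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Fill a width x height pixel buffer by giving each thread a band of whole
// rows: each core works on contiguous, non-overlapping memory.
void fillParallel (std::vector<uint32_t>& pixels, int width, int height,
                   uint32_t colour, int numThreads)
{
    std::vector<std::thread> workers;
    const int rowsPerThread = (height + numThreads - 1) / numThreads;

    for (int t = 0; t < numThreads; ++t)
    {
        const int firstRow = t * rowsPerThread;
        const int lastRow  = std::min (height, firstRow + rowsPerThread);

        workers.emplace_back ([&pixels, width, colour, firstRow, lastRow]
        {
            for (int y = firstRow; y < lastRow; ++y)
                for (int x = 0; x < width; ++x)
                    pixels[(size_t) y * width + x] = colour;
        });
    }

    for (auto& w : workers)
        w.join();
}
```

A real rasterizer would of course run a scanline fill per strip rather than a flat colour, but the partitioning of work is the same.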

Squeezing the greatest amount of concurrency out of the system during rendering involves a lot of non-trivial changes to Juce, and without Jules backing it up with updates to the source tree it is definitely a maintenance nightmare, which is why I abandoned it. But let me also add: on a 6-core i7 you can see some staggering improvements.