Unicode and font rendering in JUCE


#1

So, I went to the Lorem Ipsum generator and got a bit of text in a few different scripts. I also added a line of emoji (those are not images, but characters from the 'Emoticons' Unicode block).

Lorem ipsum dolor sit amet
את מדויקים מיוחדים אקטואליה לוח
أي غير موالية بتطويق.
謺貙蹖 郺鋋錋 蒠蓔蜳 餤駰 銈,
<emoji go here, but the forum breaks if I try to include them>

Depending on the font, a Label will render most rows with just boxes, and it will also render the Hebrew and Arabic left-to-right.

(image appears to be gone)

An AlertWindow fares better:

Not sure how it works exactly in JUCE, but it appears to make a few assumptions:

(1) You can render a block of text with just one font

This is the most crippling one on the list. On Windows, JUCE by default uses Verdana and Tahoma, which have very limited Unicode coverage, so you'll often see boxes instead of letters whenever you encounter anything other than Latin text.

When a browser displays the text in the quote above, it uses glyphs from a number of different fonts. E.g. on Windows 7 the first line may come from "Arial", while the last line (the emoji) probably comes from "Segoe UI Symbol". You don't have to change any font settings for this; the browser falls back between fonts automatically, per character.
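Per-character font fallback of this kind can be sketched in a few lines. The font names and coverage ranges below are made-up stand-ins for illustration, not real font metrics:

```cpp
#include <string>
#include <utility>
#include <vector>

// A toy model of per-character font fallback: each "font" declares which
// code-point ranges it covers, and the renderer picks, for each character,
// the first font in the list that covers it. Names and ranges are invented.
struct FakeFont
{
    std::string name;
    std::vector<std::pair<char32_t, char32_t>> coverage; // inclusive ranges

    bool covers (char32_t cp) const
    {
        for (const auto& r : coverage)
            if (cp >= r.first && cp <= r.second)
                return true;
        return false;
    }
};

std::string pickFont (const std::vector<FakeFont>& fonts, char32_t cp)
{
    for (const auto& f : fonts)
        if (f.covers (cp))
            return f.name;
    return "(no font; renders as a box)"; // what JUCE's single-font path shows
}
```

A renderer that locks the whole block of text to a single font is the degenerate case of this: every code point outside that one font's coverage falls through to the box.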

(Possible bug: in the AlertWindow the line of emoji is truncated. There could be a mix-up between the number of Unicode code points and the number of 16-bit code units used to encode them as UTF-16.)
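The distinction is easy to demonstrate in plain C++ (no JUCE needed): an emoji outside the Basic Multilingual Plane occupies one code point but two UTF-16 code units, so any code that counts "characters" by counting 16-bit elements will over-count.

```cpp
#include <cstddef>
#include <string>

// Counts Unicode code points in a UTF-16 string by treating a high
// surrogate followed by a low surrogate as one code point.
std::size_t countCodePoints (const std::u16string& s)
{
    std::size_t count = 0;

    for (std::size_t i = 0; i < s.size(); ++i)
    {
        const char16_t c = s[i];

        // Skip the low surrogate of a valid surrogate pair.
        if (c >= 0xD800 && c <= 0xDBFF && i + 1 < s.size()
              && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
            ++i;

        ++count;
    }

    return count;
}
```

For example, U+1F600 is stored as the surrogate pair 0xD83D 0xDE00: `std::u16string (u"\U0001F600").size()` is 2, while `countCodePoints` returns 1. Mixing up the two counts leads to exactly the kind of truncation described above.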

(2) there is a 1 to 1 mapping between code points and glyphs

This one breaks for a few scripts. If you select a font with Arabic glyphs in the demo, you'll see it still looks wrong. An Arabic letter can take several different forms, which are selected depending on the surrounding letters.
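You can see the one-to-many mapping in Unicode itself: the legacy "Arabic Presentation Forms-B" block assigns separate compatibility code points to each contextual form. Taking the letter BEH (U+0628) as an example:

```cpp
// The Arabic letter BEH is a single code point, U+0628, but a font needs
// at least four glyphs for it, chosen from the letter's position in the
// word. The compatibility code points below, from the Arabic Presentation
// Forms-B block, correspond to those four forms.
struct ContextualForm
{
    const char* position;
    char32_t presentationForm;
};

constexpr ContextualForm behForms[] = {
    { "isolated", U'\uFE8F' },
    { "final",    U'\uFE90' },
    { "initial",  U'\uFE91' },
    { "medial",   U'\uFE92' },
};
```

Modern renderers don't use these presentation-form code points directly; a shaping engine (such as the one inside DirectWrite) picks glyphs from the font's substitution tables. But the block illustrates why a 1-to-1 code-point-to-glyph mapping cannot work for Arabic.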

(3) text goes left-to-right

Hebrew and Arabic should be written right to left.
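Full bidirectional layout is defined by the Unicode Bidirectional Algorithm (UAX #9), but for the simplest case, a line consisting entirely of one right-to-left script, the effect is just that display order is the reverse of logical (storage) order. A toy sketch of that simplest case only:

```cpp
#include <algorithm>
#include <string>

// Toy illustration only: for a line made up entirely of right-to-left
// characters, the visual order is the reverse of the logical order in
// which the text is stored. Real layout must run the full Unicode
// Bidirectional Algorithm (UAX #9), which also handles mixed LTR/RTL
// runs, digits and paired punctuation.
std::u32string visualOrderForPureRtlLine (std::u32string logical)
{
    std::reverse (logical.begin(), logical.end());
    return logical;
}
```

A renderer that assumes left-to-right simply paints the logical order as-is, which is why the Hebrew and Arabic lines above come out backwards.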

 

What's happening behind the scenes?

From what I understand, JUCE uses two methods to render text:

  • For a lot of widgets (including the common ones like buttons and labels) it goes via GlyphArrangement, which goes via WindowsDirectWriteTypeface::getGlyphPositions. The code in that function starts from the above assumptions and as a consequence can only handle a very limited subset of Unicode.
     
  • For dialog windows, it goes via DirectWriteTypeLayout::createLayout, which uses DirectWrite directly to generate the entire layout. And the above text will be rendered properly.
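Assuming that description is accurate, an application can opt into the second, more capable path itself by laying text out with AttributedString and TextLayout rather than drawing through the glyph-arrangement-based helpers. A hedged sketch of that route (standard JUCE API calls, not verified against any particular JUCE version):

```cpp
// Sketch only: drawing a complex-script string via juce::TextLayout,
// which on Windows delegates layout to DirectWrite and so handles font
// fallback, Arabic shaping and right-to-left text far better than the
// GlyphArrangement path used by e.g. Graphics::drawText.
void paintComplexText (juce::Graphics& g,
                       const juce::String& text,
                       juce::Rectangle<float> area)
{
    juce::AttributedString s;
    s.append (text, juce::Font (16.0f));

    juce::TextLayout layout;
    layout.createLayout (s, area.getWidth());
    layout.draw (g, area);
}
```

The trade-off is the one discussed below: this path hands every string to the platform text engine, which is noticeably slower than the glyph-arrangement fast path.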
     

So what?

Usually (at least for us Westerners) we can get away with this, but sometimes you can't. E.g. you may encounter files with names in other languages, or text containing these emoticons, or someone may try to translate their application into another language.

Are there plans to improve on this?

--
Roeland


#2

Thanks - yes, we know.

The main problem is performance: we could change things so that DirectWrite/CoreText is used for all text everywhere, but those APIs are slow, and doing that would make some apps cripplingly slow. Ideally we'd find a good cross-platform way of doing it, but the open-source libraries that handle full text shaping are huge and have horrible dependencies, so they'd be a nightmare for people to build with.


#3

I think I may have made a partial duplicate of this thread here: Current state of Fonts/Emoji/Unicode symbols

Are there any improvements on the horizon?


#4

Haven’t got any imminent plans, I’m afraid