Unicode and font rendering in JUCE


#1

So, I went to the Lorem Ipsum generator and got a bit of text in a few different scripts. I also added a line of emoji (those are not images, but characters from the 'Emoticons' Unicode block).

Lorem ipsum dolor sit amet
את מדויקים מיוחדים אקטואליה לוח
أي غير موالية بتطويق.
謺貙蹖 郺鋋錋 蒠蓔蜳 餤駰 銈,
<emoji go here, but the forum breaks if I try to include them>

Depending on the font, a Label will render most rows with just boxes, and it will also render the Hebrew and Arabic left-to-right.

(image appears to be gone)

An AlertWindow fares better:

Not sure how it works exactly in JUCE, but it appears to make a few assumptions:

(1) You can render a block of text with just one font

This is the most crippling one in the list. On Windows, JUCE by default uses Verdana and Tahoma, which have very limited Unicode coverage, so you'll often see boxes instead of letters whenever you encounter anything other than Latin text.

When a browser displays the text in the quote above, it uses glyphs from a number of different fonts. E.g. on Windows 7 the first line may be "Arial", while the last line (the emoji) is probably "Segoe UI Symbol". You don't have to change any font settings for this; the browser does it automatically.

(Possible bug: in the AlertWindow the line of emoji is truncated. There could be a mix-up between the number of Unicode code points and the number of 16-bit code units used to encode them as UTF-16.)
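
To illustrate the kind of mix-up I mean, here is a small sketch in plain C++ (not JUCE code; countCodePoints is a name I made up for this post). It counts code points in a UTF-16 string by treating each surrogate pair as one character. I have not checked whether this is actually what goes wrong in the AlertWindow.

```cpp
#include <cstddef>
#include <string>

// Counts Unicode code points in a UTF-16 string.
// A high surrogate followed by a low surrogate counts as one code point.
std::size_t countCodePoints (const std::u16string& s)
{
    std::size_t count = 0;

    for (std::size_t i = 0; i < s.size(); ++i)
    {
        const char16_t unit = s[i];
        const bool isHighSurrogate = unit >= 0xD800 && unit <= 0xDBFF;

        // Skip the low surrogate that completes the pair, if there is one.
        if (isHighSurrogate && i + 1 < s.size()
              && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
            ++i;

        ++count;
    }

    return count;
}
```

A string of two emoji such as u"\U0001F600\U0001F601" has a size() of 4 code units but only 2 code points, so code that uses one number where the other is expected would cut the text short.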

(2) there is a 1 to 1 mapping between code points and glyphs

This one breaks for a few scripts. If you select a font with Arabic glyphs in the demo, you'll see it still looks wrong. Arabic letters can have a few different forms, which are selected depending on the surrounding letters.

(3) text goes left-to-right

Hebrew and Arabic should be written right to left.

 

What's happening behind the scenes?

From what I understand, JUCE uses two methods to render text:

  • For a lot of widgets (including the common ones like buttons and labels) it goes via GlyphArrangement, which goes via WindowsDirectWriteTypeface::getGlyphPositions. The code in that function starts from the above assumptions and as a consequence can only handle a very limited subset of Unicode.
     
  • For dialog windows, it goes via DirectWriteTypeLayout::createLayout, which uses DirectWrite directly to generate the entire layout. And the above text will be rendered properly.
     

So what?

Usually (at least for us Westerners) we can get away with this, but sometimes you can't. E.g. you may encounter files with names in other languages, or text containing these emoticons. Or someone may try to translate their application into another language.

Are there plans to improve on this?

--
Roeland


#2

Thanks - yes, we know.

The main problem is optimisation - we could change things so that DirectWrite/CoreText is used for all text everywhere, but it's very slow, so if we did that it'd make performance cripplingly slow in some apps. Ideally, it'd be nice to find a good cross-platform way of doing it, but the open-source libraries that do this are huge and have horrible dependencies, so would be a nightmare for people to build with.


#3

I think I may have made a partial duplicate of this thread here: Current state of Fonts/Emoji/Unicode symbols

Are there any improvements on the horizon?


#4

Haven’t got any imminent plans, I’m afraid


#5

We’re getting complaints from users loading files with names containing international characters that aren’t displayed correctly. Wouldn’t it be possible to check the string itself for non-Latin characters, use the optimized text rendering code when possible, and fall back to DirectWrite / CoreText only when required (i.e. when the string contains non-Latin characters)?


#6

Even with DirectWrite you will have issues, as the fallback fonts do not work if you use a custom JUCE font.

In my code I sometimes use a trick like this to avoid using my custom font if I detect non-Latin characters.

static bool canBeRepresented(const juce::String &text)
{
    bool res = true;
    int i = text.length() - 1; // start at the last character, not one past it

    // Walk backwards until a character falls outside the accepted ranges.
    while (res && i >= 0)
    {
        int32_t c(text[i]);
        if (!(
              (c < 0x024F) || // latin char
              ((c >= 0x2000) && (c < 0x20D0)) // punctuation
              ))
        {
            res = false;
        }
        i--;
    }
    return res;
}

#7

why not just return false; as soon as you detect it?


#8

Might save a few instructions, but the while loop will exit as soon as res == false anyway.
It would make sense, though, to drop the res variable entirely and just return immediately, as you suggest:

static bool canBeRepresented(const juce::String &text)
{
    int i = text.length() - 1; // start at the last character, not one past it

    while (i >= 0)
    {
        int32_t c(text[i]);
        if (!(
              (c < 0x024F) || // latin char
              ((c >= 0x2000) && (c < 0x20D0)) // punctuation
              ))
        {
            return false; // first character outside the accepted ranges
        }
        --i;
    }
    return true;
}

#9

Actually, that’s irrelevant as far as performance goes… however, String::operator[] has to scan the string from the start to find the character at a given index, so a function written like that gets quadratically slower as the length increases - try this with a big enough string and it’ll lock up for seconds or even minutes!

Here’s how it should be written: :slight_smile:

static bool canBeRepresented (juce::StringRef text)
{
    for (auto t = text.text; ! t.isEmpty();)
    {
        auto c = t.getAndAdvance();

        if (! (c < 0x024f || (c >= 0x2000 && c < 0x20d0)))
            return false;
    }

    return true;
}
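
For anyone who wants to try the single-pass idea without JUCE, here is a rough equivalent in standard C++ over a UTF-8 std::string. The names decodeAndAdvance and canBeRepresentedUtf8 are made up for this sketch, and it assumes the input is well-formed UTF-8 (no validation is done). Each byte is visited once, so the cost grows linearly with the length.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Decodes one code point starting at byte index i and advances i past it.
// Assumes well-formed UTF-8; malformed input is not detected.
std::uint32_t decodeAndAdvance (const std::string& s, std::size_t& i)
{
    const auto b0 = static_cast<unsigned char> (s[i++]);

    if (b0 < 0x80)
        return b0; // single-byte (ASCII) character

    // The lead byte tells us how many continuation bytes follow.
    const int extra = b0 >= 0xF0 ? 3 : (b0 >= 0xE0 ? 2 : 1);
    std::uint32_t c = b0 & (0x3Fu >> extra);

    for (int n = 0; n < extra && i < s.size(); ++n)
        c = (c << 6) | (static_cast<unsigned char> (s[i++]) & 0x3Fu);

    return c;
}

// Same range check as above, applied in a single pass over the bytes.
bool canBeRepresentedUtf8 (const std::string& utf8)
{
    for (std::size_t i = 0; i < utf8.size();)
    {
        const auto c = decodeAndAdvance (utf8, i);

        if (! (c < 0x024f || (c >= 0x2000 && c < 0x20d0)))
            return false;
    }

    return true;
}
```

With this, canBeRepresentedUtf8 ("Lorem ipsum") is true, while a string containing a Hebrew letter such as "\xD7\x90" (U+05D0) gives false.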

#10

Thanks, Oliver! I do use a custom font, so I’ll need to do something similar. These file names appear inside all kinds of JUCE components, so it’ll be a rather time consuming issue to solve… I hope for a more general solution in JUCE in the future.


#11

Thanks for the tip, Jules. I’ve actually encountered this problem before, as I assumed that the [] operator would be O(1). I wrote a utility to load a JUCE project and create dictionaries (and lists of missing strings) for translations based on the source files and a translation database. Needless to say, it turned out extremely slow… Your solution is much nicer than the work-around I used… :slight_smile:


#12

Thanks Jules.
Still I wouldn’t mind not having to do this at all :slight_smile:


#13

Yep, understood!


#14

I’m still not sure why the JUCE drawing methods cannot just look for international characters and split the text up into chunks. The parts that can be displayed with the optimized code could then be drawn with the respective methods, and the other text chunks could use the DirectWrite glyph run stuff and the equivalent for CoreText. Am I missing something?
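
Roughly what I have in mind for the splitting step, as a sketch in plain C++ rather than JUCE code. TextRun, isSimple and splitIntoRuns are names invented for this post, and the range check is simply the one from the posts above. It only shows the chunking itself, not how the separately drawn pieces would be positioned next to each other.

```cpp
#include <string>
#include <vector>

// One chunk of text plus a flag saying which rendering path it would use.
struct TextRun
{
    std::u32string text;
    bool simple; // true: fast path, false: DirectWrite / CoreText path
};

// Same ranges as the canBeRepresented function earlier in the thread.
bool isSimple (char32_t c)
{
    return c < 0x024f || (c >= 0x2000 && c < 0x20d0);
}

// Splits the text into consecutive runs of "simple" and "other" characters.
std::vector<TextRun> splitIntoRuns (const std::u32string& text)
{
    std::vector<TextRun> runs;

    for (char32_t c : text)
    {
        const bool simple = isSimple (c);

        // Start a new run whenever the classification changes.
        if (runs.empty() || runs.back().simple != simple)
            runs.push_back (TextRun { std::u32string(), simple });

        runs.back().text += c;
    }

    return runs;
}
```

For a string like U"abc\u05D0\u05D1 d" this gives three runs: "abc" (simple), the two Hebrew letters (not simple) and " d" (simple).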


#15

Well a) there’s no such thing as “international” characters… it’s all Unicode, and all that matters is whether a particular glyph is in a particular font or not. You could have a font that is missing some basic ASCII characters too.

And b) the way the native layout functions tend to work is that they take a chunk of text and lay it out as a whole, so you can’t really split it up and then somehow join together separate layouts for different bits.


#16

And it would look different


#17

Also, you then have to handle other hairy layout issues like how to reflow all those bits over multiple lines, how to handle right to left text, etc.

There are also cases where the current system breaks for Latin text. Try pasting â ê î into a JUCE text editor.


#18

Thanks – I realize that this is complicated stuff, but simply showing nothing seems like the worst of all alternatives. Text rendering doesn’t seem to work with a font like Google Noto either, so it’s not only a matter of missing characters.


#19

That Chinese, Arabic and Latin texts look different would be the least of my concerns… :wink:


#20

If you use the new juce::Typeface::createSystemTypefaceFor for your custom font it should help.

In my code I use a juce::CustomTypeface because this is the only thing that was available at the time and if I switch to the new createSystemTypefaceFor, it looks different, so…