Unicode Rendering Bug (Mac & Windows)

Hi all,

I’m using juce::AttributedString to render Unicode text (UTF-8) containing fancy characters such as emojis.

It works, but it stops short of rendering the full string. Here’s some test code taken from a component paint() method:

        juce::AttributedString attrString;
        attrString.setText( L"Abc🍕😎🏈Def" ); // I know - don't embed actual Unicode chars in source code*.
        attrString.setFont( juce::Font( 20.f ) );
        attrString.setJustification( juce::Justification::centred );
        attrString.setWordWrap( juce::AttributedString::WordWrap::none );
        attrString.setColour( juce::Colours::white );
        attrString.draw( g, getLocalBounds().toFloat() );

This is the result on Windows:

[screenshot: Windows result]

And Mac:

[screenshot: Mac result]

Note that it chopped the last 3 characters off on Windows, and the last 3 characters are the wrong font/color on Mac.

The first bug is in juce::AttributedString::setText (const String& newText):

auto newLength = newText.length();

In this example, newLength is set to 9, which seems correct, right? There are 9 characters in total.

Not quite. It should actually be 18! Why? Because it needs the size in bytes, not characters. And since each emoji is represented by 4 bytes in UTF-8, the total for 6 normal characters (6 bytes) plus 3 emojis (3 * 4 = 12 bytes) is 18.

Strictly speaking, I’m not 100% sure whether it wants bytes or something more ‘Unicodey’ like code points or code units. (In UTF-8 a code unit is a single byte, so the byte count and the code-unit count are the same thing, but the code-point count is different whenever multi-byte characters like emojis are involved.)
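
A quick sanity check (reusing attrString from the snippet above) shows the two counts side by side:

    const auto str = attrString.getText();
    DBG (str.length());             // 9  -> code points ("characters")
    DBG (str.getNumBytesAsUTF8());  // 18 -> UTF-8 bytes (one byte per UTF-8 code unit)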

Regardless, changing it to this gets it working on Mac only:

const auto newLength = (int) newText.getNumBytesAsUTF8(); // Returns 18 in this example.

[screenshot: Mac result after the fix]

Unfortunately, an additional fix is required on Windows:

juce::DirectWriteLayout::setupLayout() passes the wrong string length to IDWriteFactory::CreateTextLayout() (see juce_DirectWriteLayout.cpp line 370 in JUCE 6.1.5). Instead of using String::length() it should pass the length in UTF-16 code units, which is 12 in this example (6 ordinary characters at one code unit each, plus 3 emojis at two code units each, because characters outside the BMP need a surrogate pair). For example:

    const std::wstring wstr (text.getText().toWideCharPointer()); // wchar_t is 16-bit UTF-16 on Windows
    const auto textLen = (UINT32) wstr.length();                  // UTF-16 code-unit count (12 in this example)

    hr = directWriteFactory.CreateTextLayout (wstr.c_str(), textLen, dwTextFormat,
                                              maxWidth, maxHeight, textLayout.resetAndGetPointerAddress());

And voila!

[screenshot: Windows result after the fix]
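
As an aside, the UTF-16 code-unit count could probably also be computed without the temporary std::wstring, using JUCE's own character-pointer helpers (untested sketch, assuming CharPointer_UTF16::getBytesRequiredFor() excludes the null terminator as documented):

    // Number of UTF-16 code units needed to represent the string (each unit is 2 bytes).
    const auto textLen = (UINT32) (juce::CharPointer_UTF16::getBytesRequiredFor (text.getText().getCharPointer())
                                     / sizeof (juce::CharPointer_UTF16::CharType));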

There may be more to it than this - I’m not a Unicode expert, nor am I intimately familiar with the JUCE text rendering code - but it’s working well here.

Would be nice to see it properly fixed in JUCE, followed by an update for juce::TextEditor so we can let users actually edit Unicode text in 2022!

Many thanks,
Ben


It occurred to me that I’ve only been testing single-line strings. There will probably be complications with multi-line strings.

Edit: Confirmed. I forgot to mention I also changed an assert at the top of AttributedString::draw() from this…

jassert (text.length() == getLength (attributes));

…to this…

jassert ((int) text.getNumBytesAsUTF8() == getLength (attributes));

That worked nicely until I tried it with a 2-line string. It looks like getNumBytesAsUTF8() includes the “\n” (as you would expect), but getLength (attributes) doesn’t. Obviously it’s just an assert and the text still renders fine, so I’m tempted to disable it for now.

Thanks for reporting this issue.

The root of the problem was that the macOS and Windows built-in formatted string facilities both use UTF-16 strings internally. Attribute positions are specified as offsets into an array of 16-bit words. Because UTF-16 is a variable-width encoding, codepoint N inside the array doesn’t necessarily exist at index N. An additional conversion is required to translate the character index to an offset in the 16-bit buffer.

I imagine the problem has been overlooked for a long time because many commonly-used glyphs can be represented using a single UTF-16 word, so the indices happened to line up.
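
Roughly speaking, the required conversion boils down to something like this (a simplified sketch of the idea rather than the actual code in the fix; the helper name here is made up):

    // Translate an index measured in whole code points into an offset measured
    // in 16-bit (UTF-16) code units, accounting for surrogate pairs.
    static int codePointIndexToUtf16Offset (const juce::String& text, int codePointIndex)
    {
        int utf16Offset = 0;
        auto t = text.getCharPointer();

        for (int i = 0; i < codePointIndex && ! t.isEmpty(); ++i)
        {
            // Code points outside the Basic Multilingual Plane (e.g. emoji)
            // occupy two 16-bit code units (a surrogate pair).
            utf16Offset += (t.getAndAdvance() >= 0x10000 ? 2 : 1);
        }

        return utf16Offset;
    }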

I’ve implemented a fix for the problem here:


I know that you know not to do this, but it’s really not ever a good idea, not even in testing code! I’d recommend using the UTF-8 String-Literal helper tool in the Projucer (in the Tools menu) to convert Unicode strings into a format suitable for inclusion in a C++ source file.
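
For the test string above, the tool produces something along these lines (the escaped byte values are shown purely for illustration):

    // The split before "Def" stops the compiler from reading D/e/f as extra hex digits.
    attrString.setText (juce::String (juce::CharPointer_UTF8 ("Abc\xf0\x9f\x8d\x95\xf0\x9f\x98\x8e\xf0\x9f\x8f\x88" "Def")));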


Thanks for looking into it, @reuk! Very much appreciated.

Glad you were able to home in on a more thorough fix (albeit with some internal issues still up for discussion).

And yes, the embedded emojis were just a super quick convenience, since I could force both VS and Xcode to save/load the .cpp file with appropriate encoding. But you’re absolutely right, of course, that the Projucer helper is the way to go under normal circumstances.

Thanks again,
Ben

Could you elaborate on this, please? Are you seeing problems after updating to the newest develop branch?

Oh, my bad!!! I just misread what you said about the additional conversion that translates the character index to an offset in the 16-bit buffer. I read it as “an additional conversation is required…”, as in the fix isn’t quite 100% finalized yet!

But, of course, it is! Thanks again 🙂


Cool, thanks for clarifying!