Complex Text Layout on Linux


#1

Hi,

I've got an audio application that needs to display international text, especially several languages that require complex text layout (CTL), i.e. complex scripts like Arabic, Urdu, Farsi, etc.

I see from a thread from a few years back (http://www.juce.com/forum/topic/updating-juce-text-system) that although the JUCE text system was reworked to support CTL on Mac and Windows, this functionality has not been extended to Linux.  I realize this is a lot of work and I'm not able to take it on myself at the moment.  However, I have an urgent need to display complex scripts.  

It would be great to continue using a juce::Label to do this, but it looks like that's not possible.  (It's not, right?)  I've just started trying to size up pango and GTK+ for accomplishing this, but it doesn't look like an easy fit so far.  Before I go too far down a rabbit hole, I thought I would put the question out to the JUCE forum:

Does anyone have any suggestions for a way to get a complex script displayed on Linux?   

Is there any code in the works to support CTL on Linux that could be shared?

Thanks!

Steve  

 

 


#2

Hi; I've been dealing with RTL issues myself.

The simplest thing to do is to override the LookAndFeel members which draw text, and use an "AttributedString" to do the text output.  This works OK for Labels and Buttons; unfortunately, the TextEditor is extremely complicated, and I've got someone trying to make it work for RTL, without much success so far.


#3

For editing arbitrary Unicode text, RTL is only one of your worries.

The AttributedString solves those fancy text layout issues by offloading all of that to the operating system's text API (DirectWrite on Windows). This does a lot of things, like:

  • Getting glyphs not present in the current font from another font. That's how an AttributedString can render characters from a lot more languages than a GlyphArrangement.
  • Combining marks. You can encode “é” either by inserting U+00E9 (e with acute accent), or by U+0065, U+0301 (e + combining acute accent). The renderer might use a precomposed glyph anyway in both cases. And when using the latter it may or may not be desirable to erase both code points with one press of backspace. Note that when the letter “i” carries an accent, it has to be rendered without its dot, as in î.
  • For some complex scripts like Arabic you have character shaping, where the glyph you want for a given code point depends on the surrounding letters.
  • Sorting out bidirectional text. This gets really complicated when you have a mix of, say, Latin and Arabic scripts, and there are control characters to override the default direction of a script. A contiguous range of characters in your string might have a gap between them on the screen. And you may have to mirror [brackets] and “quotes” as well.
  • Ligatures. There are fonts, especially serif fonts used in print, which define ligatures. "fi" and "ffi" may be rendered with a single glyph, slightly different than the separate glyphs. The cursor might then need to be placed in the middle of that glyph. But Turkish differentiates between dotted and dotless i (i or ı), so it can't use those ligatures. Anyway, at least for on screen rendering you can get away with disabling the use of ligatures.

Any approach assuming there's a 1-to-1 mapping from a code point to a glyph, and those glyphs will be rendered one after each other, will never work.
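
To make the backspace point concrete, here is a minimal sketch in plain C++ (no JUCE involved). It assumes the text is already held as UTF-32 code points, and it only treats the Combining Diacritical Marks block (U+0300 to U+036F) as "combining"; a real editor would need full grapheme cluster rules. The names isCombiningMark and backspace are made up for this illustration.

```cpp
#include <string>

// Assumption: only the Combining Diacritical Marks block counts as combining.
// Real text needs proper grapheme cluster segmentation.
bool isCombiningMark (char32_t cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}

// Erase the last user-perceived character: any trailing combining marks
// plus the base code point they are attached to.
void backspace (std::u32string& text)
{
    while (! text.empty() && isCombiningMark (text.back()))
        text.pop_back();

    if (! text.empty())
        text.pop_back();
}
```

With this, "café" loses its last letter in one keypress whether the é was stored as U+00E9 or as U+0065 followed by U+0301. Whether that is the behaviour you actually want is the judgment call mentioned above.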

I would say, if at all possible, rely on the API available from the operating system (I have no idea what the standard, commonly available API on Linux systems would be, though). Those APIs allow you to treat the rendered text layout as a black box, and let you ask at which offset to render the cursor for a given index in your text string (see for instance IDWriteTextLayout::HitTestTextPosition).

 

PS. Speaking of Turkish: Usually the upper case of "i" is "I". Unless you have Turkish text, then it is "İ". But that's a whole different can of worms.
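
As a tiny illustration of that can of worms, here is a sketch of an upper-casing helper that takes the language into account. It only handles ASCII letters plus the two Turkish i variants, and the function name and the boolean "turkish" flag are made up for this example; real code would use a proper locale-aware library.

```cpp
// Code points for the two Turkish-specific letters used below.
constexpr char32_t dottedCapitalI = 0x0130; // İ
constexpr char32_t dotlessSmallI  = 0x0131; // ı

// Sketch only: upper-case one code point, ASCII plus the Turkish i rules.
char32_t toUpperSketch (char32_t c, bool turkish)
{
    if (c == U'i')
        return turkish ? dottedCapitalI : U'I';

    if (c == dotlessSmallI)
        return U'I'; // dotless ı always upper-cases to plain I

    if (c >= U'a' && c <= U'z')
        return static_cast<char32_t> (c - (U'a' - U'A'));

    return c;
}
```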


#4

You need to have a look at HarfBuzz (it's what's being used in Firefox & Chrome). This component does the text shaping, applying the right rules for each Unicode script (that's the only code that works right now for all languages). The dependencies are not big (IIRC, if you need *everything*, you'll have to use ICU, but it's not required if you don't need everything from Unicode).

You can make HarfBuzz generate raster lines, that is, the [ x_start   x_end ] spans for each horizontal line of the rendered text. These could be mapped without too much effort to the EdgeTable used in JUCE. The basic idea is that you render your text with HarfBuzz at 10x the size (or less, depending on how much anti-aliasing you need), then get the scan-lines from it (by copying and pasting the example code from the HarfBuzz site, it's 60 lines), then fill them in JUCE (still to be done, but not necessarily hard).
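
Here is a rough sketch of the "render big, then shrink" step in plain C++, independent of how the spans were produced. The Span struct and spansToCoverage are hypothetical names for this illustration: each span is a half-open [xStart, xEnd) run on one row of the oversized rendering, and the function turns them into one 8-bit coverage value per output pixel. Feeding that coverage into a JUCE EdgeTable is left out.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical span type: half-open [xStart, xEnd) on row y of the
// oversized (scale x) rendering.
struct Span { int y, xStart, xEnd; };

// Downsample hi-res spans to a width x height grid of 0..255 coverage.
std::vector<std::uint8_t> spansToCoverage (const std::vector<Span>& spans,
                                           int width, int height, int scale)
{
    std::vector<int> counts (static_cast<std::size_t> (width)
                               * static_cast<std::size_t> (height), 0);

    for (const auto& s : spans)
    {
        const int row = s.y / scale;
        if (s.y < 0 || row >= height)
            continue; // span lies outside the output image

        const int x0 = std::max (s.xStart, 0);
        const int x1 = std::min (s.xEnd, width * scale);

        // Count every covered hi-res sample in the pixel it falls into.
        for (int x = x0; x < x1; ++x)
            ++counts[static_cast<std::size_t> (row * width + x / scale)];
    }

    const int samplesPerPixel = scale * scale;
    std::vector<std::uint8_t> alpha (counts.size());

    for (std::size_t i = 0; i < counts.size(); ++i)
    {
        // Clamp in case spans on the same row overlap.
        const int covered = std::min (counts[i], samplesPerPixel);
        alpha[i] = static_cast<std::uint8_t> (covered * 255 / samplesPerPixel);
    }

    return alpha;
}
```

A scale of 10 gives 100 samples per pixel, which is more than enough; 4 is usually a reasonable trade-off between quality and the amount of span data.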

This lets you display text correctly in whatever language.

This however does not solve the text "input" issue, where you need to map a position to a glyph and then to one or more characters in your string (think of when the user presses the "left" key to go to the previous *displayed* char, or clicks on some text in a text editor to position the cursor).

If you have a ligature or a complex script underneath, the currently displayed glyph might correspond to multiple characters. HarfBuzz gives a bounding box for each glyph displayed (in terms of the chars used for generating the glyph), so you can get the Unicode chars involved for this glyph.
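
To show what that mapping looks like, here is a sketch in plain C++. HarfBuzz reports a "cluster" value per glyph (the index of the first character the glyph came from) together with an advance; the ShapedGlyph struct below is a hypothetical stand-in for that data, not the HarfBuzz types themselves. The sketch assumes a single left-to-right run whose cluster values never decrease, so it ignores bidi entirely.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical per-glyph record: horizontal advance plus the index of the
// first character ("cluster") that produced the glyph.
struct ShapedGlyph { float advance; std::uint32_t cluster; };

// Returns the half-open character range [first, last) behind the glyph at
// horizontal position x. Assumes one left-to-right run.
std::pair<std::size_t, std::size_t> charRangeAt (const std::vector<ShapedGlyph>& glyphs,
                                                 std::size_t textLength, float x)
{
    float penX = 0.0f;

    for (std::size_t i = 0; i < glyphs.size(); ++i)
    {
        const bool isLast = (i + 1 == glyphs.size());

        // Positions past the end of the run snap to the last glyph.
        if (x < penX + glyphs[i].advance || isLast)
        {
            const std::size_t first = glyphs[i].cluster;
            std::size_t last = textLength;

            // The range ends where the next different cluster starts.
            for (std::size_t j = i + 1; j < glyphs.size(); ++j)
            {
                if (glyphs[j].cluster != glyphs[i].cluster)
                {
                    last = glyphs[j].cluster;
                    break;
                }
            }

            return { first, last };
        }

        penX += glyphs[i].advance;
    }

    return { textLength, textLength }; // no glyphs at all
}
```

For the word "office" shaped with an "ffi" ligature, a click on the ligature returns the range 1 to 4, i.e. three characters for one glyph, which is exactly the case a 1 glyph = 1 char interface cannot express.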

This unfortunately is never going to work in JUCE without a huge effort, because the current interfaces expect 1 glyph = 1 Unicode char (which is wrong). I don't think there is any solution to this, except paying Roli to hire someone to fix the interfaces everywhere in the source code.

Cheers.