Accented characters display problem


#1

Hi Jules,

 

I think I've come across a problem in Juce with accented characters. I'm using the tip and I have reproduced it in the Juce Demo project.

If you have a file or a directory on your system whose name contains accented characters separated by spaces, e.g. "é é é.wav", and you look for it in the file tree of the Audio->'File Playback' tab of the Juce Demo project, some characters in the name appear blank, e.g. "é é wav". I have only been able to reproduce this when the accented characters are separated by spaces; a name like "ééé.wav" is displayed properly. I can only reproduce this on Mac, not on Windows.

 

I have found out that accented characters can be represented in two different formats in Unicode:

- plain character code + combining accent code, e.g. in UTF8: é = "\x65\xcc\x81"

- single accented character code, e.g. in UTF8: é = "\xc3\xa9"

When the characters in the string are written in the second format they seem to be displayed correctly; it's only the first format that causes the problem. The problem appears in the FileTreeComponent because the file name obtained by the DirectoryContentsList component is written in the first format.
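
For anyone who wants to check the two byte sequences above, here is a minimal standalone C++ sketch (standard library only, not Juce code) that decodes UTF-8 into Unicode code points. The name decodeUtf8 is just a hypothetical helper for illustration, and it assumes well-formed input.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not part of Juce): decode a UTF-8 string into
// Unicode code points. Minimal sketch, assumes well-formed input and
// does no validation of malformed sequences.
std::vector<std::uint32_t> decodeUtf8 (const std::string& s)
{
    std::vector<std::uint32_t> out;

    for (std::size_t i = 0; i < s.size();)
    {
        const unsigned char lead = static_cast<unsigned char> (s[i]);
        std::uint32_t cp;
        std::size_t extra; // number of continuation bytes that follow

        if      (lead < 0x80) { cp = lead;        extra = 0; } // 1-byte (ASCII)
        else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; } // 2-byte sequence
        else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; } // 3-byte sequence
        else                  { cp = lead & 0x07; extra = 3; } // 4-byte sequence

        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k <= extra && i + k < s.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char> (s[i + k]) & 0x3F);

        out.push_back (cp);
        i += extra + 1;
    }

    return out;
}
```

With this, decodeUtf8 ("\x65\xcc\x81") gives two code points, U+0065 (e) followed by U+0301 (combining acute accent), while decodeUtf8 ("\xc3\xa9") gives the single code point U+00E9. So the two strings look the same on screen but are different sequences as far as the text code is concerned.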

 

Another thing that I have noticed about this issue is that it only happens with Label components, not with TextEditor components. If I make a Label containing this type of text editable, the moment I click on it and the TextEditor opens, the characters are displayed correctly. As soon as I close the editor, the Label displays them incorrectly again.

 

Thanks,

Elvira


#2

This is because Unicode has two normalization forms that matter here, NFC (precomposed) and NFD (decomposed); both can be encoded as UTF8. MacOSX uses the NFD form for file names (the one where é takes 3 bytes), while pretty much everyone else uses NFC (2 bytes).

If you need this kind of Unicode processing handled for you, then you are out of luck with Juce's own text code, as it does not apply any Unicode normalization or shaping to the text before rendering.

You can use native text rendering code, like DirectWrite on Windows, or "probably" CoreText on MacOSX (not sure about this one), so that the OS applies the Unicode shaping (glyph transformation/substitution) for you, but this means non-portable code.

On Linux, you have to use Harfbuzz-ng and/or ICU yourself; they are not integrated in Juce.

 

What you've seen here for French is even worse for Arabic or for more complex scripts like the Indic ones.


#3

There's actually a method String::convertToPrecomposedUnicode() which might do the trick for you.

IIRC the OSX filesystem doesn't automatically precompose the names, where other OSes do.
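
For anyone wondering what "precomposing" means in practice, here is a rough portable sketch. It is not Juce's implementation: composePair and precompose are hypothetical names, the table only covers a handful of Latin letter + accent pairs, and it is nowhere near full NFC normalization (no multiple marks, no reordering). For real code, use String::convertToPrecomposedUnicode() as mentioned above, or a Unicode library such as ICU.

```cpp
#include <string>

// Hypothetical helper (not part of Juce): returns the precomposed code point
// for a (base letter, combining mark) pair, or 0 if the pair is not in this
// deliberately tiny illustrative table.
char32_t composePair (char32_t base, char32_t mark)
{
    struct Entry { char32_t base, mark, composed; };

    static const Entry table[] =
    {
        { 0x61, 0x0300, 0xE0 }, // a + combining grave      -> U+00E0
        { 0x61, 0x0301, 0xE1 }, // a + combining acute      -> U+00E1
        { 0x65, 0x0300, 0xE8 }, // e + combining grave      -> U+00E8
        { 0x65, 0x0301, 0xE9 }, // e + combining acute      -> U+00E9
        { 0x65, 0x0302, 0xEA }, // e + combining circumflex -> U+00EA
        { 0x75, 0x0308, 0xFC }, // u + combining diaeresis  -> U+00FC
    };

    for (const Entry& e : table)
        if (e.base == base && e.mark == mark)
            return e.composed;

    return 0;
}

// Hypothetical helper: walks the code points and merges each known
// base + combining mark pair into its single precomposed code point.
std::u32string precompose (const std::u32string& in)
{
    std::u32string out;

    for (char32_t cp : in)
    {
        if (! out.empty())
        {
            if (const char32_t composed = composePair (out.back(), cp))
            {
                out.back() = composed; // replace the base letter with the composed form
                continue;              // and drop the combining mark
            }
        }

        out.push_back (cp);
    }

    return out;
}
```

For example, precompose (U"e\u0301 e\u0301 e\u0301.wav") returns the same name with each "e + accent" pair replaced by the single code point U+00E9, which is the form that was reported above to display correctly.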


#4

String::convertToPrecomposedUnicode() solved my problem, thanks!