Embedding unicode string literals in your cpp files


#1

This topic seems to come up again and again, so I thought I'd make a sticky post here to avoid having to repeatedly explain it...

We regularly get people saying "The fonts are broken, my Chinese/Japanese/etc text won't display correctly!"

And the most common reason for this has nothing to do with fonts or graphics, it's because people have written code like this:

String textToDisplay = "一些文字";

The code above is going to screw up the encoding in at least some situations, depending on the compiler and editor that are involved. And no, it can't be magically fixed by sticking an L in front of the literal.

The String class is expecting UTF-8 characters, but compilers have no idea what type of encoding your text editor was using when you saved the source file, and they'll make an assumption which is generally going to be wrong. So most likely, the encoding is going to get garbled somewhere between your editor, the compiler, and the library classes.

The ONLY cross-platform way to embed a Unicode string into C++ source code is by dumbing it down to ASCII + escape characters. That's a pain to write by hand, but luckily if you fire up the Introjucer and use its "UTF-8 String Literal Helper" tool, it'll do all the messy stuff for you, and convert any Unicode string into a safe C++ expression that you can paste into your code, e.g.

String textToDisplay = CharPointer_UTF8 ("\xe4\xb8\x80\xe4\xba\x9b\xe6\x96\x87\xe5\xad\x97");
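For anyone curious where those escape bytes come from, here is a minimal sketch of the idea in plain C++. It is not the Introjucer's actual code; the function names appendUtf8 and toEscapedLiteral are made up for illustration, and it assumes the input is a list of valid Unicode code points. It encodes each code point as UTF-8 and then writes every byte as a \xNN escape.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper: appends the UTF-8 encoding of one code point to `out`.
// Assumes `cp` is a valid Unicode code point (no range or surrogate checks).
void appendUtf8 (std::string& out, char32_t cp)
{
    if (cp < 0x80)
    {
        out += static_cast<char> (cp);
    }
    else if (cp < 0x800)
    {
        out += static_cast<char> (0xC0 | (cp >> 6));
        out += static_cast<char> (0x80 | (cp & 0x3F));
    }
    else if (cp < 0x10000)
    {
        out += static_cast<char> (0xE0 | (cp >> 12));
        out += static_cast<char> (0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char> (0x80 | (cp & 0x3F));
    }
    else
    {
        out += static_cast<char> (0xF0 | (cp >> 18));
        out += static_cast<char> (0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char> (0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char> (0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: returns the text of a C++ string literal in which
// every UTF-8 byte is written as a \xNN escape, so the source stays pure ASCII.
std::string toEscapedLiteral (const std::vector<char32_t>& codePoints)
{
    std::string utf8;
    for (char32_t cp : codePoints)
        appendUtf8 (utf8, cp);

    std::string literal = "\"";
    char buffer[8]; // room for "\xNN" plus the terminating null

    for (unsigned char byte : utf8)
    {
        std::snprintf (buffer, sizeof (buffer), "\\x%02x", static_cast<unsigned> (byte));
        literal += buffer;
    }

    literal += "\"";
    return literal;
}
```

Calling toEscapedLiteral ({ 0x4E00, 0x4E9B, 0x6587, 0x5B57 }) with the four code points of the example string gives the same escaped bytes as the literal above. This sketch escapes every byte, including plain ASCII, which sidesteps the problem of a hex escape swallowing a following hex-digit character; a real tool may well be smarter about that.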

 


#2

Chinese users had problems displaying their devices in an AudioDeviceSelectorComponent object (squares instead of plain characters).

I fixed the problem by adding the following lines in the MainContentComponent constructor:

#if JUCE_MAC || JUCE_WINDOWS

    getLookAndFeel().setDefaultSansSerifTypefaceName("Arial Unicode MS");

#endif

Although this is a specific implementation and won't be sufficient in all cases (Jules, you might want to fix this in the modules regarding the supplied AudioDeviceSelectorComponent class), the interesting point is that the "Arial Unicode MS" font seems to cover both Latin and Chinese characters at once, on both Windows and Mac. I thought I would share this and hope it can help some of you.


#3

A better fix is to use TextLayout in the LnF of Combo and PopupMenu, which will find fallback fonts when glyphs are not available in the current one.


#4

Thank you for this. This might be the way to go for the AudioDeviceSelectorComponent problem (Jules to decide).

As I said, my main point was that Arial Unicode MS seems to work, and is a simple option for cases where there is no code to find a fallback font that would work.


#5

Shoot me down if I'm wrong - but I believe this does the right thing in C++11 and is a little less awkward:

const String fontAwesomeFolder = String::fromUTF8(u8"\uf114");

The bible says: "A string literal that begins with u8, such as u8"asdf", is a UTF-8 string literal and is initialized with the given characters as encoded in UTF-8."

I think this means that the source will (obviously) be in the basic character set, and the string will only ever be a valid UTF-8 encoding, which I think makes it the same as the example using CharPointer_UTF8. Presumably the alternative below does the same thing but I've not tested it:

const String fontAwesomeFolder = CharPointer_UTF8(u8"\uf114");

(Update: not supported in Windows with CTP Nov 2013 compiler - looks like it's in VS 2015 preview though).
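A small sketch, independent of JUCE, that checks what a u8 literal with a universal character name actually contains. The names folderGlyph and folderGlyphIsUtf8 are made up for illustration. It binds the literal with auto& so that it compiles whether the element type is char (C++11 to C++17) or char8_t (C++20 onwards); with char8_t, passing the literal to a function that takes const char* would need a cast.

```cpp
#include <cstddef>

// The literal from the post above: U+F114 written as a universal character name.
// `auto&` keeps the array type, whatever the element type of a u8 literal is.
const auto& folderGlyph = u8"\uf114";

// Returns true if the literal holds exactly the three UTF-8 bytes of U+F114
// (EF 84 94) followed by the terminating null.
bool folderGlyphIsUtf8()
{
    return sizeof (folderGlyph) == 4
        && static_cast<unsigned char> (folderGlyph[0]) == 0xEF
        && static_cast<unsigned char> (folderGlyph[1]) == 0x84
        && static_cast<unsigned char> (folderGlyph[2]) == 0x94
        && folderGlyph[3] == 0;
}
```

Because the source only contains ASCII characters here, the result should not depend on how the editor saved the file.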


#6

Hi,

Does this work for Unicode geometric shapes too? I tried it and the result was only a square, not the geometric shape I wanted...


#7

Does your font definitely have the geometric glyph you want?  I found on Windows that the system font was missing a lot of what I thought was obvious stuff... 


#8

Actually, I'm adding these lines:

#if JUCE_MAC || JUCE_WINDOWS

    getLookAndFeel().setDefaultSansSerifTypefaceName("Arial Unicode MS");

#endif

and the geometric shape appears... but I don't know how to change the font size of a TextButton's text. I am using this geometric shape in a button and don't know how to make it look bigger. Maybe it can be done using LookAndFeel, but I am not very familiar with it as I am new to JUCE.


#9

Hi there.

Try the following.

Create a custom LookAndFeel class for your button. You can do this by deriving from one of the default LookAndFeels and overriding the function that defines the font for the button text:

class MyLookAndFeel : public LookAndFeel_V3
{
public:
    // Chooses the font used for a TextButton's label; the button height is ignored here.
    Font getTextButtonFont (TextButton&, int /*buttonHeight*/) override
    {
        return Font ("Arial Unicode MS", 20.0f, Font::plain);
    }
};

Then, create an instance of this class and pass it to the button by calling the button's setLookAndFeel method. That should do the trick.


#10

Thanks…it does the trick. :wink:


#11

Visual Studio 2015 RC has support for u8"blabla" literals, see https://msdn.microsoft.com/en-us/library/69ze775t%28v=vs.140%29.aspx , but that solves only half the problem.

The compiler also has to know the encoding of the source file. As far as I know there's no compiler option to specify the encoding of the source file. MSVC will assume it is the current code page of the system, based on your system-wide language settings. In other words, it depends on what computer you compile the source file on.

Unless you save the source files as either UTF-16, or as “UTF-8 with byte order mark(†)”. In those 2 cases the encoding is detected correctly.

So if you're able to consistently make sure files are saved in that encoding and all the other compilers you use support that, then maybe you can write u8"☺☺☺". Otherwise it's still at least u8"\u263a\u263a\u263a".

 

(†) The bytes [ef bb bf], the UTF-8 encoding of U+FEFF, are often prepended to a UTF-8 text file as a magic number to tell applications the file is encoded in UTF-8 and not in whatever code page your system is using. However, some programs (e.g. PHP) will misinterpret that as the file starting with U+FEFF, or with the characters ï»¿, or whatever.
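To make the footnote concrete, here is a minimal sketch of the kind of check a tool might do to spot that marker. The function name startsWithUtf8Bom is made up for illustration, and this is not a claim about how MSVC or any other compiler detects it internally.

```cpp
#include <string>

// Hypothetical helper: returns true if the buffer begins with the three bytes
// EF BB BF, i.e. the UTF-8 encoding of U+FEFF used as a byte order mark.
bool startsWithUtf8Bom (const std::string& fileContents)
{
    return fileContents.size() >= 3
        && static_cast<unsigned char> (fileContents[0]) == 0xEF
        && static_cast<unsigned char> (fileContents[1]) == 0xBB
        && static_cast<unsigned char> (fileContents[2]) == 0xBF;
}
```

A file that fails this check is not necessarily something other than UTF-8; it only means the marker is absent, so a reader of the file has to guess or be told the encoding.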


#12

Hello,

On Windows 7 the code:

String degreesymbol = String::fromUTF8("\u00B0");

Font font("Arial Unicode MS", 24, Font::bold);

g.drawText(degreesymbol, 4, 0, width - 4, height, Justification::centredLeft, true);

Does not show the degree symbol, only '0' - this is similar for other characters. Arial Unicode MS is definitely installed and has the characters in it.

In the Juce demo you can paste the degree symbol into the Font demo and it displays correctly under this font. Any ideas why this isn't working?

Thanks, Ivan


#13

You're asking it to parse some UTF8, and then giving it a literal that isn't UTF8!

If you read the original post in this thread, that's exactly the mistake I was talking about!


#14

I don't understand, I'm sorry. I thought that by writing

String degreesymbol = String::fromUTF8("\u00B0");

I was creating a String that could then be drawn using drawText to show the appropriate character.

Is that not the case?

If this is not correct, can you tell me what I can do to achieve this?

Thanks for your assistance.

Added a few minutes later...

Actually I now can do it using

String degreeSymbol = CharPointer_UTF8("\xc2\xb0");

So I wasn't using the correct UTF-8 code? Apologies, I am learning this from zero knowledge - I will research some more and try to get a better understanding.

It works anyway which is the important thing for the moment! :-)

Thank you,

Ivan
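A likely explanation for the difference between the two literals (this is an assumption about the compiler's behaviour, not something confirmed in the thread): a plain narrow literal such as "\u00B0" is stored in the compiler's execution character set, which on a Western European Windows setup would typically be the single byte 0xB0, whereas the UTF-8 encoding of U+00B0 is the two bytes C2 B0. The sketch below, with a made-up function name looksLikeUtf8, is a simplified structural check showing that a lone 0xB0 is not valid UTF-8 while C2 B0 is.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if every lead byte is followed by the
// right number of continuation bytes. Simplified on purpose: it does not
// reject overlong encodings, surrogates, or values above U+10FFFF.
bool looksLikeUtf8 (const std::string& bytes)
{
    std::size_t i = 0;

    while (i < bytes.size())
    {
        const unsigned char lead = static_cast<unsigned char> (bytes[i]);
        std::size_t extra = 0;

        if (lead < 0x80)                 extra = 0; // plain ASCII
        else if ((lead & 0xE0) == 0xC0)  extra = 1; // 2-byte sequence
        else if ((lead & 0xF0) == 0xE0)  extra = 2; // 3-byte sequence
        else if ((lead & 0xF8) == 0xF0)  extra = 3; // 4-byte sequence
        else return false;                          // stray continuation or invalid lead byte

        if (bytes.size() - i <= extra)
            return false;                           // sequence is cut off

        for (std::size_t k = 1; k <= extra; ++k)
            if ((static_cast<unsigned char> (bytes[i + k]) & 0xC0) != 0x80)
                return false;                       // not a continuation byte

        i += extra + 1;
    }

    return true;
}
```

With this check, looksLikeUtf8 ("\xb0") is false and looksLikeUtf8 ("\xc2\xb0") is true, which matches what happened above: the second literal is the one a UTF-8 parser can make sense of.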


#15

You want to have a look at the Introjucer: menu 'Tools' -> 'UTF-8 String Literal Helper'.


#16

Thanks, you can see from my edited reply that's what I just did. I think I just about understand now, but I will read some more!


#17

I'm trying to display musical symbols using Graphics::drawText(). I have the chart in http://www.unicode.org/charts/PDF/U1D100.pdf

The symbols start at 1D100, so obviously not in the 16 bit range. How can I specify these in the code? The tool is unfortunately no help, because it takes the actual symbol but not a hex code. Also the tool creates 16 bit codes...

I'm lost...


#18

const juce_wchar myUnicodeString[] = { 0x123456, 0x345678, 0 };

String s (myUnicodeString);


#19

Thanks for the syntax. But I fail with the semantics. Can you please give me one example for 0x123456 and 0x345678?

e.g. a soprano clef: 1D11E ?

I tried various combinations and converting a 4-byte word into two 3-byte words and a 0? The search engines are spammed with misunderstandings of types and unicodes, so I had no luck there...


#20

Erm.. 1D11E would be 0x1D11E  (...?)
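To spell that out with a sketch that doesn't need JUCE: the array below uses std::uint32_t as a stand-in for juce_wchar, on the assumption that juce_wchar is a 32-bit type, so a code point above 0xFFFF goes in as one element with no splitting. The helper toSurrogatePair is a made-up name and is only there for contrast, to show the two 16-bit values that UTF-16 would use for the same code point.

```cpp
#include <cstdint>
#include <utility>

// Stand-in for a juce_wchar array (assumed 32-bit): one element per code point,
// followed by a terminating zero. U+1D11E is written directly as 0x1D11E.
const std::uint32_t clefString[] = { 0x1D11E, 0 };

// Hypothetical helper, for contrast only: the UTF-16 surrogate pair for a code
// point in the range 0x10000..0x10FFFF (no range checking is done here).
std::pair<std::uint16_t, std::uint16_t> toSurrogatePair (std::uint32_t codePoint)
{
    const std::uint32_t offset = codePoint - 0x10000;

    return std::make_pair (static_cast<std::uint16_t> (0xD800 + (offset >> 10)),
                           static_cast<std::uint16_t> (0xDC00 + (offset & 0x3FF)));
}
```

For 0x1D11E the pair works out as 0xD834 and 0xDD1E. Those two values belong in a UTF-16 string, not in a 32-bit array, where the single value 0x1D11E is all that is needed. Whether the glyph then shows up still depends on the font containing it.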