New String problem with high-ascii values


#1

This produces the wrong result:

String s("÷2");

I believe this is the problem:

juce_String.cpp:

String::String (const char* const t)
    : text (StringHolder::createFromCharPointer (CharPointer_UTF8 (t)))
{
}

const char* const t is not really a UTF-8 encoded string; it's an ASCII string.

I think the solution is to just introduce CharPointer_ASCII, with its obvious implementation: each ASCII character maps 1:1 to a UTF-32 code point of the same value, and change the constructor for String(const char* const t).
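To make the 1:1 idea concrete, here is a rough sketch in plain C++. The name widenBytes is made up for illustration and nothing here is JUCE code. One caveat: strict ASCII stops at 0x7f, so a byte like 0xF7 only maps 1:1 to U+00F7 if the source bytes are ISO-8859-1; Windows-1252 differs from that mapping in the 0x80..0x9F range.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not part of JUCE: widen every byte to the code point
// with the same numeric value. This is exact for 7-bit ASCII and for
// ISO-8859-1, but wrong for Windows-1252 bytes in the range 0x80..0x9F.
std::u32string widenBytes (const char* t)
{
    assert (t != nullptr);

    std::u32string out;

    for (; *t != 0; ++t)
        out.push_back (static_cast<char32_t> (static_cast<unsigned char> (*t)));

    return out;
}
```

With this, the two bytes 0xF7 0x32 come out as U+00F7 followed by U+0032, which is the result I expected from the original constructor.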

I’m not 100% sure about all of this though, so don’t hesitate to correct me.


#2

It’s only in ascii because your source-code editor saved the file as ascii. If it had saved it as utf-8, it’d work just fine.

Sadly, there’s only one portable way to encode strings with characters above 0x7f, and that’s by using escaped utf-8 character codes inside char* literals. There’s no other way to reliably get your string from the editor into the compiler and then into the code without risking the encoding being lost somewhere along the way. That’s why I made the String class assume it’s getting utf-8 - the alternative would be to assume that it’s ascii or a local encoding, and they’re worse options.
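To illustrate the difference, here's a simplified stand-alone decoder (not the JUCE implementation; decodeUtf8 and demoDivideSign are names invented for this sketch, and it skips checks for overlong forms and surrogates). It shows that the escaped bytes C3 B7 decode to U+00F7, while a lone F7 byte, which is what a Latin-1 style source file would typically contain for ÷, is not valid UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Simplified UTF-8 decoder for illustration only. Malformed input is
// replaced with U+FFFD. It does not reject overlong forms or surrogates.
std::u32string decodeUtf8 (const std::string& s)
{
    std::u32string out;
    std::size_t i = 0;

    while (i < s.size())
    {
        const unsigned char lead = static_cast<unsigned char> (s[i]);
        char32_t cp = 0xFFFD;      // stays U+FFFD for an invalid lead byte
        std::size_t extra = 0;     // number of continuation bytes expected

        if (lead < 0x80)                { cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }

        ++i;
        bool ok = true;

        for (std::size_t k = 0; k < extra; ++k)
        {
            if (i >= s.size())
            {
                ok = false;
                break;
            }

            const unsigned char next = static_cast<unsigned char> (s[i]);

            if ((next & 0xC0) != 0x80)   // not a continuation byte
            {
                ok = false;
                break;
            }

            cp = (cp << 6) | (next & 0x3Fu);
            ++i;
        }

        out.push_back (ok ? cp : static_cast<char32_t> (0xFFFD));
    }

    return out;
}

// Usage: escaped UTF-8 survives, a raw 0xF7 byte does not.
void demoDivideSign()
{
    const std::u32string good = decodeUtf8 ("\xC3\xB7" "2");
    assert (good.size() == 2 && good[0] == 0xF7 && good[1] == U'2');

    const std::u32string bad = decodeUtf8 ("\xF7" "2");
    assert (bad.size() == 2 && bad[0] == 0xFFFD && bad[1] == U'2');
}
```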

I’ve already been thinking about adding a CharPointer_ASCII class, and my cunning plan is to make the String (const char*) constructor slightly special - it’d assume the string it’s getting is unambiguously ascii, so that if you tried to feed it a value above 0x7f, it’d throw an assertion. That would force you to use a different constructor for extended strings, so in your case you’d have to write String (CharPointer_ASCII ("÷2")). That would mean that it’s the coder’s responsibility to explicitly wrap these strings in an encoding that matches their source-file format.
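Roughly, the check could look something like the sketch below. This is only an illustration of the idea, not what's in the library: isPlainAscii and AsciiOnlyText are hypothetical names, and a std::string stands in for the real string storage.

```cpp
#include <cassert>
#include <string>

// Hypothetical check: true if every byte is in the 7-bit range 0x00..0x7f.
bool isPlainAscii (const char* t)
{
    if (t == nullptr)
        return true;

    for (; *t != 0; ++t)
        if (static_cast<unsigned char> (*t) > 0x7f)
            return false;

    return true;
}

// Hypothetical stand-in for a String (const char*) constructor that only
// accepts unambiguous ascii and asserts in debug builds otherwise.
struct AsciiOnlyText
{
    explicit AsciiOnlyText (const char* t)
        : text (t != nullptr ? t : "")
    {
        assert (isPlainAscii (t));   // fires for any byte above 0x7f
    }

    std::string text;
};
```

Any literal containing a byte above 0x7f would trip the assertion in a debug build, whichever encoding the editor happened to save the file in, so the problem shows up immediately instead of as a wrong character on screen.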


#3

I tried changing the source file with the high-ascii constant to save as “Unicode (UTF-8 with signature) - Codepage 65001” and it didn’t help…

Not sure what to do here.


#4

Be careful adding the BOM to source files, as apparently gcc doesn’t handle it correctly.

The only “correct” thing to do would be to escape it as utf-8, which will at least guarantee it’ll work everywhere.


#5

This worked:

b->getFacade().setTextLabel (TRANS("\xC3\xB7" "2"));
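
For anyone wondering why the literal is split in two: hex escapes in C++ take as many hex digits as follow them, so "\xC3\xB72" would be read as \xC3 followed by the single escape \xB72, which doesn't fit in a char and is typically rejected or warned about by the compiler. Splitting the literal ends the escape before the "2". A small stand-alone check (divideTwoUtf8 and checkDivideTwoBytes are just names for this sketch):

```cpp
#include <cassert>
#include <cstring>

// Adjacent string literals are concatenated after their escapes have been
// processed, so this yields exactly three bytes: 0xC3 0xB7 '2'.
const char* divideTwoUtf8()
{
    return "\xC3\xB7" "2";
}

// Verifies the byte layout of the literal above.
void checkDivideTwoBytes()
{
    const char* s = divideTwoUtf8();

    assert (std::strlen (s) == 3);
    assert (static_cast<unsigned char> (s[0]) == 0xC3);
    assert (static_cast<unsigned char> (s[1]) == 0xB7);
    assert (s[2] == '2');
}
```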