The string itself is UTF-8 encoded (it’s valid ASCII), but its contents include an escape sequence that encodes two UTF-16 surrogate pairs.
One way to think about this is that the literal takes up just 8 bytes in the file (two code points at 4 bytes each in UTF-8), whereas the escaped surrogate pairs take up 4 * 6 = 24 bytes (each of the four UTF-16 code units is written as a 6-character ASCII escape).
So the storage is 3 times larger, but the benefit is that the whole file effectively becomes valid ASCII, with practically no concerns about the encoding of the file itself.
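To make the size comparison concrete, here is a small C++ sketch that counts the bytes needed for both representations. The function names are made up for illustration, and it assumes every code point is either written as a raw UTF-8 literal or as \uXXXX escapes (real JSON writers usually leave plain ASCII unescaped).

```cpp
#include <cstddef>
#include <string>

// Bytes needed to store one Unicode code point as a raw UTF-8 literal.
std::size_t utf8ByteCount (char32_t codePoint)
{
    if (codePoint < 0x80)    return 1;
    if (codePoint < 0x800)   return 2;
    if (codePoint < 0x10000) return 3;
    return 4; // U+10000 and above
}

// ASCII bytes needed to store the same code point as JSON \uXXXX escapes:
// one 6-character escape inside the BMP, a 12-character surrogate pair outside it.
std::size_t escapedByteCount (char32_t codePoint)
{
    return codePoint < 0x10000 ? 6 : 12;
}

// Total UTF-8 size of a sequence of code points.
std::size_t totalUtf8Bytes (const std::u32string& text)
{
    std::size_t total = 0;
    for (char32_t cp : text)
        total += utf8ByteCount (cp);
    return total;
}

// Total size of the same sequence when every code point is escaped.
std::size_t totalEscapedBytes (const std::u32string& text)
{
    std::size_t total = 0;
    for (char32_t cp : text)
        total += escapedByteCount (cp);
    return total;
}
```

For a flag made of two regional-indicator code points (U+1F1EC U+1F1E7 is used here purely as an assumed example), `totalUtf8Bytes` gives 8 and `totalEscapedBytes` gives 24.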
That sounds like the UTF-8 decoder is not fully implemented. The spec is quite clear; it says:
Any character may be escaped. If the character is in the Basic Multilingual Plane (U+0000 through U+FFFF), then it may be represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the character’s code point. The hexadecimal letters A through F can be uppercase or lowercase. So, for example, a string containing only a single reverse solidus character may be represented as “\u005C”.
To escape an extended character that is not in the Basic Multilingual Plane, the character is represented as a 12-character sequence, encoding the UTF-16 surrogate pair. So, for example, a string containing only the G clef character (U+1D11E) may be represented as “\uD834\uDD1E”.
In short:
- Any character may be escaped.
- The escape sequences are in the style \uXXXX.
- If the character being escaped is in the Basic Multilingual Plane (16 bits or less), then a single 6-character sequence in the above format can be used.
- If the character being escaped is not in the Basic Multilingual Plane (more than 16 bits), then a pair of 6-character sequences (in the above format) is used; this is called a surrogate pair.
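The rules above can be sketched in a few lines of C++. This is only an illustration using the standard library, not how JUCE does it; `escapeCodeUnit` and `escapeCodePoint` are hypothetical names, and the sketch assumes the input is a valid code point (it does not reject lone surrogates or values above U+10FFFF).

```cpp
#include <cstdio>
#include <string>

// Formats one 16-bit code unit as a 6-character escape, e.g. \u005C.
static std::string escapeCodeUnit (unsigned int unit)
{
    char buffer[7]; // backslash + 'u' + 4 hex digits + terminating null
    std::snprintf (buffer, sizeof (buffer), "\\u%04X", unit & 0xFFFFu);
    return buffer;
}

// Escapes a single code point following the rules quoted from the spec.
std::string escapeCodePoint (char32_t codePoint)
{
    // Inside the Basic Multilingual Plane: a single escape is enough.
    if (codePoint < 0x10000)
        return escapeCodeUnit (static_cast<unsigned int> (codePoint));

    // Outside the BMP: split the 20-bit offset into a UTF-16 surrogate pair.
    const unsigned int offset = static_cast<unsigned int> (codePoint) - 0x10000;
    const unsigned int high   = 0xD800 + (offset >> 10);   // top 10 bits
    const unsigned int low    = 0xDC00 + (offset & 0x3FF); // bottom 10 bits

    return escapeCodeUnit (high) + escapeCodeUnit (low);
}
```

With the two examples from the spec, `escapeCodePoint (0x5C)` produces `\u005C` and `escapeCodePoint (0x1D11E)` produces `\uD834\uDD1E`.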
That being said, I may look at extending the behaviour in JUCE so that we can support saving JSON files with Unicode literals, as it seems to be such common practice now, but I don’t want to make any promises just yet.
That makes sense. The error is complaining about the escaped surrogate pairs, but in this instance there is just a UTF-8 literal; there are no escaped surrogate pairs to complain about.
I can print it to the console in Xcode and it prints it as a flag.
However, looking at lldb I can see it does display the string as a hex escaped string.
After spending a lot more time debugging this, I think what is happening is that when the JUCE JSON parser encounters non-ASCII characters, it stores the UTF-16 byte representation in the juce::String. The debugger then displays these values as a hex escaped string.
When the JSON is later written to disk it detects these values and converts them back to escape sequences.
If I’m right, it means that while the JSON written to disk may well be valid, I’m not sure the string you get from the property is, as it seems to be a mix of UTF-8 and UTF-16?
Let me explore this a bit more and get back to you because I want to confirm those findings first.
To be clear though, I don’t think this is your problem. I think the problem you’re facing is that your decoder doesn’t appear to support encoded surrogate pairs?
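For reference, a decoder that does handle escaped surrogate pairs could look roughly like the sketch below. It is plain standard C++ rather than anything from JUCE or from your decoder, all three function names are hypothetical, and it only deals with \uXXXX escapes (other escapes such as \n are copied through unchanged).

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Appends one code point to a std::string as UTF-8 bytes.
void appendUtf8 (std::string& out, char32_t cp)
{
    if (cp < 0x80)
    {
        out += static_cast<char> (cp);
    }
    else if (cp < 0x800)
    {
        out += static_cast<char> (0xC0 | (cp >> 6));
        out += static_cast<char> (0x80 | (cp & 0x3F));
    }
    else if (cp < 0x10000)
    {
        out += static_cast<char> (0xE0 | (cp >> 12));
        out += static_cast<char> (0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char> (0x80 | (cp & 0x3F));
    }
    else
    {
        out += static_cast<char> (0xF0 | (cp >> 18));
        out += static_cast<char> (0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char> (0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char> (0x80 | (cp & 0x3F));
    }
}

// Reads the 16-bit value of the \uXXXX escape that starts at text[pos].
unsigned int readEscape (const std::string& text, std::size_t pos)
{
    if (pos + 6 > text.size() || text[pos] != '\\' || text[pos + 1] != 'u')
        throw std::runtime_error ("expected a \\uXXXX escape");

    unsigned int value = 0;
    for (std::size_t i = pos + 2; i < pos + 6; ++i)
    {
        const char c = text[i];
        value <<= 4;
        if      (c >= '0' && c <= '9') value |= static_cast<unsigned int> (c - '0');
        else if (c >= 'a' && c <= 'f') value |= static_cast<unsigned int> (c - 'a' + 10);
        else if (c >= 'A' && c <= 'F') value |= static_cast<unsigned int> (c - 'A' + 10);
        else throw std::runtime_error ("invalid hex digit in escape");
    }
    return value;
}

// Replaces \uXXXX escapes (including surrogate pairs) with UTF-8 bytes.
std::string decodeUnicodeEscapes (const std::string& text)
{
    std::string out;
    std::size_t pos = 0;

    while (pos < text.size())
    {
        if (text[pos] != '\\' || pos + 1 >= text.size())
        {
            out += text[pos++]; // ordinary character
            continue;
        }
        if (text[pos + 1] != 'u')
        {
            out += text[pos++]; // some other escape: copy both characters as-is
            out += text[pos++];
            continue;
        }

        char32_t cp = readEscape (text, pos);
        pos += 6;

        if (cp >= 0xD800 && cp <= 0xDBFF)
        {
            // High surrogate: the next escape must be the matching low surrogate.
            const unsigned int low = readEscape (text, pos);
            if (low < 0xDC00 || low > 0xDFFF)
                throw std::runtime_error ("high surrogate without low surrogate");
            pos += 6;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
        }
        else if (cp >= 0xDC00 && cp <= 0xDFFF)
        {
            throw std::runtime_error ("unexpected low surrogate");
        }

        appendUtf8 (out, cp);
    }
    return out;
}
```

Feeding it the G clef example from the spec, `\uD834\uDD1E`, should give back the four UTF-8 bytes F0 9D 84 9E. A decoder that treats each escape independently would instead produce two invalid 3-byte sequences, which may be what the error is about.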