Write & read emoji to/from file using json

alexvoina · July 12, 2024, 10:54am

Is it possible to write emojis like this one to a file using juce::JSON and then read it back?

I’m reading a string from audio file tags with TagLib and it seems to work without issues when copying it to a std::string or a juce::String.

However, when I try to write the string containing the emoji to a file using juce::JSON, it gets converted to \ud83c\uddf3\ud83c\uddf4, and when I read it back from the file it ends up like this: \xed\xa0\xbc\xed\xb7\xb3\xed\xa0\xbc\xed\xb7\xb4

// example
// title is a std::string that contains 🇧🇻
json->setProperty("title", juce::String(title));

// this is how i write it to file 
juce::JSON::writeToStream(*outputStream, json);

// 🇧🇻 appears as \ud83c\uddf3\ud83c\uddf4 in the file

// i read the file like this
auto json = juce::JSON::parse(fileInputStream);

// and set the std::string like this 
title = json.getProperty("title", "unknown title").toString().toStdString();

// instead of 🇧🇻 i see this \xed\xa0\xbc\xed\xb7\xb3\xed\xa0\xbc\xed\xb7\xb4

If, for instance, I write the character ‘á’, it works: it is written as \u00e1 and read back as ‘á’. I’ve tried converting things to UTF-8 and using different JUCE helpers & constructors, but nothing worked. I surrender. What’s the catch?

Please don’t tell me the answer is in this article: Technical Deep Dive: Unicode Literals - JUCE

attila · July 12, 2024, 11:37am

If you write

File { "C:/Users/MyUser/Desktop/someTmpFile.txt" }
    .replaceWithText (juce::String(title), true);

where title is the same std::string that you used with setProperty, and then open that file in e.g. VS Code (or another modern text editor), do you see the emoji?

If yes, this could be a bug with juce::JSON, and we’ll look into it. I’m not seeing other issues with your code.

alexvoina · July 12, 2024, 1:09pm

Thanks for the quick reply @attila

Nope, if I write it to a text file as you suggested, VS Code gives me a warning about an unsupported encoding type. I’ve attached the file:
test.txt (138 Bytes)

anthony-nicholls · July 12, 2024, 2:45pm

Are you using the MSVC compiler? If so, have you tried adding the /utf-8 compiler flag? The unsupported encoding type makes me suspicious.

anthony-nicholls · July 12, 2024, 2:47pm

Looking at your code example I’m not sure it would make a difference but it’s probably worth eliminating it as a possibility first.

alexvoina · July 12, 2024, 3:10pm

thanks @anthony-nicholls for your suggestion. No, I’m on macOS

anthony-nicholls · July 12, 2024, 3:36pm

I just ran the following in Xcode without issue; I can run it multiple times to write over the output.

How are you initially loading into the title variable?

EDIT: Correction, I see the issue.

alexvoina · July 12, 2024, 3:41pm

@anthony-nicholls so did you find an issue on your end as well?

I read it from the title audio file tag using taglib.

Here’s the audio file

anthony-nicholls · July 15, 2024, 3:15pm

@alexvoina my finding is that simply loading the following JSON

{
  "flag": "🇧🇻"
}

and resaving it, results in

{
  "flag": "\ud83c\udde7\ud83c\uddfb"
}

At first this seemed like an issue to me. However, it is legal according to the JSON specification, which also states that escaping multi-byte characters is not required:

Any character may be escaped

To escape an extended character that is not in the Basic Multilingual
Plane, the character is represented as a 12-character sequence,
encoding the UTF-16 surrogate pair. So, for example, a string
containing only the G clef character (U+1D11E) may be represented as
“\uD834\uDD1E”.
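To make the quoted rule concrete, here is a minimal C++ sketch of how a code point outside the Basic Multilingual Plane is split into a UTF-16 surrogate pair and written as two \uXXXX escapes. The helper name `escapeCodePoint` is hypothetical and uses only the standard library; it is not how juce::JSON is implemented.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper (not a JUCE API): escape one Unicode code point
// in the \uXXXX style described by the JSON specification.
std::string escapeCodePoint (std::uint32_t codePoint)
{
    char buffer[16];

    if (codePoint < 0x10000)
    {
        // Basic Multilingual Plane: a single six-character escape.
        std::snprintf (buffer, sizeof (buffer), "\\u%04x",
                       static_cast<unsigned> (codePoint));
        return buffer;
    }

    // Outside the BMP: subtract 0x10000, then split the remaining 20 bits
    // into a high surrogate (top 10 bits) and a low surrogate (bottom 10 bits).
    const std::uint32_t offset = codePoint - 0x10000;
    const unsigned high = 0xD800 + static_cast<unsigned> (offset >> 10);
    const unsigned low  = 0xDC00 + static_cast<unsigned> (offset & 0x3FF);

    std::snprintf (buffer, sizeof (buffer), "\\u%04x\\u%04x", high, low);
    return buffer;
}
```

Called with the G clef (U+1D11E) this gives \ud834\udd1e, and with U+1F1E7 followed by U+1F1FB it gives \ud83c\udde7 and \ud83c\uddfb, which is the escaped form shown in the resaved file above.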

What I can’t seem to reproduce is the string of text "\xed\xa0\xbc\xed\xb7\xb3\xed\xa0\xbc\xed\xb7\xb4"

I’ve tried following your code as closely as possible. Below is an example I tried, but I’m not seeing the issue with it. Could you possibly share a more complete example?

int main (int argc, char* argv[])
{
    const juce::StringArray args (argv, argc);
    const juce::File inFile (args[1]);
    const juce::File outFile (args[2]);

    const auto jflag = juce::JSON::parse (inFile).getProperty ("flag", "").toString();
    const auto flag = jflag.toStdString();

    auto json = std::make_unique<juce::DynamicObject>();
    json->setProperty ("flag", juce::String (flag));

    if (auto outputStream = outFile.createOutputStream())
    {
        outputStream->setPosition (0);
        outputStream->truncate();
        juce::JSON::writeToStream (*outputStream, json.release());
    }

    return 0;
}

alexvoina · July 15, 2024, 4:51pm

@anthony-nicholls
That’s weird…

First of all, my flag’s escaped sequence looks a bit different:

\ud83c\uddf3\ud83c\uddf4 - mine
\ud83c\udde7\ud83c\uddfb - yours

I see that you read the JSON and resaved it, and the flag turned into that sequence. Have you tried reading the JSON a second time, to see if the sequence turns back into a flag (e.g. in the CLion debugger when inspecting the variable)? This is the step where I get “\xed\xa0\xbc\xed\xb7\xb3\xed\xa0\xbc\xed\xb7\xb4” instead of seeing, ideally, the flag, or at least the same sequence.

I even tried replacing the sequence in my case with yours. Doesn’t work. Also, I have updated to JUCE 8 because I was on 7.0.5 which was quite old. Still does not work.

I can avoid writing the emojis into the JSON, but I would’ve liked to know why this happens. I’m running out of ideas.

anthony-nicholls · July 15, 2024, 5:54pm

Interesting observation. I’m not 100% sure what causes this.

If I use this unicode escape encoder/decoder then I get the following…

"🇧🇻" -> "\uD83C\uDDE7\uD83C\uDDFB"

"\ud83c\uddf3\ud83c\uddf4" -> "🇧🇻"
"🇧🇻" -> "\uD83C\uDDE7\uD83C\uDDFB"

Looking a little further this tool shows that

\ud83c\udde7\ud83c\uddfb is
- REGIONAL INDICATOR SYMBOL LETTER B
- REGIONAL INDICATOR SYMBOL LETTER V
\ud83c\uddf3\ud83c\uddf4 is
- REGIONAL INDICATOR SYMBOL LETTER N
- REGIONAL INDICATOR SYMBOL LETTER O
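The mapping above can be checked by hand. The following is a small sketch, assuming only the standard UTF-16 surrogate arithmetic and the fact that the regional indicator symbols run from U+1F1E6 (A) to U+1F1FF (Z). The function names are hypothetical and chosen for illustration.

```cpp
#include <cstdint>
#include <string>

// Combine a UTF-16 high/low surrogate pair into a single code point.
std::uint32_t combineSurrogates (std::uint16_t high, std::uint16_t low)
{
    return 0x10000u
         + ((static_cast<std::uint32_t> (high) - 0xD800u) << 10)
         + (static_cast<std::uint32_t> (low) - 0xDC00u);
}

// Map a regional indicator symbol (U+1F1E6..U+1F1FF) to its letter A..Z.
// Returns '?' for any code point outside that range.
char regionalIndicatorLetter (std::uint32_t codePoint)
{
    if (codePoint < 0x1F1E6u || codePoint > 0x1F1FFu)
        return '?';

    return static_cast<char> ('A' + (codePoint - 0x1F1E6u));
}

// Hypothetical helper: the two-letter region code spelled by two surrogate pairs.
std::string regionCode (std::uint16_t high1, std::uint16_t low1,
                        std::uint16_t high2, std::uint16_t low2)
{
    return { regionalIndicatorLetter (combineSurrogates (high1, low1)),
             regionalIndicatorLetter (combineSurrogates (high2, low2)) };
}
```

With this sketch, `regionCode (0xD83C, 0xDDE7, 0xD83C, 0xDDFB)` spells "BV" and `regionCode (0xD83C, 0xDDF3, 0xD83C, 0xDDF4)` spells "NO", matching the two lists above.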

Maybe both show the same character and something about the forum means it gets converted before I copy it?

Have you tried reading the json a 2nd time

Yes. Once I go from UTF-8 to escaped UTF-16 surrogate pairs, it doesn’t change; the result is bit-for-bit identical.

This is the step, where I get the “\xed\xa0\xbc\xed\xb7\xb3\xed\xa0\xbc\xed\xb7\xb4” instead of seeing ideally the flag, or at least the same sequence.

I wonder if this is the CLion debugger more than anything?

alexvoina · July 16, 2024, 8:26am

I wrote a separate test & copied your code exactly, & it seems like the \xY\xY\xY sequence is a CLion thing, because when I write it to a file it always ends up as \ud83c\udde7\ud83c\uddfb

What I don’t understand is whether “\ud83c\udde7\ud83c\uddfb” is part of the UTF-8 encoding or UTF-16.

I have a Flutter application where I convert the const char* from C++ (the pointer returned by toStdString().c_str()) into a Dart string using a UTF-8 decoder, & I get the following exception: “FormatException: Encoded surrogate”.

If I read the string containing the flag from TagLib & pass it to the dart utf8 decoder, it does not throw any exception & I can display the flag without issues.

It seems like this translation that is done by JUCE is causing me problems. Can you display the flag if you print the string (read as \ud83c\udde7\ud83c\uddfb) to the console?

anthony-nicholls · July 16, 2024, 3:32pm

The string itself is UTF-8 encoded (it’s valid ascii), but the contents of the string include an escape sequence that encodes two UTF-16 surrogate pairs.

\ud83c\udde7: REGIONAL INDICATOR SYMBOL LETTER B
\ud83c\uddfb: REGIONAL INDICATOR SYMBOL LETTER V

One way to think about this is that the literal takes up just 8 bytes in the file (two code points at 4 UTF-8 bytes each), whereas the escaped UTF-16 surrogate pairs take up 4 * 6 = 24 bytes (each of the four 16-bit surrogates is written as a 6-character escape, at 1 ascii byte per character).

So the storage is 3 times larger, but the benefit is that the whole file effectively becomes valid ascii with practically no concerns about the encoding of the file itself.

That sounds like the UTF-8 decoder is not fully implemented. The spec is quite clear; it says:

Any character may be escaped. If the character is in the Basic Multilingual Plane (U+0000 through U+FFFF), then it may be represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the character’s code point. The hexadecimal letters A through F can be uppercase or lowercase. So, for example, a string containing only a single reverse solidus character may be represented as “\u005C”.

To escape an extended character that is not in the Basic Multilingual Plane, the character is represented as a 12-character sequence, encoding the UTF-16 surrogate pair. So, for example, a string containing only the G clef character (U+1D11E) may be represented as “\uD834\uDD1E”.

In short:

- Any character may be escaped.
- The escape sequences are in the style \uxxxx.
- If the character being escaped is in the Basic Multilingual Plane (16 bits or less), then a single 6-character sequence in the above format can be used.
- If the character being escaped is not in the Basic Multilingual Plane (more than 16 bits), then a pair of 6-character sequences (in the above format) can be used (this is called a surrogate pair).

That being said I may look at extending the behaviour in JUCE so that we can support saving JSON files with unicode literals as it seems to be such common practice now, but I don’t want to make any promises just yet.

That makes sense. The error is complaining about the escaped surrogate pairs, but in this instance there is just a UTF-8 literal, so there are no escaped surrogate pairs to complain about.

I can print it to the console in Xcode and it prints it as a flag.

However, looking at lldb I can see it does display the string as a hex escaped string.

After spending a lot more time debugging this, I think what is happening is that when the JUCE JSON parser encounters non-ascii it stores the UTF-16 byte representation in the juce::String. The debugger then displays these values as a hex escaped string.

When the JSON is later written to disk it detects these values and converts them back to escape sequences.

If I’m right it means that while the JSON written to disk may well be valid, I’m not sure the string you get from the property is, as it seems like it’s a mix of UTF-8 and UTF-16?

Let me explore this a bit more and get back to you because I want to confirm those findings first.

To be clear though I don’t think this is your problem, I think the problem you’re facing is that your decoder doesn’t appear to support encoded surrogate pairs?
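For context, a strict UTF-8 decoder treats any three-byte sequence that decodes to U+D800..U+DFFF as malformed, because surrogates are not valid code points in UTF-8. The sketch below shows that kind of check in C++. `containsEncodedSurrogate` is a hypothetical helper, and this is an assumption about what such decoders look for, not the actual Dart or JUCE implementation.

```cpp
#include <cstddef>
#include <string>

// Hypothetical check (not a JUCE or Dart API): does this byte string contain
// a UTF-16 surrogate (U+D800..U+DFFF) encoded as a three-byte sequence?
// Such a sequence always starts with 0xED followed by a byte in 0xA0..0xBF;
// in well-formed UTF-8 the byte after 0xED is limited to 0x80..0x9F.
bool containsEncodedSurrogate (const std::string& bytes)
{
    for (std::size_t i = 0; i + 1 < bytes.size(); ++i)
    {
        const auto lead = static_cast<unsigned char> (bytes[i]);
        const auto next = static_cast<unsigned char> (bytes[i + 1]);

        if (lead == 0xED && next >= 0xA0 && next <= 0xBF)
            return true;
    }

    return false;
}
```

The byte string \xed\xa0\xbc\xed\xb7\xb3\xed\xa0\xbc\xed\xb7\xb4 reported earlier in the thread trips this check, while the four-byte-per-character UTF-8 form of the same flag does not.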

alexvoina · July 17, 2024, 6:05am

@anthony-nicholls thanks for the explanation! All is very clear now. Yes, that seems to be my problem and, if anything, I should move to the Flutter GitHub and open an issue there.

I found an option in the Dart UTF-8 decoder code, allowMalformed, which when set to true seems to avoid throwing the exception.

I’m ready to move on, but if you have any extra findings I will be curious to learn more. Thanks for all the help

anthony-nicholls · September 9, 2024, 10:00am

@alexvoina following the issues I found investigating this I made some improvements which have now made it onto the develop branch.

In your case the character was being stored using CESU-8 encoding; you should now see that the encoding is UTF-8, as expected, when debugging. We’ve also changed the default JSON encoding to UTF-8 rather than using ASCII/UTF-16 escape sequences (although you can opt in to ASCII if you prefer).
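To show how CESU-8 differs from UTF-8 for this flag, here is a minimal C++ sketch. In UTF-8 a supplementary code point is encoded directly as four bytes, whereas in CESU-8 it is first split into a UTF-16 surrogate pair and each surrogate is encoded as its own three-byte sequence. The function names are hypothetical, input validation is omitted, and this is not the JUCE code that was changed.

```cpp
#include <cstdint>
#include <string>

// Plain UTF-8 encoding of one code point (no validation; illustration only).
std::string encodeUtf8 (std::uint32_t cp)
{
    std::string out;

    if (cp < 0x80)
    {
        out += static_cast<char> (cp);
    }
    else if (cp < 0x800)
    {
        out += static_cast<char> (0xC0 | (cp >> 6));
        out += static_cast<char> (0x80 | (cp & 0x3F));
    }
    else if (cp < 0x10000)
    {
        out += static_cast<char> (0xE0 | (cp >> 12));
        out += static_cast<char> (0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char> (0x80 | (cp & 0x3F));
    }
    else
    {
        out += static_cast<char> (0xF0 | (cp >> 18));
        out += static_cast<char> (0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char> (0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char> (0x80 | (cp & 0x3F));
    }

    return out;
}

// CESU-8: split a supplementary code point into a surrogate pair, then
// encode each surrogate separately as a three-byte sequence.
std::string encodeCesu8 (std::uint32_t cp)
{
    if (cp < 0x10000)
        return encodeUtf8 (cp);

    const std::uint32_t offset = cp - 0x10000;
    return encodeUtf8 (0xD800 + (offset >> 10))
         + encodeUtf8 (0xDC00 + (offset & 0x3FF));
}
```

For U+1F1F3 followed by U+1F1F4 (the pair escaped as \ud83c\uddf3\ud83c\uddf4), `encodeCesu8` produces the 12 bytes \xed\xa0\xbc\xed\xb7\xb3\xed\xa0\xbc\xed\xb7\xb4 seen in the debugger, whereas `encodeUtf8` produces the 8 bytes \xf0\x9f\x87\xb3\xf0\x9f\x87\xb4.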
