BinaryData xml resource in wrong encoding

Hello.

I have embedded a font and a few XML files as resources using

juce_add_binary_data(my_resources SOURCES
    "font/Jura-Light.ttf"
    "Localization/en/localization_en.xml"
    "Localization/ru/localization_ru.xml")

The last file contains Cyrillic letters and declares the XML encoding windows-1251 (the file itself is also saved in Windows-1251 encoding):

<?xml version="1.1" encoding="windows-1251" ?>

but instead of generating Cyrillic symbols in BinaryData, it generates what look like some unicode escape characters:

namespace BinaryData
{

//================== localization_ru.xml ==================
static const unsigned char temp_binary_data_2[] =
"<?xml version=\"1.1\" encoding=\"windows-1251\" ?>\n"
"<root>\n"
"    <Word key=\"Hello\">Hello</Word>\n"
"    <Word key=\"Scan\">\xd1\xea\xe0\xed\xe8\xf0\xee\xe2\xe0\xf2\xfc</Word>\n"
"    <Word key=\"Cancel\">\xce\xf2\xec\xe5\xed\xe0</Word>\n"
"</root> ";

const char* localization_ru_xml = (const char*) temp_binary_data_2;
}

And when I run

parseXML(BinaryData::localization_ru_xml)

in debug mode I hit an assertion:

... trying to create a string from 8-bit data ...

Why?

juce::parseXML takes a juce::String, but here you’re passing it a pointer to some raw binary data (hence the error message about creating a string from 8-bit data). You probably want to use juce::String::createStringFromData().

As for decoding the unicode characters, you’ll possibly need to use juce::CharPointer_UTF8 (or one of the other char pointer classes) to parse them correctly.

Well, createStringFromData removes the assertion, but I still get garbled text in the XML.
Whatever I try, the text is not decoded properly and I get the same garbled result:

CharPointer_UTF8(xml_elm->getAllSubText().getCharPointer()); //"Ñêàíèðîâàòü"
CharPointer_UTF8(xml_elm->getAllSubText().toRawUTF8()); //"Ñêàíèðîâàòü"
xml_elm->getAllSubText(); //"Ñêàíèðîâàòü"

I believe the CMake function juce_add_binary_data has a bug in how it generates the data…
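
For what it's worth, the garbled string shown above is exactly what you get when the escaped bytes from the generated file are read one byte per character as Latin-1 instead of as Windows-1251. A minimal sketch of that reading, in plain C++ without JUCE; latin1ToUtf8 is a hypothetical helper made up for this illustration, not a JUCE function, and it does not claim to show what JUCE does internally:

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not a JUCE function): read each byte as one
// ISO-8859-1 (Latin-1) character and re-encode the result as UTF-8.
std::string latin1ToUtf8(const std::string& bytes)
{
    std::string out;

    for (unsigned char b : bytes)
    {
        if (b < 0x80)
        {
            out += static_cast<char>(b);                 // ASCII passes through
        }
        else
        {
            // Latin-1 code points 0x80..0xFF need two UTF-8 bytes.
            out += static_cast<char>(0xC0 | (b >> 6));
            out += static_cast<char>(0x80 | (b & 0x3F));
        }
    }

    // Sanity check: UTF-8 output is never shorter than the Latin-1 input.
    assert(out.size() >= bytes.size());
    return out;
}
```

Feeding it the bytes "\xd1\xea\xe0\xed\xe8\xf0\xee\xe2\xe0\xf2\xfc" from the generated array produces the UTF-8 text "Ñêàíèðîâàòü", which matches the output in the post above. That suggests the embedded bytes are intact and only their interpretation is off.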

I’m not sure JUCE supports this encoding. You’d need to convert the file to UTF-8 so that the binary data generator can then embed it properly.
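
If re-saving the file in an editor is not an option, the conversion to UTF-8 can be sketched in a few lines. This is a rough illustration under stated assumptions: windows1251ToUtf8 is a hypothetical helper, not part of JUCE, and it only maps ASCII plus the Russian alphabet (А..я, Ё, ё); every other byte is replaced with '?'.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not a JUCE function): convert Windows-1251 text to UTF-8.
// Assumption: only ASCII and the Russian alphabet are mapped; all other
// bytes are replaced with '?'.
std::string windows1251ToUtf8(const std::string& bytes)
{
    std::string out;

    for (unsigned char b : bytes)
    {
        unsigned int codePoint = 0;

        if (b < 0x80)
        {
            out += static_cast<char>(b);        // ASCII passes through
            continue;
        }
        else if (b >= 0xC0)
        {
            codePoint = 0x0410u + (b - 0xC0u);  // 0xC0..0xFF -> U+0410..U+044F
        }
        else if (b == 0xA8)
        {
            codePoint = 0x0401;                 // Ё
        }
        else if (b == 0xB8)
        {
            codePoint = 0x0451;                 // ё
        }
        else
        {
            out += '?';                         // unmapped in this sketch
            continue;
        }

        // Every code point produced above fits in two UTF-8 bytes.
        assert(codePoint >= 0x80 && codePoint < 0x800);
        out += static_cast<char>(0xC0 | (codePoint >> 6));
        out += static_cast<char>(0x80 | (codePoint & 0x3F));
    }

    return out;
}
```

With this, the bytes "\xce\xf2\xec\xe5\xed\xe0" from the Cancel entry become the UTF-8 encoding of "Отмена". The encoding attribute in the XML declaration would also need to be changed to match the converted file.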

Yes, already tried. No luck :(

Can you share the original file?

localization_ru_utf-8.txt (189 Bytes)
Thank you for helping me.

Here is the file. I changed the extension to txt, since XML is not an allowed upload format here.

I tried to save the file in different encodings, but with no luck.

UPDATE: if I manually replace the escaped characters in the generated data with the original Cyrillic words, then it works:

String str = String::fromUTF8(BinaryData::localization_ru_xml);
xml = parseXML(str);

Looks like JUCE's BinaryData generation does not support different encodings, and I have to create the embedded files on my own…

How are you actually printing/viewing these string values? Are you using DBG, or std::cout, or drawing the strings into a Component, or something else?

I just tried encoding the xml file into binary data and then parsing it:

const auto xml = juce::XmlDocument::parse (juce::CharPointer_UTF8 (BinaryData::thefile_txt));

Then, I’m able to loop through the elements and print their strings with no issues:

for (const auto* element : xml->getChildIterator())
    DBG (element->getAllSubText());

I tried this on mac and windows, and it seemed to work correctly in both cases. Perhaps the method you are using to print the strings is not using the correct encoding.

I don’t think that’s true. The BinaryData generator just reads the bytes it is given and converts them to a C++ source file representing arrays of bytes. It’s up to the program reading bytes from those arrays to interpret and display the bytes using the correct encoding.
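
To illustrate the point, here is a simplified stand-in for that kind of generator. It is only a sketch, not JUCE's actual implementation, and escapeBytesForLiteral is a hypothetical name. It shows that the output depends only on the byte values, never on what encoding those bytes were meant to be in:

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical, simplified stand-in for a binary-data generator (not JUCE's
// real code): turns raw bytes into the body of a C++ string literal.
// Caveat: a real generator must also stop a hex escape from swallowing a
// following hex digit (e.g. "\xe0" followed by 'a'); this sketch does not.
std::string escapeBytesForLiteral(const std::string& bytes)
{
    std::string out;

    for (unsigned char b : bytes)
    {
        if (b == '"' || b == '\\')
        {
            out += '\\';                        // escape quotes and backslashes
            out += static_cast<char>(b);
        }
        else if (b == '\n')
        {
            out += "\\n";
        }
        else if (b >= 0x20 && b < 0x7F)
        {
            out += static_cast<char>(b);        // printable ASCII is kept as-is
        }
        else
        {
            // Anything else is written as a hex escape of the raw byte value.
            char buf[8];
            const int written = std::snprintf(buf, sizeof buf, "\\x%02x",
                                              static_cast<unsigned>(b));
            assert(written == 4);
            out += buf;
        }
    }

    return out;
}
```

Given the Windows-1251 bytes 0xCE 0xF2 0xEC, the sketch emits \xce\xf2\xec, the same shape as the escapes in the generated file quoted at the top of the thread. Given a UTF-8 file it would emit the UTF-8 bytes instead, so the reading side has to know which encoding to assume.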

I was going to point out that the binary data generator doesn’t read the XML file’s encoding property so just changing that property won’t affect how the embedded data is generated.

Looked to me like the file you provided was already UTF-8 encoded, and I also had no trouble getting it to display properly in a label.

@ImJimmi @reuk Thank you guys.
xml = juce::parseXML(juce::CharPointer_UTF8(BinaryData::localization_ru_xml));
works.
I think the key point was to use CharPointer_UTF8 when parsing. I had only been applying it after the XML was already loaded.

Now everything is fine.

Thank you again.