XML Special Characters

Hi,

There is something I don't understand here; I'm surely missing something.

I want to save an XML file containing special characters (æ, ø, and so on).
I'm using "UTF-8" encoding, which in theory supports them.

But when the XmlElement class writes the file, it "escapes" all of those characters.
I expected the file to contain them unescaped. Is that possible?

Example:
If I execute this:

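// Build a root element with one child
XmlElement elementList("RootTree");
XmlElement* element = new XmlElement("Element1");

// Text containing non-ASCII characters
element->addTextElement(String("aeiouåøæñ"));

// elementList takes ownership of element
elementList.addChildElement(element);

// Write with no DTD and a UTF-8 encoding declaration
elementList.writeToFile( File::getCurrentWorkingDirectory().getChildFile ("test.xml"), "", "UTF-8");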

I get this (in the test.xml file):

<?xml version="1.0" encoding="UTF-8"?>

<RootTree>
  <Element1 name1="value1">&#241;&#241;p&#229;p&#248;</Element1>
</RootTree>

But I expected something like this:

<?xml version="1.0" encoding="UTF-8"?>

<RootTree>
  <Element1 name1="value1">aeiouåøæñ</Element1>
</RootTree>

Thanks!

 

String("aeiouåøæñ")

<sigh> I see this kind of thing so often... It's not safe to embed Unicode characters in C++ source files: how could the compiler know which encoding to read your source file in, or which encoding to use when converting those characters into an 8-bit const char*? Always use the Introjucer's UTF8 string literal creator tool to generate safe string literals.
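For anyone wondering what a "safe" literal looks like, here is a minimal sketch in plain standard C++ (no JUCE needed to compile it). The non-ASCII characters are written as explicit UTF-8 byte escapes, so the bytes no longer depend on how the compiler decodes the source file. The function name sampleTextUtf8 is made up for illustration.

```cpp
#include <string>

// "aeiouåøæñ" spelled out as UTF-8 bytes:
//   å = C3 A5, ø = C3 B8, æ = C3 A6, ñ = C3 B1
// Hypothetical helper name, used only for this illustration.
std::string sampleTextUtf8()
{
    return "aeiou\xc3\xa5\xc3\xb8\xc3\xa6\xc3\xb1";
}
```

In JUCE, a literal like this would typically be handed to String::fromUTF8 or wrapped in CharPointer_UTF8 so that the String knows the bytes are UTF-8.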

And yes, the XML stuff escapes all the characters down to ascii, to make sure that it can be read by any reader without problems. That's intentional, and the result will be correct once it has been loaded.
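To make the "ASCII + escapes" idea concrete, here is a rough sketch of that kind of escaping in standard C++. It is not JUCE's actual implementation: escapeToAscii is a hypothetical helper, it assumes well-formed UTF-8 input, and it only handles the handful of cases shown.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not JUCE code): converts UTF-8 text into pure-ASCII
// XML text, writing every code point above 127 as a decimal character
// reference such as &#241;. Assumes the input is well-formed UTF-8.
std::string escapeToAscii (const std::string& utf8)
{
    std::string out;

    for (std::size_t i = 0; i < utf8.size();)
    {
        const auto lead = static_cast<unsigned char> (utf8[i]);
        std::uint32_t codePoint = lead;
        std::size_t length = 1;

        // The lead byte tells us how many bytes make up this code point.
        if      (lead >= 0xF0) { codePoint = lead & 0x07; length = 4; }
        else if (lead >= 0xE0) { codePoint = lead & 0x0F; length = 3; }
        else if (lead >= 0xC0) { codePoint = lead & 0x1F; length = 2; }

        // Each continuation byte contributes six more bits.
        for (std::size_t k = 1; k < length && i + k < utf8.size(); ++k)
            codePoint = (codePoint << 6)
                      | (static_cast<unsigned char> (utf8[i + k]) & 0x3F);

        i += length;

        switch (codePoint)
        {
            case '&': out += "&amp;"; break;
            case '<': out += "&lt;";  break;
            case '>': out += "&gt;";  break;
            default:
                if (codePoint < 128)
                    out += static_cast<char> (codePoint);
                else
                    out += "&#" + std::to_string (codePoint) + ";";
        }
    }

    return out;
}
```

With this sketch, ñ (U+00F1) comes out as &#241;, which is the same form that appears in the test.xml output earlier in the thread.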

Yes, you are completely right. That was just a sample to explain the behaviour. In my app, the strings come from the file system, where some folder names contain those characters (it's user data, basically). Thanks for the explanation anyway.

BTW, if I understand correctly, there is not a way to unescape those chars using juce, right?
 

there is not a way to unescape those chars using juce, right?

If you mean "generate raw XML output in which the strings do not use escaping" then there is no way to do that.

If you mean "unescape those strings when reading the XML" then of course they'll get unescaped automatically by the parser.
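As an illustration of what that unescaping step amounts to, here is a small standard C++ sketch that turns decimal character references back into UTF-8. It is not JUCE's parser: unescapeDecimalRefs and appendUtf8 are hypothetical names, only the decimal &#N; form is handled, and the code point is assumed to be a valid Unicode value.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Appends one Unicode code point to `out` as UTF-8 bytes.
// Assumes cp is a valid code point (at most 0x10FFFF).
void appendUtf8 (std::string& out, std::uint32_t cp)
{
    if (cp < 0x80)
    {
        out += static_cast<char> (cp);
    }
    else if (cp < 0x800)
    {
        out += static_cast<char> (0xC0 | (cp >> 6));
        out += static_cast<char> (0x80 | (cp & 0x3F));
    }
    else if (cp < 0x10000)
    {
        out += static_cast<char> (0xE0 | (cp >> 12));
        out += static_cast<char> (0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char> (0x80 | (cp & 0x3F));
    }
    else
    {
        out += static_cast<char> (0xF0 | (cp >> 18));
        out += static_cast<char> (0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char> (0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char> (0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper (not JUCE code): replaces decimal references such as
// &#241; with the UTF-8 bytes of that code point. Everything else, including
// malformed references, is copied through unchanged.
std::string unescapeDecimalRefs (const std::string& text)
{
    std::string out;

    for (std::size_t i = 0; i < text.size();)
    {
        if (text.compare (i, 2, "&#") == 0)
        {
            std::size_t j = i + 2;
            std::uint32_t cp = 0;

            // Accumulate the decimal digits of the code point.
            while (j < text.size() && text[j] >= '0' && text[j] <= '9')
                cp = cp * 10 + static_cast<std::uint32_t> (text[j++] - '0');

            // Accept only "&#<digits>;" with at least one digit.
            if (j > i + 2 && j < text.size() && text[j] == ';')
            {
                appendUtf8 (out, cp);
                i = j + 1;
                continue;
            }
        }

        out += text[i++];
    }

    return out;
}
```

Fed the element text from the test.xml output above, this sketch decodes &#241;&#241;p&#229;p&#248; to the UTF-8 bytes for ññpåpø.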

Ok, thanks!

Can't you save the XML file as UTF-8?

I would have expected the same result as ivanslo did, namely an XML file encoded as UTF-8 with human-readable characters, not character codes.

It doesn't really make sense to me to use a character encoding such as UTF-8, which was mainly invented to allow storing and displaying all sorts of characters from non-English languages, and end up with a file that's hardly readable!

After all, it's hard to expect anyone to have a unicode translator at hand just to enter a file name like Größenmaßstäbe.txt in an xml-settings file. Or Möbelträgerfüße.doc. Let alone to be able to read such a file correctly.
 


Producing human-friendly output was never the intention of the XML classes - I designed them for storing and retrieving data without any mistakes, and IMHO the only foolproof way to do that is to only use ascii + escapes.

Hi jules.

ascii + escapes.

Could this be made optional?

This threw me for a loop as well. XML files are most commonly seen in the wild with the UTF-8 character encoding declared in the header, so I did not expect the XmlElement’s String export methods (previously createDocument, now toString) to automatically reduce the text encoding down to ASCII.

I understand Jules’ desire to make the storage and retrieval of data foolproof, but I think it’s a mistake not to mention in the docs the ASCII escaping of Unicode characters as numeric character references.

For others who might be grappling with the same issue, I’ll mention that a String generated by XmlElement::getAllSubText does NOT have its characters escaped down to ASCII. So you can at least grab individual UTF-8 strings from an XmlElement that way…

I must say this behaviour is most inconvenient. There’s a way to store ValueTrees as binary and a way to store them as text, except that the text is scrambled with escapes, even though proper Unicode handling should work perfectly well. I understand Unicode often introduces other problems, but for store/recall it should be fine, and escaping should at least be optional.